GPTBot ignoring robots.txt on my Cloudflare Worker

3•white_viel•2h ago

TL;DR: GPTBot is systematically accessing my private Ubuntu mirror, ignoring robots.txt.

This morning I woke up to the following message from Cloudflare about my quota usage on Cloudflare Workers:

>> Your account has reached 75% of its daily requests limit for Cloudflare Workers and/or Pages Functions

This is unusual, as I only have one worker on my Cloudflare account, which proxies the apt repos for my personal PC to specific upstream services. Although the domain is public, it is not posted anywhere and is only used by my home PCs.

So I pulled the Cloudflare Worker logs and saw about 160k requests in the last 24 hours, up from barely 24 (yes, 24 in total), for various packages via my proxy.

An extract of the logs is below:

>> {
>>   "headers": {
>>     "accept": "/",
>>     "accept-encoding": "gzip, br",
>>     "cf-connecting-ip": "74.7.227.53",
>>     "cf-ipcountry": "US",
>>     "cf-ray": "9d388b074b38d3be",
>>     "cf-visitor": "{\"scheme\":\"https\"}",
>>     "connection": "Keep-Alive",
>>     "from": "gptbot(at)openai.com",
>>     "host": "XXXXXXXXXXXXXXXXX.brotich.workers.dev",
>>     "referer": "https://XXXXXXXXXXXXXXXXX.brotich.workers.dev/ubuntu/pool/universe/z/zephyr/",
>>     "user-agent": "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.3; +https://openai.com/gptbot)",
>>     "x-forwarded-proto": "https",
>>     "x-openai-host-hash": "103003167",
>>     "x-real-ip": "74.7.227.53"
>>   }
>> }

As you can see, the request is from GPTBot, which collects training data.
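
To see which client accounts for the bulk of the traffic, the log records can be tallied by user agent. This is a minimal sketch, assuming each record is a dict shaped like the excerpt above (a "headers" mapping with a "user-agent" key). The `records` sample, the apt user agent string, and the `tally_user_agents` helper are hypothetical and not part of Cloudflare's tooling.

```python
from collections import Counter

# User agent string copied from the log excerpt.
GPTBOT_UA = ("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
             "compatible; GPTBot/1.3; +https://openai.com/gptbot)")


def tally_user_agents(records):
    """Count requests per self-declared user agent."""
    counts = Counter()
    for record in records:
        headers = record.get("headers", {})
        counts[headers.get("user-agent", "<missing>")] += 1
    return counts


# Hypothetical sample: two GPTBot requests and one apt client request.
records = [
    {"headers": {"user-agent": GPTBOT_UA}},
    {"headers": {"user-agent": GPTBOT_UA}},
    {"headers": {"user-agent": "Debian APT-HTTP/1.3 (2.4.11)"}},
]

counts = tally_user_agents(records)
for agent, n in counts.most_common():
    print(n, agent)
```

The user-agent header is self-declared, so a tally like this only shows what each client claims to be.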

Now the annoying bit: according to OpenAI, they respect robots.txt. I have this set up on my domain as follows:

>>> # BEGIN Cloudflare Managed content
>>>
>>> User-agent: *
>>> Content-Signal: search=yes,ai-train=no
>>> Allow: /
>>>
>>> User-agent: Amazonbot
>>> Disallow: /
>>>
>>> User-agent: Applebot-Extended
>>> Disallow: /
>>>
>>> User-agent: Bytespider
>>> Disallow: /
>>>
>>> User-agent: CCBot
>>> Disallow: /
>>>
>>> User-agent: ClaudeBot
>>> Disallow: /
>>>
>>> User-agent: Google-Extended
>>> Disallow: /
>>>
>>> User-agent: GPTBot
>>> Disallow: /
>>>
>>> User-agent: meta-externalagent
>>> Disallow: /
>>>
>>> # END Cloudflare Managed Content
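
As a sanity check that these rules say what I think they say, they can be fed to a standard robots.txt parser. This is a minimal sketch using Python's `urllib.robotparser` on an abridged copy of the rules above. The host in `url` is a placeholder, and `SomeOtherBot` is a made-up agent name used only for comparison.

```python
from urllib.robotparser import RobotFileParser

# Abridged copy of the rules quoted above.
ROBOTS_TXT = """\
User-agent: *
Content-Signal: search=yes,ai-train=no
Allow: /

User-agent: GPTBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Placeholder host; the path is the one seen in the logged referer.
url = "https://example.workers.dev/ubuntu/pool/universe/z/zephyr/"

# can_fetch() takes the crawler's product token, not the full UA string.
gptbot_allowed = parser.can_fetch("GPTBot", url)
other_allowed = parser.can_fetch("SomeOtherBot", url)

print("GPTBot allowed:", gptbot_allowed)
print("SomeOtherBot allowed:", other_allowed)
```

Python's parser skips directives it does not recognise, such as Content-Signal, so the result here depends only on the User-agent, Allow, and Disallow lines. Other parsers may behave differently.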

This is just a hobby project, and I have put safeguards in place on Cloudflare to prevent scraping by bots. There is nothing of value in there; it's just a proxy for my own use.

Why say you respect robots.txt if you don't?
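
A safeguard of this kind could, for example, refuse requests whose user agent contains a token that robots.txt already disallows. The sketch below shows only the decision logic, written in Python for illustration. `BLOCKED_TOKENS` and `status_for` are hypothetical names, and a real Worker would apply the same check in its own runtime before proxying upstream.

```python
# A subset of the user agents disallowed in the robots.txt above.
BLOCKED_TOKENS = (
    "Amazonbot",
    "Bytespider",
    "CCBot",
    "ClaudeBot",
    "GPTBot",
    "meta-externalagent",
)


def status_for(headers):
    """Return 403 for a blocked crawler, else 200 as a stand-in for 'proxy upstream'."""
    user_agent = headers.get("user-agent", "").lower()
    if any(token.lower() in user_agent for token in BLOCKED_TOKENS):
        return 403
    return 200


gptbot_headers = {
    "user-agent": "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
                  "compatible; GPTBot/1.3; +https://openai.com/gptbot)"
}
print(status_for(gptbot_headers))
```

This relies on the self-declared user agent, so it cannot stop a client that lies about its identity, and a 403 only helps if the client backs off when it sees one.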

Comments

white_viel•1h ago

I will be serving a zip bomb to the bot to see if it stays away from my proxy.

white_viel•25m ago

I served a zip bomb, and after 10 minutes the traffic from GPTBot disappeared.

white_viel•1m ago

Update: the bot is back with a vengeance, sending about 1 request per second and ignoring both robots.txt and the 403 status code.
