This morning I woke up to the following message from Cloudflare about my quota usage on Cloudflare Workers:
>> Your account has reached 75% of its daily requests limit for Cloudflare Workers and/or Pages Functions
This is unusual, as I only have one worker on my Cloudflare account; it proxies the apt repos for my personal PCs to specific upstream services. Although the domain is public, it is not posted anywhere and is only used by my home PCs.
So I pulled the Cloudflare Worker logs and saw about 160k requests in the last 24 hours, up from barely 24 (yes, 24 in total), for various packages via my proxy.
An extract from the logs is below:
>> {
>>   "headers": {
>>     "accept": "/",
>>     "accept-encoding": "gzip, br",
>>     "cf-connecting-ip": "74.7.227.53",
>>     "cf-ipcountry": "US",
>>     "cf-ray": "9d388b074b38d3be",
>>     "cf-visitor": "{"scheme":"https"}",
>>     "connection": "Keep-Alive",
>>     "from": "gptbot(at)openai.com",
>>     "host": "XXXXXXXXXXXXXXXXX.brotich.workers.dev",
>>     "referer": "https://XXXXXXXXXXXXXXXXX.brotich.workers.dev/ubuntu/pool/universe/z/zephyr/",
>>     "user-agent": "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.3; +https://openai.com/gptbot)",
>>     "x-forwarded-proto": "https",
>>     "x-openai-host-hash": "103003167",
>>     "x-real-ip": "74.7.227.53"
>>   }
>> }
As you can see, the request is from GPTBot, which collects training data.
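For anyone who wants to do the same breakdown on their own logs, here is a rough sketch of tallying requests per crawler. It assumes the logs were exported as one JSON record per line with a "headers" object shaped like the extract above; that export format, the function names, the token list and the apt user-agent in the sample are my own placeholders, not anything Cloudflare prescribes.

```python
import json
from collections import Counter

# Substrings that identify crawlers in a User-Agent header.
# Illustrative list only; extend it with whatever shows up in your logs.
CRAWLER_TOKENS = ("GPTBot", "ClaudeBot", "CCBot", "Bytespider", "Amazonbot")


def crawler_name(headers):
    """Return the first known crawler token found in the User-Agent, else 'other'."""
    user_agent = headers.get("user-agent", "").lower()
    for token in CRAWLER_TOKENS:
        if token.lower() in user_agent:
            return token
    return "other"


def tally_requests(log_lines):
    """Count requests per crawler from an iterable of JSON strings (one record each)."""
    counts = Counter()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the export
        record = json.loads(line)
        counts[crawler_name(record.get("headers", {}))] += 1
    return counts


if __name__ == "__main__":
    # Two made-up records: one with the GPTBot user-agent from the extract,
    # one with a hypothetical apt client user-agent.
    sample = [
        json.dumps({"headers": {"user-agent": "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.3; +https://openai.com/gptbot)"}}),
        json.dumps({"headers": {"user-agent": "Debian APT-HTTP/1.3 (2.7.14)"}}),
    ]
    print(tally_requests(sample))
```

Matching on the user-agent alone only tells you what the client claims to be, so treat the counts as a first pass rather than proof of who sent the traffic.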
Now the annoying bit: according to OpenAI, they respect robots.txt. I have this set up on my domain as follows:
>>> # BEGIN Cloudflare Managed content
>>>
>>> User-agent: *
>>> Content-Signal: search=yes,ai-train=no
>>> Allow: /
>>>
>>> User-agent: Amazonbot
>>> Disallow: /
>>>
>>> User-agent: Applebot-Extended
>>> Disallow: /
>>>
>>> User-agent: Bytespider
>>> Disallow: /
>>>
>>> User-agent: CCBot
>>> Disallow: /
>>>
>>> User-agent: ClaudeBot
>>> Disallow: /
>>>
>>> User-agent: Google-Extended
>>> Disallow: /
>>>
>>> User-agent: GPTBot
>>> Disallow: /
>>>
>>> User-agent: meta-externalagent
>>> Disallow: /
>>>
>>> # END Cloudflare Managed Content
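As a sanity check that the file says what I think it says, here is a small sketch using Python's standard urllib.robotparser against a trimmed copy of the rules above. The example.workers.dev URL and the is_allowed helper are placeholders of mine, and the parser just skips the Content-Signal line since it is not a directive it knows about.

```python
from urllib.robotparser import RobotFileParser

# Trimmed copy of the rules above: the wildcard group plus the GPTBot group.
ROBOTS_TXT = """\
# BEGIN Cloudflare Managed content

User-agent: *
Content-Signal: search=yes,ai-train=no
Allow: /

User-agent: GPTBot
Disallow: /

# END Cloudflare Managed Content
"""

# Placeholder host; the path mirrors the one in the log extract.
EXAMPLE_URL = "https://example.workers.dev/ubuntu/pool/universe/z/zephyr/"


def is_allowed(robots_txt, agent_token, url):
    """Return True if robots_txt permits agent_token to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent_token, url)


if __name__ == "__main__":
    # Pass the product token ("GPTBot"), not the full User-Agent string:
    # the parser only looks at the part before the first "/".
    print("GPTBot allowed:", is_allowed(ROBOTS_TXT, "GPTBot", EXAMPLE_URL))
    print("other client allowed:", is_allowed(ROBOTS_TXT, "SomeOtherClient", EXAMPLE_URL))
```

With these rules the GPTBot group disallows every path, while any agent without its own group falls back to the wildcard group and is allowed.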
This is just a hobby project, and I have put safeguards in place on Cloudflare to prevent scraping by bots. There is nothing of value in there. It's just a proxy for my own use.
Why say you respect robots.txt if you don't?
white_viel•1h ago