It has long been common for scrapers to adopt the header patterns of search engine crawlers to hide in logs and bypass simple filters. The logical next step is for smaller AI players to present themselves as the largest players in the space.
Some search engines provide a list of their scraper IP ranges specifically so you can verify if scraper activity is really them or an imitator.
EDIT: Thanks to the comment below for looking this up and confirming this IP matches OpenAI’s range.
https://openai.com/searchbot.json
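For what it's worth, checking a hit against that list is cheap. A rough sketch, assuming searchbot.json follows the same shape as Google's googlebot.json (a "prefixes" array of ipv4Prefix/ipv6Prefix entries), using grepcidr as just one of several ways to do the CIDR match, and with 203.0.113.7 as a placeholder client IP:
# dump the published CIDR ranges
$ curl -s https://openai.com/searchbot.json | jq -r '.prefixes[] | .ipv4Prefix // .ipv6Prefix' > openai-searchbot-ranges.txt
# print the client IP only if it falls inside one of those ranges
$ echo 203.0.113.7 | grepcidr -f openai-searchbot-ranges.txt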
I don't know if imitating a major crawler is really worth it. It may work against very naive filters, but it's easy to definitively check whether a client is faking, so it mostly hands ammo to the more advanced filters that do check.
$ curl -I https://www.cloudflare.com
HTTP/2 200
$ curl -I -H "User-Agent: Googlebot" https://www.cloudflare.com
HTTP/2 403
https://www.merklemap.com/search?query=ycombinator.com&page=...
Entries are indexed by subdomain instead of by certificate (click an entry to see all certificates for that subdomain).
Also, you can search for any substring (that was quite the journey to implement so it's fast enough across almost 5B entries).
P.S. In the hopes of making this more than just a sarcastic comment, the question of "How do people bootstrap knowledge?" is kind of interesting. [1]
> To tackle a hard problem, it is often wise to reuse and recombine existing knowledge. Such an ability to bootstrap enables us to grow rich mental concepts despite limited cognitive resources. Here we present a computational model of conceptual bootstrapping. This model uses a dynamic conceptual repertoire that can cache and later reuse elements of earlier insights in principled ways, modelling learning as a series of compositional generalizations. This model predicts systematically different learned concepts when the same evidence is processed in different orders, without any extra assumptions about previous beliefs or background knowledge. Across four behavioural experiments (total n = 570), we demonstrate strong curriculum-order and conceptual garden-pathing effects that closely resemble our model predictions and differ from those of alternative accounts. Taken together, this work offers a computational account of how past experiences shape future conceptual discoveries and showcases the importance of curriculum design in human inductive concept inferences.
Certificate transparency logs are intended to be consumed by others. That is indeed what is happening. Not interesting.
It's certainly news to me, and presumably some others, that this exists.
If certificate transparency is new to you, I feel like there are significantly more interesting articles and conversations that could/should have been submitted instead of "A public log intended for consumption exists, and a company is consuming that log". This post would do literally nothing to enlighten you about CT logs.
If the fact that OpenAI is scraping certificate transparency logs is new and interesting to you, I'd love to know why it is interesting. Perhaps I'm missing something.
Way more interesting reads for people unfamiliar with what certificate transparency is, in my opinion, than this "OpenAI read my CT log" post:
https://googlechrome.github.io/CertificateTransparency/log_s...
Oh, I read this as indicating OpenAI may make a move into the security space.
Gross.
Perhaps you should save your "gross" judgement for when you better understand what's happening?
For example:
curl https://tuscolo2026h1.skylight.geomys.org/tile/names/000 | gunzip
(It doesn't deduplicate if the same domain name appears in multiple certificates, but it's still a substantial reduction in bandwidth compared to serving the entire (pre)certificate.)
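If you want a deduplicated list client-side, that's just a sort away (this assumes, as above, that a tile is simply a gzip-compressed, newline-separated list of names):
# fetch one tile, decompress, and collapse duplicate names
$ curl -s https://tuscolo2026h1.skylight.geomys.org/tile/names/000 | gunzip | sort -u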
I run a web server and so see a lot of scrapers. OpenAI is one of the ones that appear to respect the limits you set; a lot of (if not most) others don't even clear that ethical bar. So I wouldn't say "OpenAI scrapes everything they can access. Everything" without qualification, since that doesn't seem to be true, at least not until someone puts a file behind a robots.txt disallow rule and finds that ChatGPT (or another of OpenAI's products) has knowledge of it.
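For anyone who wants to run that experiment, the opt-out OpenAI documents is robots.txt-based. A minimal version might look like this (GPTBot is the training crawler; OAI-SearchBot and ChatGPT-User are separate user agents you'd have to list the same way):
$ cat robots.txt
# keep OpenAI's training crawler out of everything
User-agent: GPTBot
Disallow: /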
Then all they know is the main domain, and you can somewhat hide in obscurity.
https://letsencrypt.org/2025/12/02/from-90-to-45#making-auto...
Also, may I know which DNS provider you went with? The GitHub issue pages for some of the DNS provider plugins suggest that some are maintained more actively than others.
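For anyone unfamiliar with the DNS-01 flow being discussed, a plugin run looks roughly like this; Cloudflare is only an example, and each provider plugin has its own flag and credentials-file format:
# request a cert covering the apex and a wildcard via the DNS-01 challenge
$ certbot certonly \
    --dns-cloudflare \
    --dns-cloudflare-credentials ~/.secrets/certbot/cloudflare.ini \
    -d example.com -d '*.example.com'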
The whole purpose of this data is to be consumed by 3rd-parties.
But what GP probably meant is that OAI definitely uses this log to get a list of new websites to scrape later. This is a pretty standard way to use CT logs: you get a list of domains to scrape instead of relying solely on hyperlinks.
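A rough sketch of that workflow, using crt.sh as the query interface (the tile endpoint mentioned above works too; %25 is just a URL-encoded % wildcard, and ycombinator.com is only an example):
# pull every name crt.sh has seen for the domain and dedupe it into a crawl list
$ curl -s 'https://crt.sh/?q=%25.ycombinator.com&output=json' | jq -r '.[].name_value' | sort -u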