frontpage.

Start all of your commands with a comma

https://rhodesmill.org/brandon/2009/commands-with-comma/
163•theblazehen•2d ago•47 comments

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

https://openciv3.org/
674•klaussilveira•14h ago•202 comments

The Waymo World Model

https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-frontier-for-autonomous-driving-simula...
950•xnx•20h ago•552 comments

How we made geo joins 400× faster with H3 indexes

https://floedb.ai/blog/how-we-made-geo-joins-400-faster-with-h3-indexes
123•matheusalmeida•2d ago•33 comments

Jeffrey Snover: "Welcome to the Room"

https://www.jsnover.com/blog/2026/02/01/welcome-to-the-room/
22•kaonwarb•3d ago•19 comments

Unseen Footage of Atari Battlezone Arcade Cabinet Production

https://arcadeblogger.com/2026/02/02/unseen-footage-of-atari-battlezone-cabinet-production/
58•videotopia•4d ago•2 comments

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

https://github.com/valdanylchuk/breezydemo
232•isitcontent•14h ago•25 comments

Monty: A minimal, secure Python interpreter written in Rust for use by AI

https://github.com/pydantic/monty
225•dmpetrov•15h ago•118 comments

Show HN: I spent 4 years building a UI design tool with only the features I use

https://vecti.com
332•vecti•16h ago•145 comments

Hackers (1995) Animated Experience

https://hackers-1995.vercel.app/
495•todsacerdoti•22h ago•243 comments

Sheldon Brown's Bicycle Technical Info

https://www.sheldonbrown.com/
383•ostacke•20h ago•95 comments

Microsoft open-sources LiteBox, a security-focused library OS

https://github.com/microsoft/litebox
360•aktau•21h ago•182 comments

Show HN: If you lose your memory, how to regain access to your computer?

https://eljojo.github.io/rememory/
289•eljojo•17h ago•175 comments

An Update on Heroku

https://www.heroku.com/blog/an-update-on-heroku/
413•lstoll•21h ago•279 comments

Vocal Guide – belt sing without killing yourself

https://jesperordrup.github.io/vocal-guide/
32•jesperordrup•4h ago•16 comments

Was Benoit Mandelbrot a hedgehog or a fox?

https://arxiv.org/abs/2602.01122
20•bikenaga•3d ago•8 comments

Where did all the starships go?

https://www.datawrapper.de/blog/science-fiction-decline
17•speckx•3d ago•7 comments

PC Floppy Copy Protection: Vault Prolok

https://martypc.blogspot.com/2024/09/pc-floppy-copy-protection-vault-prolok.html
63•kmm•5d ago•7 comments

Dark Alley Mathematics

https://blog.szczepan.org/blog/three-points/
91•quibono•4d ago•21 comments

How to effectively write quality code with AI

https://heidenstedt.org/posts/2026/how-to-effectively-write-quality-code-with-ai/
258•i5heu•17h ago•196 comments

Delimited Continuations vs. Lwt for Threads

https://mirageos.org/blog/delimcc-vs-lwt
32•romes•4d ago•3 comments

What Is Ruliology?

https://writings.stephenwolfram.com/2026/01/what-is-ruliology/
44•helloplanets•4d ago•42 comments

Introducing the Developer Knowledge API and MCP Server

https://developers.googleblog.com/introducing-the-developer-knowledge-api-and-mcp-server/
60•gfortaine•12h ago•26 comments

I now assume that all ads on Apple news are scams

https://kirkville.com/i-now-assume-that-all-ads-on-apple-news-are-scams/
1070•cdrnsf•1d ago•446 comments

Female Asian Elephant Calf Born at the Smithsonian National Zoo

https://www.si.edu/newsdesk/releases/female-asian-elephant-calf-born-smithsonians-national-zoo-an...
36•gmays•9h ago•12 comments

I spent 5 years in DevOps – Solutions engineering gave me what I was missing

https://infisical.com/blog/devops-to-solutions-engineering
150•vmatsiiako•19h ago•70 comments

Understanding Neural Network, Visually

https://visualrambling.space/neural-network/
288•surprisetalk•3d ago•43 comments

Why I Joined OpenAI

https://www.brendangregg.com/blog/2026-02-07/why-i-joined-openai.html
150•SerCe•10h ago•142 comments

Learning from context is harder than we thought

https://hy.tencent.com/research/100025?langVersion=en
186•limoce•3d ago•100 comments

Show HN: R3forth, a ColorForth-inspired language with a tiny VM

https://github.com/phreda4/r3
73•phreda4•14h ago•14 comments

It seems that OpenAI is scraping [certificate transparency] logs

https://benjojo.co.uk/u/benjojo/h/Gxy2qrCkn1Y327Y6D3
215•pavel_lishin•1mo ago

Comments

drwhyandhow•1mo ago
This has long been the case! I think their whole business model is based off scraping lol
Aurornis•1mo ago
This could be OpenAI, or it could be another company using their header pattern.

It has long been common for scrapers to adopt the header patterns of search engine crawlers to hide in logs and bypass simple filters. The logical next step is for smaller AI players to present themselves as the largest players in the space.

Some search engines provide a list of their scraper IP ranges specifically so you can verify if scraper activity is really them or an imitator.

EDIT: Thanks to the comment below for looking this up and confirming this IP matches OpenAI’s range.

jsheard•1mo ago
In this case it is actually OpenAI; the IP (74.7.175.182) is in one of their published ranges (74.7.175.128/25).

https://openai.com/searchbot.json

I don't know if imitating a major crawler is really worth it. It may work against very naive filters, but it's easy to definitively check whether a client is faking it, so it's just handing ammo to the more advanced filters that do check:

  $ curl -I https://www.cloudflare.com
  HTTP/2 200

  $ curl -I -H "User-Agent: Googlebot" https://www.cloudflare.com
  HTTP/2 403
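
And for the IP side, you can cross-check an address against the published ranges. A quick sketch, assuming jq and grepcidr are installed and that searchbot.json follows the same prefixes/ipv4Prefix schema as Google's bot JSON files:

  $ curl -s https://openai.com/searchbot.json \
      | jq -r '.prefixes[].ipv4Prefix // empty' > openai-ranges.txt

  $ echo 74.7.175.182 | grepcidr -f openai-ranges.txt
  74.7.175.182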
Aurornis•1mo ago
Thanks for looking it up!
btown•1mo ago
I don't have a statistic here, but I'm always surprised by how many websites I come across that do limited user-agent and origin/referrer checks, but don't maintain any kind of active IP-based tracking. If you're trying to build a site-specific scraper and are getting blocked, mimicking headers is an easy and often sufficient first step.
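
Something as simple as this is often enough (illustrative only; example.com is a stand-in):

  $ curl -s https://example.com/some/page -o page.html \
      -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36' \
      -H 'Referer: https://example.com/'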
xyzzy_plugh•1mo ago
If you can't tell the difference between active tracking and inspecting request headers, then it's worth committing a bit of time to ponder. Particularly the costs associated with IP tracking at scale.
ccgreg•1mo ago
> Some search engines provide a list of their scraper IP ranges

Common Crawl's CCBot has published IP ranges. We aren't a search engine (although there are search engines using our data) and we like to describe our crawler as a crawler, not a "scraper".

bobsmooth•1mo ago
>The logical next step is for smaller AI players to present themselves as the largest players in the space.

We think we're so different from animals https://en.wikipedia.org/wiki/Mimicry

827a•1mo ago
Thousands of systems, from Google to script kiddies to OpenAI to Nigerian call scammers to cybersecurity firms, actively watch the certificate transparency logs for exactly this reason. Yawn.
H8crilA•1mo ago
For those that never looked at the CT logs: https://crt.sh/?q=ycombinator.com

(the site may occasionally fail to load)
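
crt.sh also has a JSON output mode that's handy for scripting; for example, listing the names seen for a domain (a sketch assuming jq is installed):

  $ curl -s 'https://crt.sh/?q=ycombinator.com&output=json' \
      | jq -r '.[].name_value' | sort -u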

Eikon•1mo ago
Shameless plug :)

https://www.merklemap.com/search?query=ycombinator.com&page=...

Entries are indexed by subdomain instead of by certificate (click an entry to see all certificates for that subdomain).

Also, you can search for any substring (that was quite the journey to implement so it's fast enough across almost 5B entries):

https://www.merklemap.com/search?query=ycombi&page=0

chuckadams•1mo ago
Any insights you can share on how you made search so fast? What kind of resources does it take to implement it?
Eikon•1mo ago
Most of merklemap is stored on ZeroFS [0], which lets us scale IO resources quite crazily :)

[0] https://github.com/Barre/ZeroFS

jddj•1mo ago
> Watch Ubuntu boot from ZeroFS

Love it

rendaw•1mo ago
How does ZeroFS handle consistency with writes?
Eikon•1mo ago
If you use 9P or NBD it handles fsync as expected. With NFS, it's time based.

https://github.com/Barre/ZeroFS#9p-recommended-for-better-pe...

rendaw•1mo ago
Oh awesome! I was searching for consistency, but I guess durability is the word used for filesystems. Thanks!
nerdsniper•1mo ago
Thank you!!! Needed exactly this at work.
Eikon•1mo ago
Glad it was helpful!
TacticalCoder•1mo ago
Not 100% related but not 100% unrelated either: I've got a script that generates variations of the domain names I use the most... all the most common typos/misspellings, all the "1337" variations, everything at Levenshtein edit distance 1, quite a few at distance 2, etc.

For example for "lillybank.com", I'll generate:

    llllybank.com
    liliybank.com
    ...
and countless others.

Hundreds of thousands of entries. They are then null-routed by my unbound DNS resolver.

My browsers are forced into "corporate" settings where they cannot use DoH/DoT: it's all, between my browsers and my unbound resolver, in the clear.

All DNS UDP traffic that contains any Unicode domain name is blocked by the firewall. No DNS over TCP is allowed (and, no, I don't care).

I also block entire countries' TLD as well as entire countries' IP blocks.

Been running a setup like that (plus many killfiles, and DNS resolvers known to block all known porn and malware sites, etc.) for years now. The Internet keeps working fine.
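
The unbound side of a setup like that can be as simple as generated local-zone entries. A sketch using the hypothetical variations above (always_nxdomain makes the resolver answer NXDOMAIN for them):

  # unbound.conf (sketch): refuse to resolve the generated lookalikes
  server:
      local-zone: "llllybank.com." always_nxdomain
      local-zone: "liliybank.com." always_nxdomain
      # ...hundreds of thousands more, generated by the script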

rendaw•1mo ago
The first page of results doesn't include ycombinator.com. I get `app.baby-ycombinator.com`, `ycombinator.comchat.com`, and everything in between.

Substring doesn't seem like what I'd want in a subdomain search.

Eikon•1mo ago
> Substring doesn't seem like what I'd want in a subdomain search.

Well, if you want only subdomains, search for *.ycombinator.com.

https://www.merklemap.com/search?query=*.ycombinator.com&pag...

1vuio0pswjnm7•1mo ago
Considering how it must be getting hammered what with the "AI" nonsense, it's interesting how crt.sh continues to remain usable, particularly the (limited) direct PostgreSQL db access.

To me, this is evidence that SQL databases with high traffic can be made directly accessible on the public internet.

crt.sh seems to be more accessible at certain times of the day. I can remember when it had no such accessibility issues.

miki123211•1mo ago
It is not usable.

It's the only website I know of where queries can just randomly fail for no reason, and they don't even have an automatic retry mechanism. Even the worst enterprise nightmares I've seen weren't this user-unfriendly.

pavel_lishin•1mo ago
What's the yawn for?
xpe•1mo ago
Presumably this is well-known among people that already know about this.

P.S. In the hopes of making this more than just a sarcastic comment, the question of "How do people bootstrap knowledge?" is kind of interesting. [1]

> To tackle a hard problem, it is often wise to reuse and recombine existing knowledge. Such an ability to bootstrap enables us to grow rich mental concepts despite limited cognitive resources. Here we present a computational model of conceptual bootstrapping. This model uses a dynamic conceptual repertoire that can cache and later reuse elements of earlier insights in principled ways, modelling learning as a series of compositional generalizations. This model predicts systematically different learned concepts when the same evidence is processed in different orders, without any extra assumptions about previous beliefs or background knowledge. Across four behavioural experiments (total n = 570), we demonstrate strong curriculum-order and conceptual garden-pathing effects that closely resemble our model predictions and differ from those of alternative accounts. Taken together, this work offers a computational account of how past experiences shape future conceptual discoveries and showcases the importance of curriculum design in human inductive concept inferences.

[1]: https://www.nature.com/articles/s41562-023-01719-1

jfindper•1mo ago
It implies that this is boring and not article/post-worthy (which I agree with).

Certificate transparency logs are intended to be consumed by others. That is indeed what is happening. Not interesting.

pavel_lishin•1mo ago
> It implies that this is boring and not article/post-worthy (which I agree with).

It's certainly news to me, and presumably some others, that this exists.

jfindper•1mo ago
Which part is news?

If certificate transparency is new to you, I feel like there are significantly more interesting articles and conversations that could/should have been submitted instead of "A public log intended for consumption exists, and a company is consuming that log". This post would do literally nothing to enlighten you about CT logs.

If the fact that OpenAI is scraping certificate transparency logs is new and interesting to you, I'd love to know why it is interesting. Perhaps I'm missing something.

Way more interesting reads for people unfamiliar with what certificate transparency is, in my opinion, than this "OpenAI read my CT log" post:

https://googlechrome.github.io/CertificateTransparency/log_s...

https://certificate.transparency.dev/

dylan604•1mo ago
> I feel like there are significantly more interesting articles

If this is the article that introduces someone to the concept of certificate transparency, then there's nothing wrong with that. Graciously, you followed through with links to what you consider more interesting; that's more than a lot of commenters do, who just leave a snarky comment for someone who's one of the lucky 10,000 for the day.

TeMPOraL•1mo ago
Yeah, this is the unspoken part about HTTPS: you enable it, you also announce to the entire world you're serving stuff from specific DNS names :).

(Which is why I hate that it's so hard to test things locally without having to get a domain and a certificate. I don't want to buy domain names and announce them publicly for the sake of some random script that needs to offer an HTTP endpoint.)

Modern security is introducing a lot of unexpected couplings into software systems, including coupling to political, social and physical reality, which is surprising if you think in terms of programs you write, which most likely shouldn't have any such relationships.

My favorite example of such unexpected coupling, whose failures are still regularly experienced by users, is wall clock time. If your program touches anything related to certificates, even indirectly, it's suddenly coupled to the actual real-world clock, and your users had better make sure their system time is in sync with the rest of the world, or else things will stop working.

imtringued•1mo ago
You do know that /etc/hosts is a file you can edit, right? You hopefully also know that you can create your own certificate authority or self signed certificates and add them to your CA store.
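
For what it's worth, a minimal sketch of that route (made-up names; the trust-store path shown is Debian-style):

  # mint a self-signed cert with a SAN for a local-only name
  $ openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
      -keyout myservice.key -out myservice.crt \
      -subj "/CN=myservice.internal" \
      -addext "subjectAltName=DNS:myservice.internal"

  # point the name at localhost and trust the cert system-wide
  $ echo "127.0.0.1 myservice.internal" | sudo tee -a /etc/hosts
  $ sudo cp myservice.crt /usr/local/share/ca-certificates/
  $ sudo update-ca-certificates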
TeMPOraL•1mo ago
> You do know that /etc/hosts is a file you can edit, right?

Yes. What does it have to do with HTTPS?

> You hopefully also know that you can create your own certificate authority or self signed certificates and add them to your CA store.

Sorta, kinda. Does it actually work with third-party apps? Does it work with mobile systems? If not, then it's not a valid solution, because it doesn't allow me to run my stuff in my own networks without interfacing with the global Internet and social and political systems backing its cryptographic infrastructure.

JumpCrisscross•1mo ago
> Certificate transparency logs are intended to be consumed by others. That is indeed what is happening. Not interesting

Oh, I read this as indicating OpenAI may make a move into the security space.

prettyblocks•1mo ago
Even if it's just for their internal security initiatives it would make sense given how massive they are. Threat hunting via cert monitoring is very effective.
ozim•1mo ago
But it isn't. The guy posted the fact that they sent a bot to scrape.

That’s not the intended use for CT logs.

moralestapia•1mo ago
Because it's hardly news in its context.
ekr____•1mo ago
With that said, given that (1) pre-certificates in the log are big and (2) lifetimes are shortening and so there will be a lot of duplicates, it seems like it would be good for someone to make a feed that was just new domain names.
827a•1mo ago
These exist for apex domains; the real use-case is subdomains.
Eikon•1mo ago
Merklemap offers that: https://www.merklemap.com/documentation/live-tail
agwa•1mo ago
There's an extension to static-ct-api, currently implemented by Sunlight logs, that provides a feed of just SANs and CNs: https://github.com/FiloSottile/sunlight/blob/main/names-tile...

For example:

  curl https://tuscolo2026h1.skylight.geomys.org/tile/names/000 | gunzip
(It doesn't deduplicate if the same domain name appears in multiple certificates, but it's still a substantial reduction in bandwidth compared to serving the entire (pre)certificate.)
raldi•1mo ago
What reason?
electroly•1mo ago
The CT log tells you about new websites as soon as they come online. Good if you're intending to scrape the web.
1vuio0pswjnm7•1mo ago
"... for exacty this reason."

Needs clarification. What reason

TeMPOraL•1mo ago
Knowing what DNS names are actually used.

EDIT: that's the flip side of supporting HTTPS that's not well-known among developers - by acquiring a legitimate certificate for your service to enable HTTPS, you also announce to the entire world, through a public log, that your service exists.

thesuitonym•1mo ago
I don't really see how this is a flip-side. If you're putting something on the web, presumably you want it to be accessed by others, so this is actually a benefit.

If you didn't want others to access your service, maybe consider putting it in a private space.

aziaziazi•1mo ago
There are uses of HTTPS that don't overlap with "the (public) web".
tbrownaw•1mo ago
All of the internal stuff at $employer uses a private CA. I suspect this is fairly universal at places that aren't super tiny.
TeMPOraL•1mo ago
The problem is a lack of solutions that work at places that are tiny, such as a small company or a household. This is yet another area of the computing ecosystem that forgets there are other use cases for computers than commerce.
1vuio0pswjnm7•1mo ago
s/exacty/exactly

"I minted a new TLS cert and it seems that OpenAI is scraping CT logs for what I assume are things to scrape from, based on the near instant response from this:"

The reason presented by the blog post is "for what I assume are things to scrape from"

Putting aside the "assume" part (see below^1), is this also the reason that the other "systems" are "scraping" CT logs

After OpenAI "scrapes" then what does OpenAI do with the data (readers can guess)

But what about all the other "systems", i.e., parties that may use CT logs. If the logs are public then that's potentially a lot of different parties

Imagine in an age before the internet, telephone subscriber X sets up a new telephone line, the number is listed in a local telephone directory ("the phone book") and X immediately receives a phone call from telephone subscriber Z^2

X then writes an op-ed that suggests Z is using the phone book "for who to call"

This is only interesting if X explains why Z was calling or if the reader can guess why Z was calling

Anyone can use the phone book, anyone can use ICANN DNS, anyone can use CT logs, etc.

Why does someone use these public resources. Online commenter: "To look up names and numbers"

Correct. But that alone is not very interesting. Why are they looking up the names and numbers

1.

We can make assumptions about why someone is using a public resource, i.e., what they will use the data for. But that's all they are: assumptions

With the telephone, X could ask "Why are you calling?"

With the internet, that's not possible.^3 This leads to speculation and assumptions. Online commenters love to speculate, and often draw conclusions without evidence

No one knows _everything_ that OpenAI does with the data it collects except OpenAI employees. The public only knows about what OpenAI chooses to share

Similarly no one knows what OpenAI will do with the data in the future

One could speculate that it's naive to think that, in the long term, data collected by "AI" companies will only be used for "AI"

2. The telephone service also had the notion of "unlisted numbers", but that's another tangent for discussion

3. Hence for example people who do port scans of the IPv4 address space will try to prevent the public from accessing them by restricting access to "researchers", etc. Getting access always involves contacting the people with the scans and explaining what the requester will do with the data. In other words, removing speculation

kccqzy•1mo ago
The Certificate Transparency log is a Google project. They don't need to scrape it; they host all the data. It's one of those projects Google hosts because it thinks it genuinely improves the internet, by reducing certificate authority abuse.
tech234a•1mo ago
The Internet Archive also uses the Certificate Transparency logs; some websites that aren't linked anywhere end up in the Wayback Machine this way: https://archive.org/details/certificate-transparency?tab=abo...
gmerc•1mo ago
Let's prompt inject it
mxlje•1mo ago
So? It's public information and a somewhat easily consumable stream of websites to scrape. If my job was to scrape the entire internet, I'd probably start there, too.
jcims•1mo ago
Given these are trivially forged, presumably they aren't really using a Mac for scraping, right? Just to elicit a 'standard' end user response from the server?

>useragent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36; compatible; OAI-SearchBot/1.3; robots.txt;

benjojo12•1mo ago
The IP address this comes from is in an OpenAI search bot range:

> "ipv4Prefix": "74.7.175.128/25"

from https://openai.com/searchbot.json

Hrun0•1mo ago
Yes, it is very common to change your user agent for web scraping, mainly because there are websites that will block you based on that alone.
snowwrestler•1mo ago
Right. Crawler user agent strings in general tend to include all sorts of legacy stuff for compatibility.

This actually is a well-behaved crawler user agent because it identifies itself at the end.

deathanatos•1mo ago
… the UA is malformed, even.

Makes me want to reconfigure my servers to just drop such traffic. If you can't be arsed to send a well-formed UA, I have doubts that you'll obey other conventions like robots.txt.

throwaway613745•1mo ago
OpenAI is scraping everything that is publicly accessible. Everything.
Aachen•1mo ago
Yet they provide the user agents and IP address ranges they scrape from, and they say they respect robots.txt.

I run a web server and so see a lot of scrapers, and OpenAI is one of the ones that appear to respect the limits you set. A lot of (if not most) others don't even meet that ethical standard. So I'd not say "OpenAI scrapes everything they can access. Everything." without qualification, as it doesn't seem to be true, at least not until someone puts a file behind a robots deny rule and finds that ChatGPT (or another of OpenAI's products) has knowledge of it.

immibis•1mo ago
There's no evidence the barrage of residentially-proxied bot accesses hitting every public website have anything to do with OpenAI, but then again, there's also no evidence they don't.
warkdarrior•1mo ago
So do Google, Microsoft/Bing, Yandex, etc. How else would they make sure their search/chatbot/q&a products are up to date?
_pdp_•1mo ago
I wonder if this can be used to contaminate OpenAI search indexes?
bombcar•1mo ago
If you somewhat want to avoid this, get a wildcard certificate (LE supports them: https://community.letsencrypt.org/t/acme-v2-production-envir...).

Then all they know is the main domain, and you can somewhat hide in obscurity.
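
As a sketch of what that looks like with certbot's manual DNS-01 flow (example.com is a placeholder; the DNS-plugin variants automate the TXT-record step):

  $ certbot certonly --manual --preferred-challenges dns \
      -d example.com -d '*.example.com'
  # certbot prints a TXT record (_acme-challenge.example.com) to create,
  # then validates and writes the cert under /etc/letsencrypt/live/example.com/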

lysace•1mo ago
Unfortunately they are a bit more bothersome to automate (depending on your DNS provider/setup) because of the DNS validation requirement (typically delegated via CNAME).
jsheard•1mo ago
Yep, but next year they intend to launch an alternative DNS challenge which doesn't require changing DNS records with every renewal. Instead you'll create a persistent TXT record containing a public key, and then any ACME client which has the private key can keep requesting new certs forever.

https://letsencrypt.org/2025/12/02/from-90-to-45#making-auto...

Ajedi32•1mo ago
Oh, sweet! I didn't know about this. I have no need of wildcard certs, but this will greatly simplify the process of issuing certificates for internal services behind my local firewall. No need to maintain an acme-dns server; just configure the ACME client, set the DNS record and you're done? Very nice.
8cvor6j844qw_d6•1mo ago
Great to hear; one less API key needed for the DNS records.
cortesoft•1mo ago
If you are using a non-standard DNS provider that doesn’t have integration with certbot or cert-manager or whatever you are using, it is pretty easy to set up an acme-dns server to handle it

https://github.com/joohoi/acme-dns

Reventlov•1mo ago
Also, you can use https://github.com/krtab/agnos if you don't have any API access.
Ajedi32•1mo ago
I hadn't heard of Agnos before, interesting alternative to ACME-DNS.

Looking at the README, is the idea that the certificates get generated on the DNS server itself? Not by the ACME client on each machine that needs a certificate? That seems like a confusing design choice to me. How do you get the certificate back to the web server that actually needs it? Or is the idea that you'd have a single server which acts as both the DNS server and the web server?

ls612•1mo ago
When I set up a wildcard cert for my homelab services it was easy to have Cloudflare give me an API token to do the DNS validation for LE.
vault•1mo ago
Correct, that's what I did with caddy, which is now periodically renewing my wildcard certificate through a DNS-01 challenge.
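
For anyone curious, the Caddyfile side is just a wildcard site block plus a DNS-challenge plugin. A minimal sketch, assuming a custom Caddy build with the Cloudflare DNS module and made-up names:

  # Caddyfile (sketch): wildcard cert via DNS-01, Cloudflare as example provider
  *.home.example.com {
      tls {
          dns cloudflare {env.CLOUDFLARE_API_TOKEN}
      }
      reverse_proxy 127.0.0.1:8080
  }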
8cvor6j844qw_d6•1mo ago
Does Caddy still update automatically with apt if you built a custom Caddy binary for the DNS provider plugin?

Also, may I ask which DNS provider you went with? The GitHub issues pages for some of the DNS provider plugins seem to suggest that some are more frequently maintained than others.

bityard•1mo ago
Yep, but this comes with a tradeoff: all of your services now have a valid key/cert for your whole domain, significantly increasing the blast radius if one service is compromised.
silverwind•1mo ago
Not a problem if you have the cert on a shared load balancer, not on the services directly.
0127•1mo ago
This is what we do for development containers/hosts: putting them behind *.dev.example.com lets us hide most testing instances behind a shared load balancer. And with a single wildcard CNAME, no info is leaked in CT logs or DNS. Said LB is firewalled, but why pay for extra traffic that's just going to be blocked?
nh2•1mo ago
Is it technically possible to obtain a wildcard cert from LetsEncrypt, but then use OpenSSL / X.509 tooling to derive a restricted cert/key to be deployed on servers, which only works for specific domains under the wildcard?
alphager•1mo ago
No
xpe•1mo ago
Looking around at the comments, I have a birds-eye view. People are quite skilled at jumping to conclusions or assuming their POV is the only one. Consider this simplified scenario to illustrate:

    - X happened
    - Person P says "Ah, X happened."
    - Person Q interprets this in a particular way
        and says "Stop saying X is BAD!"
    - Person R, who already knows about X...
        (and indifferent to what others notice
         or might know or be interested in)
        ...says "(yawn)".
    - Person S narrowly looks at Person R and says
        "Oh, so you think Repugnant-X is ok?"
What a train wreck. Such failure modes are incredibly common. And preventable.* What a waste of the collective hours of attention and thinking we are spending here that we could be using somewhere else.

See also: the difference between positive and normative; charitable interpretations; not jumping to conclusions; not yucking someone else's yum

* So preventable that I am questioning the wisdom of spending time with any communication technology that doesn't actively address these failures. There is no point at blaming individuals when such failures are a near statistical certainty.

47282847•1mo ago
I agree with your analysis but try not to agree with your conclusion, purely for my own mental hygiene: I believe one can retrain the pattern matching of one's brain for happier outcomes. If I let my brain judge this as a "failure" (judgment: "it is wrong"), I will either get sad about it (judgment: "... and I can't change it") or angry (judgment: "... and I can do something about it"). In cases such as this I prefer to accept it as is, so I try to rewrite my brain's rule to consider it a necessary part of life (judgment: "true/good/correct").
xpe•1mo ago
Ah, in case it didn't come across clearly, my conclusion isn't to blame the individuals. My assessment is to seek out better communication patterns, which is partly about "technology" and partly about culture (expectations). People could indeed learn not to act this way with a bit of subtle nudging, feedback, and mechanism design.

I'm also pretty content accepting the unpleasant parts of reality without spin or optimism. Sometimes the better choice is still crappy, after all ;) I think Oliver Burkeman makes a fun and thoughtful case in "The Antidote: Happiness for People Who Can't Stand Positive Thinking" https://www.goodreads.com/book/show/13721709-the-antidote

matt3210•1mo ago
Your content is stolen for training the moment you put it up
jfindper•1mo ago
It is an _incredible_ stretch to frame certificate transparency logs as "content" in the creative sense.

The whole purpose of this data is to be consumed by 3rd-parties.

integralid•1mo ago
I don't see an issue with OAI scraping public logs.

But what GP probably meant is that OAI definitely uses this log to get a list of new websites in order to scrape them later. This is a pretty standard way to use CT logs - you get a list of domains to scrape instead of relying solely on hyperlinks.

advisedwang•1mo ago
matt3210 clearly means that the content of the website (revealed by the CT log) is what is being stolen, not the data in the CT log
ang_cire•1mo ago
I think their point is that the people registering certs may not intend their sites to be immediately scraped, but now OpenAI is bypassing e.g. google indexing or web spidering, and using your cert provider's CT entries to find you immediately for scraping.
0xdeadbeefbabe•1mo ago
It would be funny if your content disappeared when it was stolen.
prepend•1mo ago
If I give my content away for free, it can’t be stolen.

The point of putting up a public web site is so the public can view it (including OpenAI/google/etc).

If I don’t want people viewing it, then I don’t make it public.

Saying that things are stolen when they aren’t clouds the issue.

poormathskills•1mo ago
Is it still “scraping” when the purpose of these transparency logs is to be used for this purpose?
LeifCarrotson•1mo ago
The ostensible purpose of the certificate transparency logs is to allow validation of a certificate you're looking at - I browse to https://poormathskills.com and want to figure out the details of when its cert was issued.

The (presumably) unintended, unexpected purpose of the logs is to provide public notification of a website coming online for scrapers, search engines, and script kiddies to attack it: I could register https://verylongrandomdomainnameyoucantguess7184058382940052... and unwisely expect it to be unguessable, but as it turns out OpenAI is going to scrape it seconds after the certificate is issued.

mh-•1mo ago
Unintended: agreed. Unexpected: plenty of us called out this inevitability when the CT proposal was circulated.
rcxdude•1mo ago
The main thing isn't validating the cert you're looking at, per se; it's validating the activities of the issuers - mainly that they aren't issuing certificates they aren't supposed to (i.e. you can monitor the logs for your domain to check that some random CA you've never done business with hasn't issued a cert for it). This is mainly enforced by any violations (i.e. any certificates found that don't show up in the logs) being grounds for removal from browsers' trusted lists.
8cvor6j844qw_d6•1mo ago
Has anyone gone with wildcard certificates to avoid disclosing subdomains in certificate transparency logs?
toddgardner•1mo ago
If you want to learn more about certificate transparency logs and how to pull and search them, we just did a 3-part series about how we did this at CertKit: https://www.certkit.io/blog/searching-ct-logs
kirito1337•1mo ago
Yawn, I've seen this more than 1000 times.

Privacy doesn't exist in this world.

nephihaha•1mo ago
Of course it doesn't exist if you keep handing it away.
throwaway150•1mo ago
I don't understand the outrage in some of the comments. The certificate transparency logs are literally meant to be read by absolutely whoever wants to read them. The clue is right in the name. It's transparency logs! Transparency!

I just don't understand how people with no clue whatsoever about what's going on feel so confident expressing outrage over something they don't even understand! I don't mind someone not knowing something. Everybody has to learn things somewhere for the first time. But couldn't they just keep their outrage to themselves and take some time to educate themselves, to find out whether that outrage is actually well placed?

Some of the comments in the OP are also misinformed or illogical. But there's one guy there correcting them so that's good. I mean I'd say that https://en.wikipedia.org/wiki/Certificate_Transparency or literally any other post about CT is going to be far more informative than this OP!

prepend•1mo ago
People are just raging and want an outlet. They aren’t thinking logically.

It’s been going on forever (remember how companies were reading files off your computer aka cookies in 1999?)

This seems like a total non-issue and expect that any public files are scraped by OpenAI and tons others. If I don’t want something scraped, I don’t make it public.

bigbuppo•1mo ago
For many years now. The crawlers, scanners, and bots start hammering a website within a minute of a certificate being issued. Remember to get your garbage WCM installed and secured before installing the real certificate, as you have about a 15-second window before they're rooting around for fresh WordPress installs. Granted, you people are all smart enough to have all of that automated in a CI/CD pipeline, so that you just commit a single file with the domain name to a git repo and all that magic happens.
accelbred•1mo ago
Seems like you could set up a cert for a honeypot domain to collect the IPs of bots running off the certificate transparency logs. If the domain isn't linked from anywhere, then it's pretty sure to be a bot, isn't it?
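
A sketch of the idea (hypothetical names; any ACME client would do):

  # issue a real cert for a name that is linked from nowhere
  $ certbot certonly --standalone -d canary-zk83.example.com

  # then record who connects; traffic arriving within minutes
  # is almost certainly driven by the CT logs
  $ sudo tcpdump -nn 'dst port 443 and tcp[tcpflags] & tcp-syn != 0'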
bigiain•1mo ago
I usually get a cert for my public domain (root, and usually with www. as a Subject Alternative Name (SAN)), and if I'm going to use subdomains I don't intend to make widely public, I'll add a wildcard SAN of *.example.com so I don't have to expose the subdomains in transparency logs.

There's some security downside there if my web servers get hacked and my certs exfiltrated, but for a lot of stuff that tradeoff seems reasonable. I wouldn't recommend this approach if you were a bank or a government security agency or a drug cartel.

Bender•1mo ago
> It seems that OpenAI is scraping [certificate transparency] logs

They would be far from first. Any time I create a Wildcard cert in LE I immediately see a ton of sub-domain enumeration in my DNS query logs. Just for fun I create a bunch of wildcard certs for domains I do not even use just to keep their bots busy ... not used as in parked domains. This has been going on about as long as the CT logs have existed.