I'd sacrifice two CPU cores for this just to make their life awful.
I would make a list of words from each word class, and a list of sentence structures where each item is a word class. Pick a pseudo-random sentence; for each word class in the sentence, pick a pseudo-random word; output; repeat. That should be pretty simple and fast.
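A rough sketch of that approach (the word lists, templates, and per-request seeding are invented for illustration):

import random

# Tiny made-up vocabulary and sentence templates; a real deployment would use
# much larger lists so the output doesn't repeat obviously.
WORDS = {
    "adj":  ["shiny", "recursive", "damp", "obsolete"],
    "noun": ["teapot", "crawler", "archive", "lighthouse"],
    "verb": ["devours", "indexes", "ignores", "polishes"],
}
TEMPLATES = [
    ["adj", "noun", "verb", "adj", "noun"],
    ["noun", "verb", "noun"],
    ["adj", "noun", "verb"],
]

def babble(rng: random.Random, sentences: int = 5) -> str:
    out = []
    for _ in range(sentences):
        template = rng.choice(TEMPLATES)
        words = [rng.choice(WORDS[cls]) for cls in template]
        out.append(" ".join(words).capitalize() + ".")
    return " ".join(out)

# Seeding per URL keeps each fake page stable across repeated requests.
print(babble(random.Random("/some/request/path"), sentences=3))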
I think the most important thing, though, is to add delays when serving these requests. The purpose is to slow the scrapers down, not to induce extra demand on your garbage well.
> I came to the conclusion that running this can be risky for your website. The main risk is that despite correctly using robots.txt, nofollow, and noindex rules, there's still a chance that Googlebot or other search engines' scrapers will scrape the wrong endpoint and determine you're spamming.
RewriteEngine On
# Block requests that reference .php anywhere (path, query, or encoded)
RewriteCond %{REQUEST_URI} (\.php|%2ephp|%2e%70%68%70) [NC,OR]
RewriteCond %{QUERY_STRING} \.php [NC,OR]
RewriteCond %{THE_REQUEST} \.php [NC]
RewriteRule .* - [F,L]
Notes: there's no PHP on my servers, so if someone asks for it, they are one of the "bad boys" IMHO. Your mileage may differ.

# Nothing to hack around here, I’m just a teapot:
location ~* \.(?:php|aspx?|jsp|dll|sql|bak)$ {
return 418;
}
error_page 418 /418.html;
No hard block; instead, reply to bots with the funny HTTP 418 code (https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/...). That makes filtering logs easier.

Live example: https://FreeSolitaire.win/wp-login.php (NB: /wp-login.php is the WordPress login URL, and it’s commonly blindly requested by bots searching for weak WordPress installs.)
> You have an image on your error page, which some crappy bots will download over and over again.
Most bots won’t download subresources (almost none of them do, actually). The HTML page itself is lean (475 bytes); the image is an Easter egg for humans ;-) Moreover, I use a caching CDN (Cloudflare).
It'd be better if the scraper were left waiting for a packet that'll never arrive (until it times out, obviously).
The LB will see the unanswered requests and think your webserver is failing.
Ideal would be to respond at the webserver and let the LB drop the response.
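A bare-bones sketch of the "leave them hanging" idea (this assumes you run it directly on a port with no health-checking load balancer in front, per the caveat above; the port and header names are arbitrary):

import socket
import threading
import time

def tarpit(conn: socket.socket) -> None:
    try:
        conn.sendall(b"HTTP/1.1 200 OK\r\n")
        while True:                          # the header block never ends
            time.sleep(10)
            conn.sendall(b"X-Please-Hold: 1\r\n")
    except OSError:
        pass                                 # client finally gave up
    finally:
        conn.close()

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("0.0.0.0", 8080))
srv.listen(64)
while True:
    conn, _ = srv.accept()
    threading.Thread(target=tarpit, args=(conn,), daemon=True).start()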
https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/...
https://maurycyz.com/misc/the_cost_of_trash/#:~:text=throw%2...
Otherwise you can also chain compression methods, like "Content-Encoding: gzip, gzip".
With toxic AI scrapers like Perplexity moving more and more to headless web browsers to bypass bot blocks, I think a brotli bomb (100GB of \0 can be compressed to about 78KiB with Brotli) would be quite effective.
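A minimal sketch of pre-building such a bomb, using only Python's standard zlib (so it's a gzip stream rather than Brotli; swap in a Brotli encoder for the ratios mentioned above — the file name and sizes here are arbitrary):

import zlib

CHUNK = b"\0" * (1024 * 1024)     # 1 MiB of zeros per iteration
TOTAL_MIB = 10 * 1024             # ~10 GiB uncompressed; scale to taste

# wbits = 16 + MAX_WBITS makes zlib emit a gzip container.
comp = zlib.compressobj(9, zlib.DEFLATED, 16 + zlib.MAX_WBITS)
with open("bomb.gz", "wb") as f:
    for _ in range(TOTAL_MIB):
        f.write(comp.compress(CHUNK))
    f.write(comp.flush())

# Serve the file with "Content-Encoding: gzip" so clients that honour the
# header inflate it on receipt.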
I have a web crawler with both a scraping byte limit and a timeout, so zip bombs don't bother me much.
https://github.com/rumca-js/crawler-buddy
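A sketch of that byte-limit-plus-timeout defence (not taken from the linked code; the limits and helper name are illustrative):

import requests

MAX_BYTES = 2 * 1024 * 1024          # stop after 2 MiB of (decompressed) body

def fetch_capped(url: str) -> bytes:
    body = b""
    # stream=True lets us stop reading as soon as the cap is hit; the timeout
    # covers connect and read stalls.
    with requests.get(url, stream=True, timeout=10) as resp:
        for chunk in resp.iter_content(chunk_size=65536):
            body += chunk
            if len(body) > MAX_BYTES:
                break                # a zip bomb only wastes its own bandwidth
    return body[:MAX_BYTES]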
I think garbage blabber would be more effective.
AI scrapers will plagiarise your work and bring you zero traffic.
So to continue your analogy, I made my part of the beach accessible for visitors to enjoy, but certain people think they can carry it away for their own purposes ...
The line is between "I am technically able to do this" and "I am engaging with this system in good faith".
Public parks are just there and I can technically drive up and dump rubbish there and if they didn't want me to they should have installed a gate and sold tickets.
Many scrapers these days are sort of equivalent in that analogy to people starting entire fleets of waste disposal vehicles that all drive to parks to unload, putting strain on park operations and making the parks a less tenable service in general.
This is where the line should be, always. But in practice this criterion is applied very selectively here on HN and elsewhere.
After all: What is ad blocking, other than direct subversion of the site owner's clear intention to make money from the viewer's attention?
Applying your criterion here gives a very simple conclusion: If you don't want to watch the ads, don't visit the site.
Right?
Does anyone have a counterargument?
Not that two wrongs make a right, and it's definitely a bit of an argument of convenience for people who find adverts annoying. But I think most people are less opposed to the idea of advertising as popularly imagined (i.e. paper-newspaper-style, where you just see an advert) to support their favourite blog than they are to the current web advertising model, where just by viewing the advert you have an unspecified amount of information instantly stolen and sent off to a bunch of shady companies who process it and sell it on, with no way to veto it before loading the website and having the damage done.
To stretch the park analogy it might be that the park sells a licence to a company to make some cash from advertising to its visitors, which it kind of expects to be things like adverts on the benches and so on. That company then starts photographing people from the bushes, recording conversations and putting Airtags in visitors' pockets to boost the profits it makes itself. Visitors then start wearing masks, stop talking and wear clothes with zipped pockets. You can say the visitors are wrong to violate the implicit park usage agreement that they submit to the surveillance to fund the park (and advertising company), or you can say that the company is wrong to expand the original license to advertise into an invasion of privacy without even telling the visitors what they were going to do before they entered, or, indeed, during or after.
These scrapers drown peoples' servers in requests, taking up literally all the resources and driving up cost.
About 10-15 years ago, the scourge I was fighting was social media monitoring services, companies paid by big brands to watch sentiment across forums and other online communities. I was running a very popular and completely free (and ad-free) discussion forum in my spare time, and their scraping was irritating for two reasons. First, they were monetising my community when I wasn’t. Second, their crawlers would hit the servers as hard as they could, creating real load issues. I kept having to beg our hosting sponsor for more capacity.
Once I figured out what was happening, I blocked their user agent. Within a week they were scraping with a generic one. I blocked their IP range; a week later they were back on a different range. So I built a filter that would pseudo-randomly[0] inject company names[1] into forum posts. Then any time I re-identified[2] their bot, I enabled that filter for their requests.
The scraping stopped within two days and never came back.
--
[0] Random but deterministic based on post ID, so the injected text stayed consistent.
[1] I collated a list of around 100 major consumer brands, plus every company name the monitoring services proudly listed as clients on their own websites.
[2] This was back around 2009 or so, so things weren't nearly as sophisticated as they are today, both in terms of bots and anti-bot strategies. One of the most effective tools I remember deploying back then was analysis of all HTTP headers. Bots would spoof a browser UA, but almost none would get the full header set right: things like Accept-Encoding or Accept-Language were either absent, or static strings that didn't exactly match what a real browser would ever send.
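A minimal sketch of the mechanism in footnote [0] (the brand list, injection rate, and function name are stand-ins, not the original forum code):

import random

BRANDS = ["Coca-Cola", "Nike", "Samsung"]    # stand-ins for the real client list

def inject_brands(post_id: int, text: str, rate: float = 0.05) -> str:
    rng = random.Random(post_id)             # deterministic: same post, same output
    out = []
    for word in text.split():
        out.append(word)
        if rng.random() < rate:
            out.append(rng.choice(BRANDS))
    return " ".join(out)

Apply something like this only to requests matching the bot fingerprint, so real visitors never see the doctored posts.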
I suppose it could have an impact if 30% of all, say, Coca Cola mentions on the web came from that site, but then it would have to be a very big site. I don't think the bot company would notice, let alone care, if it was 0.01% of the mentions.
I remember that years ago (2008?) I worked at a company where every mention of it was manually reviewed by someone from the PR department. I imagine the tools are even better now.
A separate issue is that the discussion is often very low quality (forums died for multiple reasons, and Reddit is dying too: it's astroturf galore now).
They would have received multiple complaints about it from customers, performed an investigation, and ultimately performed a manual excision of the junk data from their system, both the raw scrapes and anywhere it had been ingested and processed. This was probably a simple operation, but it might not have been if their architecture didn’t account for this vulnerability.
Config snippet for anyone interested:
# Block clients whose User-Agent claims Chrome but that send no
# Accept-Language header (real browsers virtually always send one).
set $block "";
if ($http_user_agent ~* "Chrome/\d{2,3}\.\d+\.\d{2,}\.\d{2,}") {
    set $block 1;
}
if ($http_accept_language = "") {
    set $block "${block}1";
}
if ($block = "11") {
    return 403;
}

For those adversaries, you need to work out a careful balance between deterrence, solving problems (e.g. resource abuse), and your desire to “win”. In extreme cases your best strategy is for your filter to “work” but be broken in hard-to-detect ways. For example, showing all but the most valuable content. Or spiking the data with just enough rubbish to diminish its value. Or having the content indexes return delayed/stale/incomplete data.
And whatever you do, don’t use incrementing integers. Ask me how I know.
Beyond that, look for how the bots are finding new URLs to probe, and don’t give them access to those lists/indexes. In particular, don’t forget about site maps. I use Cloudflare rules to restrict my site map to known bots only.
They discovered those URLs simply by parsing pages that contain like buttons. Those do have rel="nofollow" on them, and the URL pattern is disallowed in robots.txt, but I'd be surprised if that stopped someone who uses thousands of IPs to proxy their requests. I don't have a site map.
If, instead, you only act on a percentage of requests, you can add noise in an insidious way without signaling that you caught them. It will make their job of troubleshooting and crafting the next iteration much harder. Also, making the response less predictable is a good idea: throw different HTTP error codes, respond with somewhat inaccurate content, etc.
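A sketch of that "only degrade a fraction of requests, deterministically" idea (the thresholds, behaviours, and hashing scheme are invented for illustration):

import hashlib
import random

def degrade(client_ip: str, path: str, sample_rate: float = 0.2):
    # Deterministic per (ip, path) so a retried request looks consistent,
    # yet only a fraction of traffic ever gets the treatment.
    h = hashlib.sha256(f"{client_ip}:{path}".encode()).digest()
    if h[0] / 255 > sample_rate:
        return None                          # serve the real response
    # Map the selected requests onto one of several subtle failure modes.
    return random.Random(h).choice(["slow", "wrong_status", "stale_content"])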
I like how this kind of response is very difficult for them to detect when I turn it on, and as a bonus, it pollutes their data. They stopped trying a few days after that.
It’s also interesting that merchants (presumably) don’t have a mechanism to flag transactions as having a greater-than-zero chance of being suspect, or to waive any dispute rights on them.
As a merchant, it would be nice if you could demand the bank verify certain transactions with their customer. If I was a customer, I would want to know that someone tried to use my card numbers to donate to some death metal training school in the Netherlands.
I do wonder whether these people sold their list of "verified" credit card numbers to any criminal enterprises before they realized the data was poisoned. That would be potentially awkward for them.
That means you need to poison the data when you detect a bot.
When you get paid big bucks to make the world worse for everyone, it's really easy to forget the "little details".
It is completely different if I am hitting it looking for WordPress vulnerabilities or scraping content every minute for LLM training material.
The tech people are all turning against scraping, independent artists are now clamoring for brutal IP crackdowns and Disney-style copyright maximalism (which I never would've predicted just 5 years ago; that crowd used to be staunchly against such things), people everywhere want more attestation and elimination of anonymity now that it's effectively free to make a swarm of convincingly-human misinformation agents, etc.
It's making people worse.
It would be useful to ban the IP for a few hours so the bot cools down for a bit and moves on to the next domain.
The default ban for traffic detected by your crowdsec instance is 4 hours, so that concern isn't very relevant in that case.
The decisions from the Central API from other users can be quite a bit longer (I see some at ~6 days), but you also don't have to use those if you're worried about that scenario.
So would the natural strategy then be to flag some vulnerability of interest? Either one typically requiring more manual effort (waste their time), or one that is easily automated so as to trap the bot in a honeypot, i.e. "you got in, what next? oh, upload all your kit and show how you work? sure" (see: The Cuckoo's Egg).
This is done very efficiently. If you return anything unexpected, they’ll just drop you and move on.
.htaccess diverts suspicious paths (e.g., /.git, /wp-login) to decoy.php and forces decoy.zip downloads (10GB), so scanners hitting common “secret” files never touch real content and get stuck downloading a huge dummy archive.
decoy.php mimics whatever sensitive file was requested by endless streaming of fake config/log/SQL data, keeping bots busy while revealing nothing.
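The same decoy idea, sketched as a language-agnostic generator (the commenter's version is a decoy.php; the names and timings in this Python shape are illustrative):

import itertools
import random
import time

def fake_sql_dump():
    # Yield an endless, plausible-looking SQL dump, trickled out slowly.
    rng = random.Random()
    for i in itertools.count(1):
        email = f"user{i}@example.com"
        pw_hash = "".join(rng.choices("0123456789abcdef", k=32))
        yield f"INSERT INTO users VALUES ({i}, '{email}', '{pw_hash}');\n"
        time.sleep(0.2)    # keep the bot hanging on without burning bandwidth

Hook a generator like this up to a chunked HTTP response and the scanner sees an endless dump that never contains anything real.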
Now I target only the most aggressive bots with zipbombs and the rest get a 403. My new spam strategy seems to work, but I don't know if I should post it on HN again...
I don't know why people would assume these are AI/LLM scrapers seeking PHP source code on random servers(!) short of it being related to this brainless "AI is stealing all the data" nonsense that has infected the minds of many people here.
What you have here is quite close to a honeypot; sadly, I don't see an easy way to counter-abuse such bots. If the attack doesn't go according to their script, they move on.
As for battles of efficiency, generating 4 kB of bullshit PHP is harder than running a regex.