
Blocking Internet Archive Won't Stop AI, but Will Erase Web's Historical Record

https://www.eff.org/deeplinks/2026/03/blocking-internet-archive-wont-stop-ai-it-will-erase-webs-historical-record
56•pabs3•4h ago

Comments

xnx•1h ago
Does Internet Archive have a distributed residential IP crawler program? I would enthusiastically contribute to that.

There must be some mechanism to prevent tampering in such a setup.

progval•24m ago
The Internet Archive does not, but Archive Team does: https://wiki.archiveteam.org/index.php/ArchiveTeam_Warrior
SlinkyOnStairs•21m ago
Devil's advocate: Anyone seeking to limit AI scraping has little choice but to also block archivists.

And it's genuinely not that weird for news organisations to want to stop AI scraping. This is just a repeat of their fight with social media embedding.

Sure. The back catalogue should be as close to public domain as possible; libraries keeping those records is incredibly important for research.

But with current news, that becomes complicated as taking the articles and not paying the subscription (or viewing their ads) directly takes away the revenue streams that newsrooms rely on to produce the news. Hence the "Newspaper trying to ban linking" mess, which was never about the links themselves but about social media sites embedding the headline and a snippet, which in turn made all the users stop clicking through and "paying" for the article.

Social media relies on those newsrooms (as do, really, most other kinds of websites) to provide a lot of their content. And AI relies on them for all of the training data (remember: "synthetic data" does not appear ex nihilo) and to provide the news that AI users request. We can't just let the newsrooms die. The newsroom itself hasn't been replaced; its revenue has been destroyed.

---

And so, the question of archives pops up. Because yes, you can with some difficulty block out the AI bots, even the social media bots. A paywall suffices.

But this kills archiving. Yet if you whitelist the archives in some way, the AI scrapers will just pull their data out of the archive instead, and the newsrooms still die (which also makes the archiving moot).

A compromise solution might be for archives to accept/publish things on a delay: keep the AI companies from taking current news without paying up, while still granting everyone access to material from decades ago.

There's just major disagreement about what a reasonable delay is. Most major news orgs and other such IP-holders are pretty upset about AI firms' "steal first, ask permission later" approach. Several AI firms setting the precedent that training data is to be paid for doesn't help here either. By paying for training data, they've created a significant market for archives, and a significant incentive not to make them freely accessible to the public.

Why would The Times ever hand over their catalogue to the Internet Archive if Amazon will pay them a significant sum of money for it? The greater good of all humanity? Good luck getting that from a dying industry.

---

Tangent: Another annoying wrinkle in the financial incentives here is that not all archiving organisations are engaging in fair play, which yet further pushes people to obstruct their work.

To cite a HN-relevant example: source code archivist "Software Heritage" has long engaged in holding a copy of all the source code they can get their hands on, regardless of its license. If it's ever been on GitHub, odds are they're distributing it, even when licenses explicitly forbid that. (This is, of course, perfectly legal in the case of actual research and other fair use. But:)

They were notably involved in HuggingFace's "The Stack" project, sharing their archives ... and receiving money from HuggingFace. While the latter is nominally a donation, it is in effect a sale.

---

I find it quite displeasing that the EFF fails to identify the incentives at play here. Simply trying to nag everyone into "doing the thing for the greater good!" is loathsome and doesn't work. Unless we change this incentive structure, the outcome won't change.

user_7832•19m ago
> But in recent months The New York Times began blocking the Archive from crawling its website, using technical measures that go beyond the web’s traditional robots.txt rules. That risks cutting off a record that historians and journalists have relied on for decades. Other newspapers, including The Guardian, seem to be following suit.

I'm a bit surprised I never read about this till now; while disappointing, it is unfortunately not surprising.

> The Times says the move is driven by concerns about AI companies scraping news content. Publishers seek control over how their work is used, and several—including the Times—are now suing AI companies over whether training models on copyrighted material violates the law. There’s a strong case that such training is fair use.

I suspect part of it might be these corps not wanting people to skip a paywall (whether or not someone would pay even if they had no access is a different story). But this argument makes no sense for the Guardian.
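For context on the "traditional robots.txt rules" the quoted article contrasts with the Times's technical blocking: the long-standing, voluntary way to opt out of Internet Archive crawling has been a robots.txt rule targeting the crawler's user agent. A minimal sketch (the `ia_archiver` token is the user agent the Archive has historically honored; treat the exact token as an assumption):

```
# Hypothetical robots.txt — the "traditional" opt-out mechanism.
# Asks the Internet Archive's crawler (historically "ia_archiver")
# not to fetch anything, while leaving other crawlers unaffected.
User-agent: ia_archiver
Disallow: /

# Everyone else may crawl normally.
User-agent: *
Disallow:
```

Going "beyond robots.txt," as the article describes, means server-side measures (IP blocks, fingerprinting, paywalls) that the crawler cannot politely decline to honor.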

user_7832•16m ago
I went to the Guardian's website to cross-check their motto (I was confusing it with WaPo's) and got served this (hilarious? sad?) banner, as if blocking cross-website tracking were somehow bad.

> Rejection hurts … You’ve chosen to reject third-party cookies while browsing our site. Not being able to use third party cookies means we make less from selling adverts to fund our journalism.
>
> We believe that access to trustworthy, factual information is in the public good, which is why we keep our website open to all, without a paywall.
>
> If you don’t want to receive personalised ads but would still like to help the Guardian produce great journalism 24/7, please support us today. It only takes a minute. Thank you.

OpenCode – Open source AI coding agent

https://opencode.ai/
869•rbanffy•14h ago•396 comments

Mamba-3

https://www.together.ai/blog/mamba-3
142•matt_d•3d ago•21 comments

Atuin v18.13 – better search, a PTY proxy, and AI for your shell

https://blog.atuin.sh/atuin-v18-13/
21•cenanozen•1h ago•1 comment

FFmpeg 101 (2024)

https://blogs.igalia.com/llepage/ffmpeg-101/
103•vinhnx•9h ago•1 comment

A Japanese glossary of chopsticks faux pas (2022)

https://www.nippon.com/en/japan-data/h01362/
291•cainxinth•14h ago•233 comments

Molly Guard

https://bookofjoe2.blogspot.com/2026/02/molly-guard.html
120•surprisetalk•21h ago•48 comments

Ghostling

https://github.com/ghostty-org/ghostling
235•bjornroberg•13h ago•40 comments

Fujifilm X RAW STUDIO webapp clone

https://github.com/eggricesoy/filmkit
66•notcodingtoday•2d ago•27 comments

Linux Applications Programming by Example: The Fundamental APIs (2nd Edition)

https://github.com/arnoldrobbins/LinuxByExample-2e
107•teleforce•11h ago•12 comments

Padel Chess – tactical simulator for padel

https://www.padelchess.me/
34•AlexGerasim•3d ago•17 comments

We rewrote our Rust WASM parser in TypeScript and it got faster

https://www.openui.com/blog/rust-wasm-parser
224•zahlekhan•14h ago•138 comments

The Los Angeles Aqueduct Is Wild

https://practical.engineering/blog/2026/3/17/the-los-angeles-aqueduct-is-wild
368•michaefe•3d ago•180 comments

We give every user SQL access to a shared ClickHouse cluster

https://trigger.dev/blog/how-trql-works
7•eallam•3d ago•2 comments

Blocking Internet Archive Won't Stop AI, but Will Erase Web's Historical Record

https://www.eff.org/deeplinks/2026/03/blocking-internet-archive-wont-stop-ai-it-will-erase-webs-h...
57•pabs3•4h ago•5 comments

Cryptography in Home Entertainment (2004)

https://mathweb.ucsd.edu/~crypto/Projects/MarkBarry/
51•rvnx•2d ago•30 comments

Attention Residuals

https://github.com/MoonshotAI/Attention-Residuals
181•GaggiX•17h ago•25 comments

The worst volume control UI in the world (2017)

https://uxdesign.cc/the-worst-volume-control-ui-in-the-world-60713dc86950
144•andsoitis•3d ago•72 comments

The Ugliest Airplane: An Appreciation

https://www.smithsonianmag.com/air-space-magazine/ugliest-airplane-appreciation-180978708/
76•randycupertino•2d ago•44 comments

An industrial piping contractor on Claude Code [video]

https://twitter.com/toddsaunders/status/2034243420147859716
64•mighty-fine•2d ago•21 comments

Show HN: We built a terminal-only Bluesky / AT Proto client written in Fortran

https://github.com/FormerLab/fortransky
89•FormerLabFred•13h ago•45 comments

Turing Award Honors Bennett and Brassard for Quantum Information Science

https://amturing.acm.org
43•throw0101d•2d ago•0 comments

France's aircraft carrier located in real time by Le Monde through fitness app

https://www.lemonde.fr/en/international/article/2026/03/20/stravaleaks-france-s-aircraft-carrier-...
577•MrDresden•22h ago•469 comments

VisiCalc Reconstructed

https://zserge.com/posts/visicalc/
211•ingve•4d ago•77 comments

The Story of Marina Abramovic and Ulay (2020)

https://www.sydney-yaeko.com/artsandculture/marina-and-ulay
5•NaOH•2d ago•2 comments

Lent and Lisp

https://leancrew.com/all-this/2026/02/lent-and-lisp/
62•surprisetalk•2d ago•3 comments

Why One Key Shouldn't Rule Them All: Threshold Signatures for the Rest of Us

https://eric.mann.blog/why-one-key-shouldnt-rule-them-all-threshold-signatures-for-the-rest-of-us/
11•eamann•2d ago•6 comments

Our commitment to Windows quality

https://blogs.windows.com/windows-insider/2026/03/20/our-commitment-to-windows-quality/
541•hadrien01•16h ago•974 comments

ArXiv declares independence from Cornell

https://www.science.org/content/article/arxiv-pioneering-preprint-server-declares-independence-co...
761•bookstore-romeo•1d ago•266 comments

Entso-E final report on Iberian 2025 blackout

https://www.entsoe.eu/publications/blackout/28-april-2025-iberian-blackout/
199•Rygian•1d ago•96 comments

Delve – Fake Compliance as a Service

https://deepdelver.substack.com/p/delve-fake-compliance-as-a-service
707•freddykruger•1d ago•225 comments