Inside The Internet Archive's Infrastructure

https://hackernoon.com/the-long-now-of-the-web-inside-the-internet-archives-fight-against-forgetting
77•dvrp•1d ago
https://github.com/internetarchive/heritrix3

Comments

BryantD•1h ago
They have come a very long way since the late 1990s when I was working there as a sysadmin and the data center was a couple of racks plus a tape robot in a back room of the Presidio office with an alarmingly slanted floor. The tape robot vendor had to come out and recalibrate the tape drives more often than I might have wanted.
textfiles•38m ago
There is a fundamental resistance to tape technology that exists to this day as a result of all those troubles.
brcmthrowaway•1h ago
Does IA do deduplication?
textfiles•39m ago
Not in the way I think you're talking about. The archive has always tried to maintain a situation where the racks could be pushed out of the door or picked up after being somewhere and the individual drives will contain complete versions of the items. We have definitely reached out to people who seem to be doing redundant work and asked them to stop, or asked for permission to remove the redundant item. But that's a pretty curatorial process.
HumanOstrich•39m ago
Yes[1].

[1]: The Article, Paragraph 2

hedora•32m ago
It's frustrating that there's no way for people to (selectively) mirror the Internet Archive. $25-30M per year is a lot for a non-profit, but it's nothing for government agencies, or private corporations building Gen AI models.

I suspect having a few different teams competing (for funding) to provide mirrors would rapidly reduce the hardware cost too.

The density + power dissipation numbers quoted are extremely poor compared to enterprise storage. Hardware costs for the enterprise systems are also well below AWS (even assuming a short 5-year depreciation cycle on the enterprise boxes). Neither this article nor the vendors publish enough pricing information to do a thorough total cost of ownership analysis, but I can imagine someone the size of IA would not be paying normal margins to their vendors.

toomuchtodo•26m ago
Pick the items you want to mirror and seed them via their torrent file.

https://github.com/jjjake/internetarchive

https://archive.org/services/docs/api/internetarchive/cli.ht...

ia search 'format:"Archive BitTorrent"' --itemlist > itemlist.txt

Note that this query returns more than 50M items, so the command will take a very long time to complete (results are returned in 10k chunks). You'll probably also want to add something like `--timeout 300` so you don't get halfway through only for the command to fail with a timeout.
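Once itemlist.txt exists, turning identifiers into torrent links is mostly string work. A minimal sketch, assuming each item's torrent lives at `https://archive.org/download/<identifier>/<identifier>_archive.torrent` (that URL pattern is my assumption and isn't stated above; `torrent_urls` is a made-up helper name, not part of the `ia` tool):

```python
from urllib.parse import quote


def torrent_urls(itemlist_lines):
    """Yield (identifier, torrent_url) pairs for each non-empty line.

    ASSUMPTION: the per-item torrent is served at
    /download/<identifier>/<identifier>_archive.torrent on archive.org.
    """
    for line in itemlist_lines:
        identifier = line.strip()
        if not identifier:
            continue  # skip blank lines in the item list
        safe = quote(identifier, safe="")  # percent-encode anything unusual
        yield identifier, f"https://archive.org/download/{safe}/{safe}_archive.torrent"
```

Feed it the open itemlist.txt file handle (or any iterable of lines) and hand the resulting URLs to whatever torrent client you already use.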

u/stavros wrote a design doc for a system (Codename "Elephant") that would scale this up: https://news.ycombinator.com/item?id=45559219

My mental model on this is Anna's Archive's torrent page meets ArchiveTeam's Warrior. Have disk? VM starts up, picks least seeded items from public endpoint, replicates, starts serving, and coverage is constantly reported back by the archive swarm.
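To make the "picks least seeded items" step concrete, here is a rough sketch of how a volunteer node might choose what to replicate. Everything in it is hypothetical: the `SwarmItem` fields, the idea that a public endpoint reports seeder counts and sizes, and the greedy strategy are my assumptions, not anything IA or the Elephant design doc specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SwarmItem:
    # Hypothetical record a coverage endpoint might return per item.
    identifier: str
    size_bytes: int
    seeders: int


def pick_items(items, free_bytes):
    """Greedily choose the least-seeded items that still fit on disk.

    Items are visited in order of (fewest seeders, smallest size);
    anything larger than the remaining space is skipped.
    """
    chosen = []
    remaining = free_bytes
    for item in sorted(items, key=lambda i: (i.seeders, i.size_bytes)):
        if item.size_bytes <= remaining:
            chosen.append(item)
            remaining -= item.size_bytes
    return chosen
```

Greedy selection is not optimal packing, but it keeps the priority obvious: rare items first, disk space second.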

(no affiliation, I am just a rando; if you are a library, museum, or similar institution, ask IA to drop some racks at your colo for replication, and as always, don't forget to donate to IA when able to)

philipkglass•19m ago
I would like to be able to pull content out of the Wayback Machine with a proper API [1]. I'd even be willing to pay a combination of per-request and per-gigabyte fees to do it. But then I think about the Archive's special status as a non-profit library, and I'm not sure that offering paid API access (even just to cover costs) is compatible with the organization as it exists.

[1] It looks like this might exist at some level, e.g. https://github.com/hartator/wayback-machine-downloader, but I've been trying to use this for a couple of weeks and every day I try I get a 5xx error or "connection refused".

toomuchtodo•4m ago
https://archive.org/help/wayback_api.php

https://akamhy.github.io/waybackpy/

https://github.com/helgeho/ArchiveSpark/blob/master/notebook...

https://wiki.archiveteam.org/index.php/Restoring
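For the first link, a small sketch of the availability lookup described on that help page: build the query URL, then pull the closest snapshot out of the JSON reply. I'm assuming the endpoint is `https://archive.org/wayback/available` with `url` and `timestamp` parameters and an `archived_snapshots.closest` object in the response; the function names are mine. This only finds a capture, it doesn't bulk-download anything.

```python
import json
from urllib.parse import urlencode

# ASSUMPTION: endpoint and parameter names as described on the Wayback API help page.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(url, timestamp=None):
    """Build the lookup URL; timestamp is an optional YYYYMMDD[hhmmss] string."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def closest_snapshot(response_text):
    """Return (snapshot_url, timestamp) from a JSON reply, or None if no capture."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["url"], closest["timestamp"]
```

Fetch the URL with any HTTP client, pass the body to `closest_snapshot`, and add retries with backoff around the request if you keep hitting 5xx responses.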

cowhax•24m ago
>And the rising popularity of generative AI adds yet another unpredictable dimension to the future survival of the public domain archive.

I'd say the nonprofit has found itself a profitable reason for its existence

Apple is fighting for TSMC capacity as Nvidia takes center stage

https://www.culpium.com/p/exclusiveapple-is-fighting-for-tsmc
380•speckx•5h ago•256 comments

CVEs Affecting the Svelte Ecosystem

https://svelte.dev/blog/cves-affecting-the-svelte-ecosystem
86•tobr•2h ago•12 comments

JuiceFS is a distributed POSIX file system built on top of Redis and S3

https://github.com/juicedata/juicefs
24•tosh•1h ago•14 comments

Ask HN: How can we solve the loneliness epidemic?

124•publicdebates•3h ago•222 comments

Claude is good at assembling blocks, but still falls apart at creating them

https://www.approachwithalacrity.com/claude-ne/
59•bblcla•1d ago•36 comments

25 Years of Wikipedia

https://wikipedia25.org
324•easton•6h ago•281 comments

First impressions of Claude Cowork

https://simonw.substack.com/p/first-impressions-of-claude-cowork
62•stosssik•1d ago•25 comments

Design and Implementation of Sprites

https://fly.io/blog/design-and-implementation/
74•sethev•4h ago•55 comments

Supply Chain Vuln Compromised Core AWS GitHub Repos & Threatened the AWS Console

https://www.wiz.io/blog/wiz-research-codebreach-vulnerability-aws-codebuild
35•uvuv•2h ago•2 comments

Claude Cowork runs Linux VM via Apple virtualization framework

https://gist.github.com/simonw/35732f187edbe4fbd0bf976d013f22c8
39•jumploops•1d ago•18 comments

UK offshore wind prices come in 40% cheaper than gas in record auction

https://electrek.co/2026/01/14/uk-offshore-wind-record-auction/
42•doener•1h ago•12 comments

Show HN: Tabstack – Browser infrastructure for AI agents (by Mozilla)

65•MrTravisB•1d ago•8 comments

Show HN: OpenWork – an open-source alternative to Claude Cowork

https://github.com/different-ai/openwork
34•ben_talent•1d ago•9 comments

Found: Medieval Cargo Ship – Largest Vessel of Its Kind Ever

https://www.smithsonianmag.com/smart-news/archaeologists-say-theyve-unearthed-a-massive-medieval-...
73•bookofjoe•4h ago•14 comments

Show HN: TinyCity – A tiny city SIM for MicroPython (Thumby micro console)

https://github.com/chrisdiana/TinyCity
97•inflam52•5h ago•16 comments

The URL shortener that makes your links look as suspicious as possible

https://creepylink.com/
716•dreadsword•16h ago•133 comments

‘ELITE’: The Palantir app ICE uses to find neighborhoods to raid

https://werd.io/elite-the-palantir-app-ice-uses-to-find-neighborhoods-to-raid/
166•sdoering•1h ago•85 comments

Zuck#: A programming language for connecting the world. And harvesting it

https://jayzalowitz.github.io/zucksharp/
44•kf•1h ago•21 comments

Goscript: Transpile Go to human-readable TypeScript

https://github.com/aperturerobotics/goscript
12•aperturecjs•4d ago•3 comments

Jiga (YC W21) Is Hiring Full Stack Engineers

https://jiga.io/about-us
1•grmmph•8h ago

The 3D Software Rendering Technology of 1998's Thief: The Dark Project (2019)

https://nothings.org/gamedev/thief_rendering.html
112•suioir•9h ago•48 comments

OBS Studio 32.1.0 Beta 1 available

https://github.com/obsproject/obs-studio/releases/tag/32.1.0-beta1
123•Sean-Der•5h ago•33 comments

Sinclair C5

https://en.wikipedia.org/wiki/Sinclair_C5
74•jszymborski•4d ago•47 comments

Ask HN: Anyone have a good solution for modern Mac to legacy SCSI converters?

14•stmw•1h ago•27 comments

Ask HN: Share your personal website

800•susam•1d ago•2143 comments

GitHub Incident

https://www.githubstatus.com/incidents/q987xpbqjbpl
97•aggrrrh•3h ago•73 comments

Italy's privacy watchdog, scourge of US big tech, hit by corruption probe

https://www.reuters.com/sustainability/boards-policy-regulation/italys-privacy-watchdog-scourge-u...
42•giuliomagnifico•2h ago•12 comments

Programming, Evolved: Lessons and Observations

https://github.com/kulesh/dotfiles/blob/main/dev/dev/docs/programming-evolved.md
42•dnw•6h ago•22 comments

Show HN: ContextFort – Visibility and controls for browser agents

https://contextfort.ai/
8•ashwinr2002•1d ago•1 comment