Inside The Internet Archive's Infrastructure

https://hackernoon.com/the-long-now-of-the-web-inside-the-internet-archives-fight-against-forgetting

456•dvrp•3w ago

https://github.com/internetarchive/heritrix3

Comments

BryantD•3w ago

They have come a very long way since the late 1990s when I was working there as a sysadmin and the data center was a couple of racks plus a tape robot in a back room of the Presidio office with an alarmingly slanted floor. The tape robot vendor had to come out and recalibrate the tape drives more often than I might have wanted.

textfiles•3w ago

There is a fundamental resistance to tape technology that exists to this day as a result of all those troubles.

EvanAnderson•3w ago

That's sad, but it mirrors my experience with commercial customers. Tape is so fiddly but the cost efficiency for large amounts of data and at-rest stability is so good. Tape is caught in a spiral of decreasing market share so industry has no incentive to optimize it.

Edit: Then again, I recently heard a podcast that talked about the relatively good at-rest stability of SATA hard disk drives stored outdoors. >smile<

duskwuff•3w ago

Tape is also an extraordinarily poor option for a service like Internet Archive which intends to provide interactive, on-demand access to its holdings.

stonogo•3w ago

This is a common use for tape, which can via tools like HPSS have a couple petabytes of disk in front of it, and present the whole archive in a single POSIX filesystem namespace, handling data migration transparently and making sure hot data is kept on low-latency storage.
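A minimal sketch of the disk-in-front-of-tape idea, assuming a hypothetical `TieredStore` class. It is not how HPSS works internally; it only models a small least-recently-used disk tier over a slow backing store, with a dict standing in for the tape archive.

```python
from collections import OrderedDict


class TieredStore:
    """Toy model: a small LRU 'disk' cache in front of a slow 'tape' store."""

    def __init__(self, tape, disk_capacity):
        self.tape = tape                    # dict standing in for the tape archive
        self.disk = OrderedDict()           # hot tier, ordered by recency of use
        self.disk_capacity = disk_capacity
        self.tape_reads = 0                 # counts slow-path recalls from tape

    def read(self, name):
        if name in self.disk:
            self.disk.move_to_end(name)     # hit: mark as most recently used
            return self.disk[name]
        self.tape_reads += 1                # miss: recall the file from tape
        data = self.tape[name]
        self.disk[name] = data
        if len(self.disk) > self.disk_capacity:
            self.disk.popitem(last=False)   # evict the least recently used file
        return data


store = TieredStore({"a": b"1", "b": b"2", "c": b"3"}, disk_capacity=2)
for name in ["a", "b", "a", "c", "a", "b"]:
    store.read(name)
print(store.tape_reads)   # number of reads that had to go to tape
```

Callers see a single namespace through `read`; only the miss counter reveals which requests went to the slow tier.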

BryantD•3w ago

Yeah, it was like this (except not petabytes).

EvanAnderson•3w ago

I presume backing-up the archive is a desirable thing. That's a place where I would see tape fitting well for them.

duskwuff•3w ago

Perhaps? But unless tape, and the infrastructure to support it, is dramatically cheaper than disk, they might still be better served by more disk - having two or more copies of data on disk means that both of them can service load, whereas a tape backup is only passively useful as a backup.

stonogo•3w ago

    unless tape, and the infrastructure to support it, is dramatically cheaper than disk,

This turns out to be the case, with the cost difference growing as the archive size scales. Once you hit petascale, it's not even close. However, most large-scale tape deployments also have disk involved, so it's usually not one or the other.

xk3•3w ago

You might squirm at using refurbished or used media but those 3TB SAS ex-enterprise disks are often the same price or cheaper than tapes themselves (excluding tape drive costs!). Will magnetic storage last 30 years? Probably not but they don't instantly demagnetize either. Both tape and offline magnetic platters benefit from ideal storage conditions.

EvanAnderson•3w ago

It's not just cost / media, though. Automated handling is a big advantage, too. At the scale where tape makes sense (north of 400TB in retention) I think the inconvenience of handling disks with similar aggregate capacity would be significant.

I guess slotting disks into a storage shelf is similar to loading a tape changer robot. I can't imagine the backplane slots on a disk array being rated at a significant lifetime number of insertions / removals.

stonogo•2w ago

If you're ok with individual storage units as small as 3TB, then we're talking about a different set of needs. At that scale, whatever you can lay hands on is probably fine. Used tape is also cheaper than new. IA is dealing with petascale, which is why I mentioned that the price difference widens with scale.

EvanAnderson•3w ago

A lot of people, me included, consider anything online not to be backup. Being disconnected and completely at-rest is a very desirable property.

Melatonic•3w ago

That's exactly what it is.

You also don't want your true backups online at all - that's the whole point.

BryantD•3w ago

Back in the day, if you loaded a page from the web archive that wasn’t in cache, it’d tell you to come back in a couple of minutes. If it was in cache, it was reasonably speedy.

Cache in this case was the hard drives. If I recall correctly, we were using SAM-FS, which worked fairly well for the purpose even though it was slow as dirt: we could effectively mount the tape drive on Solaris servers, and access the file system transparently.

Things have gotten better. I’m not sure if there were better affordable options in the late 1990s, though. I went from Alexa/IA to AltaVista, which solved the problem of storing web crawl data by being owned by DEC and installing dozens of refrigerator sized Alpha servers. Not an option open to Alexa/IA.

Melatonic•3w ago

Tape is almost always used for cold storage backups that are offline in case of ransomware attacks. Using it for on demand access would be insanely slow

hinkley•3w ago

We had a little server room where the AC was mounted directly over the rack. I don't think we ever put an umbrella in there but it sure made everyone nervous the drain pipe would clog.

Much more recently, I worked at a medium-large SaaS company but if you listened to my coworkers you'd think we were Google (there is a point where optimism starts being delusion, and a couple of my coworkers were past it.)

Then one day I found the telemetry pages for Wikipedia. I am hoping some of those charts were per hour not per second, otherwise they are dealing with mind numbing amounts of traffic.

brcmthrowaway•3w ago

Does IA do deduplication?

textfiles•3w ago

Not in the way I think you're talking about. The archive has always tried to maintain a situation where the racks could be pushed out of the door or picked up after being somewhere and the individual drives will contain complete versions of the items. We have definitely reached out to people who seem to be doing redundant work and asked them to stop or for permission to remove the redundant item. But that's a pretty curatorial process.

hedora•3w ago

It's frustrating that there's no way for people to (selectively) mirror the Internet Archive. $25-30M per year is a lot for a non-profit, but it's nothing for government agencies, or private corporations building Gen AI models.

I suspect having a few different teams competing (for funding) to provide mirrors would rapidly reduce the hardware cost too.

The density + power dissipation numbers quoted are extremely poor compared to enterprise storage. Hardware costs for the enterprise systems are also well below AWS (even assuming a short 5 year depreciation cycle on the enterprise boxes). Neither this article nor the vendors publish enough pricing information to do a thorough total cost of ownership analysis, but I can imagine someone the size of IA would not be paying normal margins to their vendors.

toomuchtodo•3w ago

Pick the items you want to mirror and seed them via their torrent file.

https://help.archive.org/help/archive-bittorrents/

https://github.com/jjjake/internetarchive

https://archive.org/services/docs/api/internetarchive/cli.ht...

u/stavros wrote a design doc for a system (codename "Elephant") that would scale this up: https://news.ycombinator.com/item?id=45559219

(no affiliation, I am just a rando; if you are a library, museum, or similar institution, ask IA to drop some racks at your colo for replication, and as always, don't forget to donate to IA when able to and be kind to their infrastructure)
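A small sketch of the "pick items and seed their torrents" step. It assumes the commonly seen URL pattern `https://archive.org/download/<identifier>/<identifier>_archive.torrent`; the function name and the identifiers below are hypothetical, and the pattern should be checked against the help page linked above before relying on it.

```python
from urllib.parse import quote


def item_torrent_url(identifier: str) -> str:
    """Build the URL where an item's auto-generated torrent is assumed to live."""
    safe = quote(identifier, safe="")
    return f"https://archive.org/download/{safe}/{safe}_archive.torrent"


# Hypothetical identifiers, for illustration only.
for item in ["example-item-one", "example-item-two"]:
    print(item_torrent_url(item))
```

The resulting URLs can be handed to any BitTorrent client that accepts a list of torrent URLs.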

billyhoffman•3w ago

There are real problems with the Torrent files for collections. They are automatically created when a collection is first created and uploaded, and so they only include the files of the initial upload. For very large collections (100+ GB) it is common for a creator to add/upload files into a collection in batches, but the torrent file is never regenerated, so download with the torrent results in just a small subset of the entire collection.

https://www.reddit.com/r/torrents/comments/vc0v08/question_a...

The solution is to use one of the several IA downloader scripts on GitHub, which download content via the collection's file list. I don't like directly downloading since I know that is more cost to IA, but torrents really aren't an option for some collections.

Turns out, there are a lot of 500GB-2TB collections for ROMs/ISOs for video game consoles through the 7th and 8th generation, available on the IA...
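One way to see the gap between a stale torrent and the full file list is to compare the two sets of names. The sketch below assumes the item metadata (for example from `https://archive.org/metadata/<identifier>`) has already been parsed into a dict with a `files` list whose entries carry `name` and `size`; the helper name and the sample data are made up for illustration.

```python
def missing_from_torrent(metadata: dict, torrent_names: set) -> list:
    """Return (name, size) for files in the item's file list but not in the torrent."""
    missing = []
    for entry in metadata.get("files", []):
        if entry["name"] not in torrent_names:
            missing.append((entry["name"], int(entry.get("size", 0))))
    return missing


# Hand-written sample shaped like a metadata response; not real data.
sample = {"files": [
    {"name": "disc1.iso", "size": "700"},
    {"name": "disc2.iso", "size": "650"},
    {"name": "disc3.iso", "size": "720"},
]}
in_torrent = {"disc1.iso"}            # pretend only the first batch made it into the torrent
gap = missing_from_torrent(sample, in_torrent)
gap_bytes = sum(size for _, size in gap)
print(gap, gap_bytes)
```

An empty result would suggest the torrent covers the whole item, so direct downloads could be avoided for it.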

Wowfunhappy•3w ago

Is this something the Internet Archive could fix? I would have expected the torrent to get replaced when an upload is changed, maybe with some kind of 24 hour debounce.

rincebrain•3w ago

"They're working on it." [1]

It sounds like they put a mechanism in place that stops incrementally regenerating large torrents after it caused massive slowdowns for them, and haven't finished building something to automatically fix it, but will go fix individual ones on demand for now.

[1] - https://www.reddit.com/r/theinternetarchive/comments/1ij8go9...

textfiles•3w ago

It is on my desk to fix this soon.

xk3•3w ago

Also, it would be good to regenerate the web seeds metadata (this doesn't change the info_hash section) when the mirrors (subdomain prefixes) change.

(like PHP code except it is binary data--it could be done on the fly)
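To illustrate why the web seed list can change without touching the info_hash: in BitTorrent v1 the info_hash is the SHA-1 of the bencoded `info` dictionary only, while web seeds sit in a top-level `url-list` key. The sketch below uses a minimal hand-rolled bencoder and a made-up torrent dict; the tracker and mirror hostnames are placeholders.

```python
import hashlib


def bencode(value) -> bytes:
    """Minimal bencoder for ints, bytes/str, lists and dicts."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys are byte strings and must be emitted in sorted order.
        items = sorted(
            (k.encode("utf-8") if isinstance(k, str) else k, v)
            for k, v in value.items()
        )
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


def info_hash(torrent: dict) -> str:
    """SHA-1 over the bencoded 'info' dict only; other top-level keys are ignored."""
    return hashlib.sha1(bencode(torrent["info"])).hexdigest()


# Made-up torrent for illustration.
torrent = {
    "announce": "http://tracker.example/announce",
    "info": {"name": "item", "piece length": 16384, "length": 5, "pieces": b"\x00" * 20},
    "url-list": ["https://mirror-a.example/"],
}
before = info_hash(torrent)
torrent["url-list"] = ["https://mirror-b.example/"]   # swap the web seed mirror
after = info_hash(torrent)
print(before == after)
```

Because only `info` is hashed, a server could rewrite `url-list` when serving the file and existing swarms would be unaffected.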

philipkglass•3w ago

I would like to be able to pull content out of the Wayback Machine with a proper API [1]. I'd even be willing to pay a combination of per-request and per-gigabyte fees to do it. But then I think about the Archive's special status as a non-profit library, and I'm not sure that offering paid API access (even just to cover costs) is compatible with the organization as it exists.

[1] It looks like this might exist at some level, e.g. https://github.com/hartator/wayback-machine-downloader, but I've been trying to use this for a couple of weeks and every day I try I get a HTTP 5xx error or "connection refused."

toomuchtodo•3w ago

https://github.com/internetarchive/wayback/tree/master/wayba...

https://akamhy.github.io/waybackpy/

https://wiki.archiveteam.org/index.php/Restoring

philipkglass•3w ago

Yes, there are documents and third party projects indicating that it has a free public API, but I haven't been able to get it to work. I presume that a paid API would have better availability and the possibility of support.

I just tried waybackpy and I'm getting errors with it too when I try to reproduce their basic demo operation:

  >>> from waybackpy import WaybackMachineSaveAPI
  >>> url = "https://nuclearweaponarchive.org"
  >>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
  >>> save_api = WaybackMachineSaveAPI(url, user_agent)
  >>> save_api.save()
  Traceback (most recent call last):
    File "<python-input-4>", line 1, in <module>
      save_api.save()
      ~~~~~~~~~~~~~^^
    File "/Users/xxx/nuclearweapons-archive/venv/lib/python3.13/site-packages/waybackpy/save_api.py", line 210, in save
      self.get_save_request_headers()
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
    File "/Users/xxx/nuclearweapons-archive/venv/lib/python3.13/site-packages/waybackpy/save_api.py", line 99, in get_save_request_headers
      raise TooManyRequestsError(
      ...<4 lines>...
      )
  waybackpy.exceptions.TooManyRequestsError: Can not save 'https://nuclearweaponarchive.org'. Save request refused by the server. Save Page Now limits saving 15 URLs per minutes. Try waiting for 5 minutes and then try again.

toomuchtodo•3w ago

Reach out to patron services, support @ archive dot org. Also, your API limits will be higher if you specify your API key from your IA user versus anonymous requests when making requests.
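For scripts that hit the limit quoted in the traceback (15 URLs per minute), a client-side throttle helps regardless of which library makes the request. This is a generic sliding-window sketch; the class name is hypothetical, it is not part of waybackpy, and the limit for authenticated users may differ.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `limit` recorded calls in any `window` seconds."""

    def __init__(self, limit=15, window=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock            # injectable so the demo needs no real sleeping
        self.stamps = deque()         # timestamps of recent calls, oldest first

    def wait_time(self) -> float:
        """Seconds to wait before the next call is allowed (0.0 if allowed now)."""
        now = self.clock()
        while self.stamps and now - self.stamps[0] >= self.window:
            self.stamps.popleft()     # forget calls that left the window
        if len(self.stamps) < self.limit:
            return 0.0
        return self.window - (now - self.stamps[0])

    def record(self):
        """Note that a call was just made."""
        self.stamps.append(self.clock())


fake_now = [0.0]                      # fake clock for the demonstration
limiter = SlidingWindowLimiter(limit=15, window=60.0, clock=lambda: fake_now[0])
for _ in range(15):
    limiter.record()
blocked_for = limiter.wait_time()     # sixteenth call would have to wait
fake_now[0] = 60.0
allowed_after = limiter.wait_time()   # window has passed
print(blocked_for, allowed_after)
```

In real use, call `time.sleep(limiter.wait_time())` before each save request and `limiter.record()` after it.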

986aignan•3w ago

I wish there were some kind of file search for the Wayback Machine. Like "list all .S3M files on members.aol.com before 1998". It would've made looking for obscure nostalgia much easier.
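Something close to this may be possible with the Wayback CDX server, which accepts prefix matching, date bounds and a regex filter on capture fields. The sketch below only builds the query URL; the helper name is hypothetical, and whether such a query returns useful results for a host this large (or how `to` handles a bare year) is an untested assumption.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(host_prefix: str, extension: str, before_year: int, limit: int = 50) -> str:
    """Build a CDX query for captures under a prefix whose URL ends in an extension."""
    params = [
        ("url", host_prefix),
        ("matchType", "prefix"),                      # everything under the prefix
        ("to", str(before_year - 1)),                 # assumed inclusive upper bound
        ("filter", rf"original:(?i).*\.{extension}"), # case-insensitive extension match
        ("output", "json"),
        ("limit", str(limit)),
    ]
    return CDX_ENDPOINT + "?" + urlencode(params)


url = cdx_query_url("members.aol.com/", "s3m", before_year=1998)
print(url)
```

The same function covers other extensions or hosts by changing the arguments; pagination and rate limits are left out.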

nodja•3w ago

It's insane to me that in 2008 a bunch of pervs decentralized storage and made hentai@home to host hentai comics. Yet here we are almost 20 years later and we haven't generalized this solution. Yes I'm aware of the privacy issues h@h has (as a hoster you're exposing your real IP and people reading comics are exposing their IP to you) but those can be solved with tunnels, the real value is the redundant storage.

vetrom•3w ago

The illegal side of hosting, sharing, and mirroring technology, as it were, is much more free to chase technical excellence at all costs.

There are lessons to be learned in that. For example, for that population, bandwidth efficiency and information leakage control invite solutions that are suboptimal for an organization that would build market share on licensing deals and growth maximization.

Without an overriding commercial growth directive you also align development incentives differently.

fuzzer371•3w ago

> a bunch of pervs

Not everyone who watches hentai is a perv

f33d5173•3w ago

Just don't look up what the word "hentai" means ;)

LoganDark•3w ago

At least hentai isn't necessarily lolisho (although a lot of it is...)

account42•3w ago

Yeah sure, they are just into 9000 year old dragons...

gosub100•3w ago

I was hopeful a few years ago when I heard of chia coin, that it would allow distributed internet storage for a price.

Users upload their encrypted data to miners, along with a negotiated fee for a duration of storage, say 90d. They take specific hashes of the complete data, and some randomized sub hashes, of internal chunks. Periodically an agent requests these chunks, hashes and rewards a fraction of the payment if the hash is correct.

That's a basic sketch, more details would have to be settled. But "miners" would be free to delete data if payment was no longer available on a chain. Or additionally, they could be paid by downloaders instead of uploaders for hoarding more obscure chunks that aren't widely available.
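The challenge-and-reward loop described above could look roughly like this. It is only a sketch under simplifying assumptions: all function names are hypothetical, the auditor reads the miner's stored bytes directly instead of exchanging messages, each challenge is single-use once its nonce is revealed, and nothing here reflects how Chia actually works.

```python
import hashlib
import random


def chunk_hash(data: bytes, offset: int, length: int, nonce: bytes) -> str:
    """Hash one slice of the data, salted with a per-challenge nonce."""
    return hashlib.sha256(nonce + data[offset:offset + length]).hexdigest()


def make_challenges(data: bytes, count: int, length: int, seed: int) -> list:
    """Uploader side: precompute (offset, nonce, expected_hash) before upload."""
    rng = random.Random(seed)   # reproducible for the demo; a real scheme needs a CSPRNG
    challenges = []
    for _ in range(count):
        offset = rng.randrange(len(data) - length + 1)
        nonce = rng.getrandbits(64).to_bytes(8, "big")
        challenges.append((offset, nonce, chunk_hash(data, offset, length, nonce)))
    return challenges


def audit(challenges: list, stored: bytes, length: int, fee: int) -> int:
    """Auditor side: pay `fee` for every challenge answered with the right hash."""
    paid = 0
    for offset, nonce, expected in challenges:
        if chunk_hash(stored, offset, length, nonce) == expected:
            paid += fee
    return paid


data = bytes(range(256)) * 64                        # 16 KiB of sample "encrypted" data
challenges = make_challenges(data, count=20, length=32, seed=1)
honest_paid = audit(challenges, data, length=32, fee=5)
wiped_paid = audit(challenges, bytes(len(data)), length=32, fee=5)   # miner kept only zeros
print(honest_paid, wiped_paid)
```

The uploader keeps only the small challenge table, not the data, which is what makes the periodic checks cheap.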

qingcharles•3w ago

The fact AI companies are strip mining IA for content and not helping to be part of the solution is egregious.

astrange•3w ago

Has any evidence been provided for this fact?

textfiles•3w ago

They absolutely are.

razakel•3w ago

Of course they are. Had to block anything at work coming from one certain company because it wasn't respecting robots.txt and the bill was just getting silly.
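For reference, checking robots.txt before fetching takes only a few lines with Python's standard `urllib.robotparser`. The rules and bot names below are invented for illustration and are not the policy of any real site.

```python
from urllib.robotparser import RobotFileParser

# Invented rules: one named crawler is banned, everyone else avoids /private/.
rules = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

banned_bot = parser.can_fetch("ExampleBot", "https://example.org/page")
other_public = parser.can_fetch("OtherBot", "https://example.org/page")
other_private = parser.can_fetch("OtherBot", "https://example.org/private/x")
print(banned_bot, other_public, other_private)
```

A well-behaved crawler would call `can_fetch` with its own user agent before every request and skip anything that returns `False`.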

stonogo•3w ago

Yes.

Gormo•3w ago

How is it "egregious" that people are obtaining content to use for their own purposes from a resource intentionally established as a repository of content for people to obtain and use for their own purposes?

Intralexical•3w ago

Because nobody who opens a public library does so intending, nor consenting, for random companies to jam the entrance trying to cart off thousands of books solely to use for their own enrichment.

https://xkcd.com/1499/

renewiltord•3w ago

No one is carting off anything. You still have the original. If you could make copies of books for almost no cost why would you hoard them?

dxdm•3w ago

Requests to physical servers over physical media are not free. Someone needs to pay for providing and maintaining the infrastructure etc etc. Finite resources are still getting used up by people not paying for them. That's what this thread and the analogy are about.

jasonwatkinspdx•3w ago

Years and years ago I shared a cubicle with a woman named Tracy. A couple times a month Tracy would get lunch at the Mongolian BBQ place down the road (all you can eat stir fry that has nothing to do with Mongolian food for anyone unfamiliar).

Anyhow, Tracy would put a gallon sized ziplock bag into her purse, and at the restaurant shovel half a dozen plates worth of food into it. Then she'd work the afternoon eating out of her purse like it's a bowl, just sitting there on the desk.

sailfast•3w ago

Might be easier for them to just pay for the mirrors and do an on-site copy and move the data in a container?

That way they would provide some more value back to the community as a mirror?

hinkley•3w ago

I'd like a Public Broadcasting Service for the Internet but I'm afraid that money would just be pulled from actual PBS at this point to support it.

xp84•3w ago

Too late, PBS is already defunded. CPB was deleted. PBS is now an indie organization without a dime of public money. They should probably rebrand and lose the word “Public”

quux•3w ago

Is running an IPFS node and pinning the internet archive's collections a good way to do this?

skywhopper•3w ago

Don’t put any stock into the numbers in the article. They are mostly made up out of thin air.

Gormo•3w ago

> $25-30M per year is a lot for a non-profit

$25 million a year is not remotely a lot for a non-profit doing any kind of work at scale. Wikimedia's budget is about seven times that. My local Goodwill chapter has an annual budget greater than that.

Medium_Taco•3w ago

You're being purposefully obtuse. Most non-profits don't function at scale (neither do they do best at scale). They serve their local community

esseph•3w ago

You have an extremely skewed view of the average nonprofit

anonnon•3w ago

Facilitating mirroring seems like it would open up another can of liability worms for the IA, as well as, potentially, for those mirroring it. For example, they recently lost an appeal of a major lawsuit brought by book publishers. And then there's the Wayback Machine itself; who knows what they've hoovered up from the public internet over the years? Would you be comfortable mirroring that?

traceroute66•3w ago

> $25-30M per year is a lot for a non-profit

First, whether IA or any other large non-profit/charity. When you are in the double-digit/triple-digit multi-million bracket, you are no longer a non-profit/charity. You are in effect a business with a non-profit status.

Whether IA or any other large entity, when you get to that size, you don't benefit from the "oh they are a poor non-profit" mindset IMHO.

To be able to spend $25-30M a year, you clearly have to have a solid revenue stream both immediate and in the pipeline, that's Finances 101. Therefore you are in a privileged and enviable position that small non-profits can only dream of.

Second, I would be curious to know how much of that is of their own doing.

By that I mean, it's sure cute to be located in the former Christian Science church on Funston Avenue in San Francisco’s Richmond District.

But they could most likely save a lot of money if they were located in a carrier-neutral facility.

For example, instead of paying for expensive external fiber lines (no doubt multiple, due to redundancy), they would have large amounts of capacity available through simple cross-connects.

Similar on energy. Are they benefiting from the same economies of scale that a carrier-neutral facility does ?

I am not saying the way they are doing it is wrong. I'm just genuinely curious to know what premium they are paying for doing it like they are.

leoc•3w ago

Probably the advantages of its location outweigh the extra costs for the IA. Having your datacentre sited on land and in a building you own, behind a non-shared front door, has legal advantages similar to the ones which drive organisations to keep their data centres on-premises. A distinctive location in a nice area of San Francisco probably helps to keep cultivating the goodwill of the SV tech industry and of local and state politicians. It's also an advantage to be within easy walking distance in a neighbourhood where people like the IA and would be inclined to go there and protest if government forces rolled up and started pushing their way inside. To be sure, I presume that 300 Funston Ave. also being a very pleasant workplace for senior IA people has something to do with why the Archive moved there and remains there; but remaining there seems justifiable for other reasons.

textfiles•3w ago

This seems like a lot of zesty made-up assumptions.

And a lot of non-profits would be very very surprised to hear that once you cross the threshold of $9,999,999 costs, you are a business.

traceroute66•3w ago

> This seems like a lot of zesty made-up assumptions.

Nope.

The second half of my post, anyone who has been seriously involved with large carrier-neutral facilities will likely agree with me.

It is a fact that IA will be incurring a premium to DIY and as I quite clearly spelt out, I am NOT trying to say they are wrong, I am just genuinely curious as to what the premium they are paying is.

Regarding my comment about large non-profits. This is from personal experience. Once they get to a certain size, non-profits do switch to a business mentality. You might not like that fact, but it is a fact. They will more often than not have management boards who are "competitively remunerated". They will almost always actively manage their spare cash (of which they will have a large surplus) in investment portfolios. Things will be budgeted and cost-centered just like in larger businesses. They will have in-house legal teams or external teams on retainer to write up philanthropic contracts and aggressively chase after donations people leave them in wills. etc. etc. etc. etc.

You absolutely cannot place a large non-profit in the same mindset as your local community mom & pop non-profit that operates hand to mouth on a shoestring.

That is why I discourage people donating to large non-profits. You might feel good donating $100. But in reality it's a sum that wouldn't even be a rounding-error on their financial reports. And in the majority of cases most of your donation is more likely to contribute to management expenses than the actual cause.

Large non-profits are more interested in large corporate philanthropic donations, preferably multi-year agreements. They have more than enough money for the immediate future (<=12–18 months), they want large chunks of future money in the pipeline and that's what the large philanthropic agreements give them.

textfiles•2w ago

The assumptions are still pretty zesty, now you just made them longer.

lazylizard•2w ago

i dunno why i keep imagining something like ipfs could help something like the internet archive...

cowhax•3w ago

>And the rising popularity of generative AI adds yet another unpredictable dimension to the future survival of the public domain archive.

I'd say the nonprofit has found itself a profitable reason for its existence

schmuckonwheels•3w ago

Disappointed with the lack of pictures.

parttimelarry•3w ago

Probably because this looks more like a Deep Research agent "delving" into the infrastructure -- with a giant list of sources at the end. The Archive is not just a library; it is a service provider.

schmuckonwheels•3w ago

I wasn't expecting to read a podcast when clicking.

textfiles•3w ago

What do you want some pictures of?

schmuckonwheels•3w ago

An article about "infrastructure" that opens up with a dramatic description of a datacenter stuffed into an old church, I would expect more than just generic clipart you'd see in the back half of Wired magazine.

textfiles•3w ago

Here's some photos I took a long time ago.

https://www.flickr.com/photos/textfiles/albums/7215763372220...

Tempest1981•3w ago

Thanks! The church attendees (employees?) have a Severance Kier vibe... although I'm guessing the TV show came much later.

darkwater•3w ago

That's super cool! Can the IA building be accessed by some random people like myself? Next time I'm in SF (who knows when that will be though) I'd very much like visiting it!

textfiles•3w ago

Fridays at about 1pm, we give tours.

schmuckonwheels•3w ago

That's great. Ask and ye shall receive.

What's most surprising is churches notoriously have really sketchy electrical. There had to be some renovation in that regard, right?

fc417fc802•3w ago

No regular residential building is set up to host a datacenter off the bat. Even racking more than half a dozen boxes in a given room requires an upgrade.

Most rooms in North America won't be wired for anything over 2.5 kW by default (kitchens and laundry rooms being obvious exceptions).

An electric dryer might pull 5 kW. An electric range ballpark 10 kW. Versus 15 kW per full rack for a fairly tame setup.

And then you've got the problem of dissipating all that heat.
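A quick back-of-the-envelope for the figures above, taking the 2.5 kW per room and 15 kW per rack numbers from this comment as given. The conversion constants are standard (1 BTU/h is about 0.293 W, one ton of cooling is 12,000 BTU/h); the function names are just for illustration.

```python
import math

WATTS_PER_BTU_HR = 0.29307107    # one BTU per hour expressed in watts
BTU_HR_PER_TON = 12_000          # one "ton" of air-conditioning capacity


def circuits_needed(load_kw: float, circuit_kw: float = 2.5) -> int:
    """How many room-sized circuits it takes to feed a given load."""
    return math.ceil(load_kw / circuit_kw)


def cooling_load(load_kw: float):
    """Heat to remove, as (BTU/h, tons), assuming all power ends up as heat."""
    btu_hr = load_kw * 1000 / WATTS_PER_BTU_HR
    return btu_hr, btu_hr / BTU_HR_PER_TON


rack_kw = 15.0
rooms = circuits_needed(rack_kw)
btu_hr, tons = cooling_load(rack_kw)
print(rooms, round(btu_hr), round(tons, 2))
```

So one fairly tame rack draws about as much as six ordinary rooms are wired for, and needs a little over four tons of cooling on top of that.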

textfiles•3w ago

There was a lot of renovation. One day they fired up the pipe organ (which still works) inside the building as well as the servers and the transformer for the street blew up. That was a legendary day.

mcpar-land•3w ago

Is this some kind of copypasted AI output? There are unformatted footnote numbers at the end of many sentences.

NetOpWibby•3w ago

I was thinking the same thing. No proofreading is a sure sign to me. I also feel like I've read parts of this before.

sltkr•3w ago

Some of the images are AI generated (see the Gemini watermark in the bottom right), and the final paragraph also reads extremely AI-generated.

dvrp•3w ago

Maybe, but I was trying to find the original source of this article and couldn’t, at least not cursorily.

ramon156•3w ago

I already stopped when I saw the AI-gen image

eiiot•2w ago

The table also seems like the kind of thing that Gemini seems to generate a lot. "Here's a table that communicates almost no information! One of the rows is constant for each item."

lysace•3w ago

The IA needs perhaps not just more money, but also more talented people, IMO. I worry that it has stagnated, from a tech pov.

mixologic•3w ago

They can offer a perk that literally no other tech job can offer: Someday have a statue of your likeness preserved in ceramic: https://www.atlasobscura.com/places/internet-archive-headqua...

"Inside the church's main room, with its still-intact pews, there are more than 120 ceramic sculptures of the Internet Archive's current and former employees, created by artist Nuala Creed and inspired by the statues of the Xian warriors in China."

textfiles•3w ago

We've hired a few dozen people over the past couple of years. We think they're pretty talented.

lysace•3w ago

Is retrieval from the wayback machine intentionally made slow?

textfiles•3w ago

Show me the faster wayback machine we are competing against.

brokensegue•3w ago

i'm a big fan of IA and wayback machine. i donate. but i do wish it were faster. i understand that would cost a lot more though.

i wonder if maybe donors above a certain level could get priority on archiving pages or something.

lysace•3w ago

Do you really think that is a good argument against the perception of technical stagnation?

pizza•3w ago

That sounds really entitled.

textfiles•3w ago

We've had showdowns with lawyers, governments, hackers and spammers, but I'm not sure how we'll stand up against perception.

rarisma•3w ago

I think this was written wholly by deep research.

It just reads like a clunky low quality article

astrange•3w ago

It's clearly AI writing ("hum", "delve") but oddly I don't think deep research models use those words.

joemi•3w ago

I think relying on the vocabulary to indicate AI is pointless (unless they're actually using words that AI made up). There's a reason they use words such as those you've pointed out: because they're words, and their training material (a.k.a. output by humans) uses them.

astrange•3w ago

No American used "delve" before ChatGPT 3.5, and nobody outside fanfiction uses the metaphors it does (which are always about "secrets" "quiet" "humming" "whispers" etc). It's really very noticeable.

https://www.nytimes.com/2025/12/03/magazine/chatbot-writing-...

pests•3w ago

But now Americans do use "delve" since 3.5. So what? No Americans used "cromulent" as a word either until Simpsons invented it. Is it not a real word? Does using it mean the Simpsons wrote it?

ashtonshears•3w ago

I bet the llm is biased towards the mtg card delver of secrets

joemi•3w ago

The link you posted doesn't back up the statement that "No American used "delve" before ChatGPT 3.5". Instead it states that _few_ people used it in _biomedical papers_. I've seen it (and metaphors using the other words you noted) used in fiction for my entire life, and I sure as hell predate chatgpt. This is why it's a bad idea to consider every use of particular words to be AI generated. There are always some people who have larger vocabularies than others and use more words, including words some people have deemed giveaways of AI use.

That said, their use may raise suspicion of AI, but they are _not_ proof of AI. I don't want to live in a world where people with large vocabularies are not taken seriously. Such an anti-intellectual stance is extremely dangerous.

astrange•3w ago

I've been reading deep research results every day for months now and I promise I know what AI writing style looks like.

It has nothing to do with "large vocabularies". I know who the people with large vocabularies were that originally caused the delving thing too, and they weren't American. (Mostly they were Nigerian.) I'm confused what you think specific kinds of metaphors involving sounds have to do with large vocabularies though.

> I've seen it (and metaphors using the other words you noted) used in fiction for my entire life

And the point is that this article isn't fiction. Or not supposed to be anyway.

joemi•3w ago

People with large vocabularies tend to be heavy readers, and therefore experiencing these words and metaphors more than people with smaller vocabularies. I think there's a direct link between people attempting to use certain words as proof of AI and the fact the younger generations aren't reading as much as older generations.

https://www.nytimes.com/2025/12/12/us/high-school-english-te...

Somewhat contradictory, I don't think you can ignore fiction when discussing technical writing, since technical writing (especially online) has become far more casual (and influenced by conversation, pop culture, and yes, even fiction) than it ever was before. So while as I noted above, younger people are reading less these days, people are also less strict about how formal technical writing needs to be, so they may very well include words and expressions not commonly seen in that style of writing in the past.

I'm not arguing that these things can't be indicators of AI generation. I'm just arguing that they can't be proof of AI generation. And that argument only gets stronger as time goes on and more people are (sadly) influenced by things AI have generated.

bpiche•3w ago

IA is hosting a couple more of Rick Prelinger’s shows this month. Looking forward to visiting

ghm2199•3w ago

Does anyone know how the size of this compares to archive.today?

textfiles•3w ago

We absolutely lap them with many, many more petabytes of material. But archive.today is also not doing speculative or multiple scheduled captures of the amount of sites that archive.org is.

vladiim•3w ago

How long will it take for them to send the PetaBox to space?

textfiles•3w ago

That project gets discussed every once in a while.

semiquaver•3w ago

This article is way too LLMey for my taste.

alfgrimur•3w ago

I love to imagine this is all a cover and the Internet Archive is located in a remote cave in northern Sweden and consists of a series of endlessly self replicating flash drives powered by the sun.

segalord•3w ago

this is every data hoarder's dream setup haha

jarboot•3w ago

Hate to be the guy in the comments complaining about the css, but the sides of the text of this article are cut off. It looks like I'm zoomed in, and there's no way I can see the first few columns of the text without going to Reader view. I'm on a modern iPhone using Safari, accessibility settings font larger than usual.

shmeeed•3w ago

FWIW, it's the same for me on FF Android.

nandomrumber•3w ago

Same for me, Safari iOS 18.7.1 no accessibility font size set, no browsers font size set.

textfiles•3w ago

It's an AI-generated article. It's going to be pretty terrible.

initialg•3w ago

Is it still year 2006 and websites haven’t figured out responsive design?

fedeb95•3w ago

Thanks for this, I've always wondered how the Archive operates but always ended up not searching.

ThinkBeat•3w ago

Wow that piece of real-estate has to cost a bundle.

bilater•3w ago

I have always wondered how archives manage to capture screenshots of paywalled pages like the New York Times or the Wall Street Journal. Do they have agreements with publishers, do their crawlers have special privileges to bypass detection, or do they use technology so advanced that companies cannot detect them?
