frontpage.

Creating and Hosting a Static Website on Cloudflare for Free

https://benjaminsmallwood.com/blog/creating-and-hosting-a-static-website-on-cloudflare-for-free/
1•bensmallwood•27s ago•1 comments

"The Stanford scam proves America is becoming a nation of grifters"

https://www.thetimes.com/us/news-today/article/students-stanford-grifters-ivy-league-w2g5z768z
1•cwwc•4m ago•0 comments

Elon Musk on Space GPUs, AI, Optimus, and His Manufacturing Method

https://cheekypint.substack.com/p/elon-musk-on-space-gpus-ai-optimus
2•simonebrunozzi•13m ago•0 comments

X (Twitter) is back with a new X API Pay-Per-Use model

https://developer.x.com/
2•eeko_systems•20m ago•0 comments

Zlob.h 100% POSIX and glibc compatible globbing lib that is faster and better

https://github.com/dmtrKovalenko/zlob
1•neogoose•23m ago•1 comments

Show HN: Deterministic signal triangulation using a fixed .72% variance constant

https://github.com/mabrucker85-prog/Project_Lance_Core
1•mav5431•24m ago•1 comments

Scientists Discover Levitating Time Crystals You Can Hold, Defy Newton’s 3rd Law

https://phys.org/news/2026-02-scientists-levitating-crystals.html
2•sizzle•24m ago•0 comments

When Michelangelo Met Titian

https://www.wsj.com/arts-culture/books/michelangelo-titian-review-the-renaissances-odd-couple-e34...
1•keiferski•25m ago•0 comments

Solving NYT Pips with DLX

https://github.com/DonoG/NYTPips4Processing
1•impossiblecode•25m ago•1 comments

Baldur's Gate to be turned into TV series – without the game's developers

https://www.bbc.com/news/articles/c24g457y534o
2•vunderba•25m ago•0 comments

Interview with 'Just use a VPS' bro (OpenClaw version) [video]

https://www.youtube.com/watch?v=40SnEd1RWUU
1•dangtony98•31m ago•0 comments

EchoJEPA: Latent Predictive Foundation Model for Echocardiography

https://github.com/bowang-lab/EchoJEPA
1•euvin•39m ago•0 comments

Disabling Go Telemetry

https://go.dev/doc/telemetry
1•1vuio0pswjnm7•40m ago•0 comments

Effective Nihilism

https://www.effectivenihilism.org/
1•abetusk•44m ago•1 comments

The UK government didn't want you to see this report on ecosystem collapse

https://www.theguardian.com/commentisfree/2026/jan/27/uk-government-report-ecosystem-collapse-foi...
3•pabs3•46m ago•0 comments

No 10 blocks report on impact of rainforest collapse on food prices

https://www.thetimes.com/uk/environment/article/no-10-blocks-report-on-impact-of-rainforest-colla...
2•pabs3•46m ago•0 comments

Seedance 2.0 Is Coming

https://seedance-2.app/
1•Jenny249•47m ago•0 comments

Show HN: Fitspire – a simple 5-minute workout app for busy people (iOS)

https://apps.apple.com/us/app/fitspire-5-minute-workout/id6758784938
1•devavinoth12•48m ago•0 comments

Dexterous robotic hands: 2009 – 2014 – 2025

https://old.reddit.com/r/robotics/comments/1qp7z15/dexterous_robotic_hands_2009_2014_2025/
1•gmays•52m ago•0 comments

Interop 2025: A Year of Convergence

https://webkit.org/blog/17808/interop-2025-review/
1•ksec•1h ago•1 comments

JobArena – Human Intuition vs. Artificial Intelligence

https://www.jobarena.ai/
1•84634E1A607A•1h ago•0 comments

Concept Artists Say Generative AI References Only Make Their Jobs Harder

https://thisweekinvideogames.com/feature/concept-artists-in-games-say-generative-ai-references-on...
1•KittenInABox•1h ago•0 comments

Show HN: PaySentry – Open-source control plane for AI agent payments

https://github.com/mkmkkkkk/paysentry
2•mkyang•1h ago•0 comments

Show HN: Moli P2P – An ephemeral, serverless image gallery (Rust and WebRTC)

https://moli-green.is/
2•ShinyaKoyano•1h ago•1 comments

The Crumbling Workflow Moat: Aggregation Theory's Final Chapter

https://twitter.com/nicbstme/status/2019149771706102022
1•SubiculumCode•1h ago•0 comments

Pax Historia – User and AI powered gaming platform

https://www.ycombinator.com/launches/PMu-pax-historia-user-ai-powered-gaming-platform
2•Osiris30•1h ago•0 comments

Show HN: I built a RAG engine to search Singaporean laws

https://github.com/adityaprasad-sudo/Explore-Singapore
3•ambitious_potat•1h ago•4 comments

Scams, Fraud, and Fake Apps: How to Protect Your Money in a Mobile-First Economy

https://blog.afrowallet.co/en_GB/tiers-app/scams-fraud-and-fake-apps-in-africa
1•jonatask•1h ago•0 comments

Porting Doom to My WebAssembly VM

https://irreducible.io/blog/porting-doom-to-wasm/
2•irreducible•1h ago•0 comments

Cognitive Style and Visual Attention in Multimodal Museum Exhibitions

https://www.mdpi.com/2075-5309/15/16/2968
1•rbanffy•1h ago•0 comments

1 Trillion Web Pages Archived

https://blog.archive.org/trillion/
703•pabs3•4mo ago

Comments

typpilol•4mo ago
I thought this was going to be a technical article but there was nothing in it
ehsanu1•4mo ago
Seeing some stats would be fun. I wonder what the amount of data is here. And the distribution would be interesting too, especially since some pages are archived at multiple points in time, and pages have been getting heavier these days.
ChrisArchitect•4mo ago
Related blog post inviting stories:

https://blog.archive.org/2025/09/23/celebrating-1-trillion-w...

arjie•4mo ago
Something I wish we could have is some kind of peer mirror of archive.org. The main IA web application gets angry pretty quickly if you're trying to click through a few different dates. If there were some way to slowly mirror (torrent-style) and offer pages as a peer of archive.org, that would be neat. It would be cool to show up as an alternative source for the data, and the archive.org app could fetch it from there at the user's choice and validate the checksum if required.

In the end I've just kept my own ArchiveBox, and it's an all right experience, but it's only useful for things I knew I wanted to archive. For almost everything else I go to the IA, which has so much.

zapataband2•4mo ago
Is there such a thing as "versioned" torrents? Assuming you have the right PGP key, you could mix BitTorrent and packaging systems to get an updatable distribution.
hsbauauvhabzb•4mo ago
A Torrent would probably suffocate under the small file distribution. I’m not sure how the romset torrents work but I thought they were versioned.

But torrent is probably the wrong tech. I’m sure there would be many players willing to host a few TB or more each, which could be fronted via something so it’s transparent to the user.

But a better option might be a subscription model; anything else will be slammed by crawlers.

pabs3•4mo ago
Couple of BEPs related to updating torrents:

https://www.bittorrent.org/beps/bep_0039.html https://www.bittorrent.org/beps/bep_0046.html

throawayonthe•4mo ago
There is the BitTorrent v2 standard: https://blog.libtorrent.org/2020/09/bittorrent-v2/

But unfortunately most FOSS torrent clients do not support it, partly because at release libtorrent 2.0.x had poor I/O performance in some cases, so torrent clients reverted to the 1.2.x branch.

pronoiac•4mo ago
I think SciOp is doing something in that area, with a catalog site and webseeds. https://sciop.net/
renegat0x0•4mo ago
- I can confirm that the web archive can be really slow

- I think I have seen that AI scrapers create a bottleneck in the bandwidth

- For some digital archives you need to create a scientific account (I think Common Crawl works like that)

- The data can easily get very big. The goal is to store many things: not just the Internet, but the Internet with the additional dimension of time

- Since there is a lot of data, it is difficult to navigate and search, so it can easily become unusable

- That is why, for example, I created my own metadata database of links; I needed some information about domains

Link:

https://github.com/rumca-js/Internet-Places-Database

pronoiac•4mo ago
The Archive Team - not part of the Internet Archive - worked on a distributed backup of a portion of the Internet Archive - https://wiki.archiveteam.org/index.php/INTERNETARCHIVE.BAK

It's been dormant / on hiatus for a few years now.

smallerize•4mo ago
That can only cover other collections though, because the WARC files from the Wayback Machine web scrapes are not public.
giancarlostoro•4mo ago
I do wonder why IA does not maintain an IPFS instance, or if they do, why it's not more popular. There are tons of IPFS mirror services out there that operate at reasonable speeds. One issue I've run into with IA is websites old enough that there's JS or CSS that just won't render; what I'm not sure about is whether we can retroactively fix such things. It would be nice to be able to un-ruin the code somehow, if they exported everything possible at the time.

Edit:

It would be really neat if you could click on a domain while on IA and have a desktop client download as many WARC files as you're interested in through a lower-priority download queue, higher-priority pages first, and then view it all fully offline.
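(As a rough sketch of the offline half of that idea: once you have WARC files downloaded from an IA item, a small script can enumerate what's inside them. The warcio library and filename here are just illustrative choices, not anything IA prescribes.)

    # pip install warcio
    # List the archived URLs inside a downloaded WARC file.
    from warcio.archiveiterator import ArchiveIterator

    with open("example-crawl.warc.gz", "rb") as stream:  # hypothetical filename
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":
                uri = record.rec_headers.get_header("WARC-Target-URI")
                body = record.content_stream().read()
                print(uri, len(body), "bytes")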

stavros•4mo ago
Because nobody pins on IPFS. It's basically http with extra steps, at this point.
TechSquidTV•4mo ago
They do torrents. I was looking into this recently as well, considering building an Activity Pub alternative to IA. I came to what I assume is the same conclusion that IA came to.

No one uses IPFS. For the average user, it is significantly more difficult to get started. For the experienced user, the ecosystem of tools around IPFS is extremely small.

All in all, IPFS offers very little benefit over torrents in practice and has a much smaller user pool.

outside1234•4mo ago
IPFS is a great idea poorly executed. Content addressable storage is a great idea, but it is so difficult to use in practice for real world scaled scenarios (larger than one hard disk drive).
kevincox•4mo ago
The problem with the torrents is that they get updated when the files change (sometimes just small metadata changes), and then your seeders can't be found. Maybe if they also kept a list of old hashes you could at least manually try to recover data from the older torrent?
Lammy•4mo ago
This is outdated information. These issues have been solved by various BitTorrent Enhancement Proposals. You do create a new torrent, but you distribute it in a way that to a swarm member is functionally equivalent to updating an old torrent. Check out BEP-0039 and BEP-0046 which respectively cover the HTTP and DHT mechanisms for updating torrents:

https://www.bittorrent.org/beps/bep_0039.html

https://www.bittorrent.org/beps/bep_0046.html

If that updated torrent is a BEP-0052 (v2) torrent it will hash per-file, and so the updated v2 torrent will have identical hashes for files which aren't changed: https://www.bittorrent.org/beps/bep_0052.html

This combines with BEP-0038 so the updated torrent can refer to the infohash of the older torrents with which it shares files, so if you already have an old one you only have to download files that have changed: https://www.bittorrent.org/beps/bep_0038.html
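(A minimal sketch of the per-file hashing idea behind BEP-0052: each file is hashed in 16 KiB blocks that are combined into a Merkle root, so an unchanged file yields an identical hash in the updated torrent. The real spec pads the tree and builds piece layers; this simplified version only illustrates the "same bytes, same hash" property.)

    import hashlib

    BLOCK = 16 * 1024  # BitTorrent v2 leaf block size

    def file_merkle_root(path: str) -> bytes:
        # Hash the file in 16 KiB blocks, then reduce pairwise to a single root.
        leaves = []
        with open(path, "rb") as f:
            while chunk := f.read(BLOCK):
                leaves.append(hashlib.sha256(chunk).digest())
        while len(leaves) > 1:
            paired = []
            for i in range(0, len(leaves) - 1, 2):
                paired.append(hashlib.sha256(leaves[i] + leaves[i + 1]).digest())
            if len(leaves) % 2:
                paired.append(leaves[-1])  # odd leaf promoted as-is (simplification)
            leaves = paired
        return leaves[0] if leaves else hashlib.sha256(b"").digest()

    # Same bytes on disk => same root, whether the file sits in an old or new torrent.
    print(file_merkle_root("unchanged.dat").hex())  # hypothetical file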

NoMoreNicksLeft•4mo ago
Have any of these even started to be implemented in any client/library? It's been years.
komali2•4mo ago
I spent a bit of time trying to find it just now but I swear I read a super long blog or comment or something by someone at archive.org where they concluded essentially that IPFS just "isn't ready" or wasn't feasible for their needs because it's super slow and they didn't see how that couldn't be the case when they consider the volume of transactions they need to do (they didn't see an optimization path).

I wish I could find that article!

edit: https://github.com/internetarchive/dweb-archive/blob/master/...

stavros•4mo ago
I have a design for a system where you can "donate" your disk space to a provider. Basically, you run the client, you say you want to make 1TB available to archive.org, and their server can push the rarest content to your computer.

It's based on torrents, and you can easily make a content delivery system on top of this (so people can fetch data from this network).

I emailed a few archiving teams but nobody seemed interested, so I never made it.

toomuchtodo•4mo ago
It's a hard problem to solve: it's easy to temporarily donate resources to archiving ops via the ArchiveTeam Warrior, but a long-term commitment to run persistent compute and storage to mirror a chunk of the Internet Archive is a much bigger ask. It's why I think Filecoin isn't going to work either; there's very little overlap between the people who feel it's important to keep these archives alive and the people who would run distributed storage to collect financial compensation for doing so.

Easier to send fiat to IA for them to invest (~$2/GB) and to pay to keep the disks spinning somewhere safe across the world.

(ia volunteer, no affiliation otherwise)

stavros•4mo ago
The system I have in mind is strictly volunteer-run, and it automatically balances the files so that it minimises rare copies.

You're right, though, long-term commitment is rare from volunteers. That's why the idea is to make short-term commitment so easy that you have a good enough pool of short-termers that it works out in the aggregate.

toomuchtodo•4mo ago
Appreciate your work on this.
stavros•4mo ago
Eh I didn't really do any work, it's just a design right now, but I think it's a nice one. If any archive team wants to work with me on this, I'd be happy to make it a reality so we have a nice FOSS system for distributed, volunteer-led backups.
toomuchtodo•4mo ago
I suggest emailing textfiles, he'll know who to connect you with in ArchiveTeam, and if there is an opportunity to connect with the decentralized web folks at ia. Strongly believe your architecture is superior to filecoin and IPFS due to relying on torrent primitives.

(ia source of truth, storage system of last resort -> item index -> torrent index -> global torrent swarm)

stavros•4mo ago
Thanks, I will!
stavros•3mo ago
I emailed him but haven't received a reply. In case you were curious for a bit more detail, here's a short design doc I wrote:

https://gist.github.com/skorokithakis/68984ef699437c5129660d...

1gn15•4mo ago
Anna's Archive has this system. This also sounds like Freenet.
stavros•4mo ago
Freenet has a bunch of encryption, which is out of scope for this. What does Anna's Archive have, besides torrents?
1gn15•4mo ago
I'm a bit confused. Isn't this such a system where people can volunteer disk space?

https://annas-archive.org/torrents

I think I'm misunderstanding you.

stavros•4mo ago
My system is more "I want to donate X GB" and it handles everything, filling that space up, getting the rarest torrents, getting updates, etc. Think of it as a central server managing a globally-distributed, unreliable JBOD in a "push" manner, rather than just downloading a torrent and being done.
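(A toy sketch of the rarest-first "push" allocation being described; every name and number below is made up for illustration and is not taken from the linked design doc.)

    # Toy coordinator: given each torrent's current replica count and size,
    # assign the rarest torrents to a new volunteer until their pledge is full.
    def assign_to_volunteer(pledge_bytes, torrents):
        """torrents: list of dicts like {"name": ..., "size": ..., "replicas": ...}"""
        assigned, used = [], 0
        for t in sorted(torrents, key=lambda t: t["replicas"]):  # rarest first
            if used + t["size"] <= pledge_bytes:
                assigned.append(t["name"])
                used += t["size"]
        return assigned

    catalog = [
        {"name": "item-aaa", "size": 300 * 2**30, "replicas": 1},
        {"name": "item-bbb", "size": 200 * 2**30, "replicas": 4},
        {"name": "item-ccc", "size": 600 * 2**30, "replicas": 2},
    ]
    print(assign_to_volunteer(1 * 2**40, catalog))  # "I want to donate 1 TB"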
zerd•4mo ago
Sounds a bit like Wuala https://www.youtube.com/watch?v=3xKZ4KGkQY8
stavros•4mo ago
Hmm, maybe, I don't remember exactly how it worked. I'll watch the video, thanks!
uses•4mo ago
Yeah, I did a scraping project a while back where I wanted to look back at historical snapshots. Getting the info out of Internet Archive was surprisingly difficult. I ended up using https://pypi.org/project/pywaybackup/, which helped quite a bit.
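(For anyone wanting to enumerate snapshots themselves: the Wayback Machine's public CDX API is the usual way to list historical captures of a URL. This sketch hits that endpoint directly; it is not how pywaybackup is implemented internally.)

    # pip install requests
    import requests

    params = {
        "url": "example.com/",        # hypothetical target URL
        "output": "json",
        "from": "2015",
        "to": "2020",
        "filter": "statuscode:200",
        "limit": "50",
    }
    rows = requests.get("https://web.archive.org/cdx/search/cdx", params=params).json()
    header, snapshots = rows[0], rows[1:]  # first row is the field names
    for timestamp, original in ((r[1], r[2]) for r in snapshots):
        print(f"https://web.archive.org/web/{timestamp}/{original}")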
abustamam•4mo ago
That kinda sounds like ipfs

https://ipfs.tech/

pabs3•4mo ago
If anyone wants to help feed in more stuff, ArchiveTeam is a related volunteer group that sends data to IA:

https://archiveteam.org/

londons_explore•4mo ago
Presumably there needs to be some human to decide something is worth archiving to stop someone just using it as a free way to store all their holiday snaps?
pabs3•4mo ago
ArchiveTeam members are the ones with access to start crawls of websites. Anyone can request that they start a crawl; they usually ask for a reason, and most reasons mean a crawl will happen.
jonah-archive•4mo ago
Hi, I run the datacenter/infrastructure team at the Internet Archive! We would love to see you at our various events this fall but if paying for the ticket is difficult for you, please email me (in bio) and we'll get you in (if possible).
awesomeMilou•4mo ago
What events are we talking about here?
jackling•4mo ago
Probably these: https://blog.archive.org/events/
NetOpWibby•4mo ago
I would love to work for IA but openings are rare
pabs3•4mo ago
If you are in Europe, consider Software Heritage (similar to IA but for source code) too:

https://www.softwareheritage.org/jobs/

msephton•4mo ago
Internet Archive now have a presence in Amsterdam
moralestapia•4mo ago
Hey, Q., so what's the size of the internet archive?
metalman•4mo ago
It is large enough that I am wondering if the data, captured by actual physical magnetic charges, has a heft that a person could feel. Obviously the hardware would fill a house or something, but at what point does the world's data become a discernible physical reality, at least in theory?
the_real_cher•4mo ago
I'm betting exabyte or close maybe
textfiles•4mo ago
For the purposes of ballpark, between 150-200 petabytes of unique data, probably on the lowish end of that last I checked.
psychoslave•4mo ago
Are the events distributed all around the world, or just wherever the team is gathered (San Francisco, I guess)?

By the way, thank you to all the teams at IA; what you provide is such an important thing for humanity.

vettyvignesh•4mo ago
Would love technical details around this feat, e.g. how you even crawl to begin with, storage, etc.
WhereIsTheTruth•4mo ago
We all know the NSA has access to servers hosted in the U.S. How are you protecting the archive from malicious tampering? Are you using any form of immutable storage? Is it post-quantum secure?
gosub100•4mo ago
Why would they do that? Have you previously seen a case where they "maliciously tampered" with anyone's website?
WhereIsTheTruth•4mo ago
I just question the integrity and immutability of the data IA is archiving, that's all.

You want to know why they'd tamper with data?

https://seclab.cs.washington.edu/2017/10/30/rewriting-histor...

https://blog.archive.org/2018/04/24/addressing-recent-claims...

The NSA already paid to back-door RSA, got caught shipping pre-hacked routers, can rewrite pages mid-flight with QUANTUM, and can penetrate and siphon data from remote infected machines... what else could they do?

https://www.amnesty.org/en/latest/news/2022/09/myanmar-faceb...

gosub100•4mo ago
IA themselves could tamper with the data, no? It was never meant to be an official historical snapshot to be pulled up for any serious or official purposes. Although it has been used that way for high profile internet drama. It's just a matter of time (maybe during an election) before it's surreptitiously altered and referenced for nefarious purposes.
southernplaces7•4mo ago
Most of all, I'm curious about how you reliably and securely store and host so many archived pages. Would you mind briefly explaining such a huge undertaking? Also, congratulations on this fantastic achievement; you guys are my go-to for so much information.

Edit: And how many terabytes it all amounts to.

zhynn•4mo ago
Thanks for helping to run my favorite library on earth.
zghst•4mo ago
A great milestone for internet history!
itsme0000•4mo ago
Yeah, but their view and download metrics are flat-out wrong all the time. If they weren't a nonprofit they'd be sued for that. But still, a great company and a place for obsolete AWS equipment to retire.
psychoslave•4mo ago
What do you mean?
itsme0000•3mo ago
I run a collection on AI. The view/download numbers are very likely the result of random botting and make no logical sense in terms of what you'd rationally expect to see; I'll see an item downloaded at 10,000x normal numbers for one day, etc.

As for the AWS stuff. Look at the ties between these organizations, pretty clear Amazon is basically self-dealing via a non-profit to write stuff off or have some other scheme.

FooBarWidget•4mo ago
I'm kinda surprised IA hasn't long since been shut down by copyright chasers.

And for single page archives I tend to use archive.is nowadays. For as long as I can remember, IA has been unusably slow.

But still kudos to them for the effort.

fragmede•4mo ago
I very much don't get all of the show "king of the hill" being up on there.
groos•4mo ago
It wasn't shut down but definitely hobbled after they lost the lawsuit and were forced to pull copyrighted content from their site that they used to allow signed-in users to check out an hour at a time. My visits to the site dropped 10x after this.
i_have_to_speak•4mo ago
Is there an index of all these pages?
lofaszvanitt•4mo ago
Would be nice to have visit statistics per domain, so people who host their live sites could determine who visits what on archive.org under their domain vs. their live site :).
lyu07282•4mo ago
I was hoping this would include a talk by Jason Scott / @textfiles; his talks are always so much fun.
textfiles•4mo ago
Back at you.
timmy777•4mo ago
How do you prevent government (and other people who can access the data) from rewriting history?

Do you hash them in some sort of block chain?

The inability to rewrite history will be a fantastic gift to the world.
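(A minimal sketch of the hash-chaining idea the question points at: each new snapshot's record hashes over the previous record, so silently rewriting an old capture changes every hash after it. This is illustrative only; nothing here reflects how IA actually stores data, and it only works if the chain head is published somewhere independent.)

    import hashlib, json

    def append_entry(chain, url, timestamp, content: bytes):
        # Each entry covers the previous entry's hash, making the log tamper-evident.
        prev = chain[-1]["entry_hash"] if chain else "0" * 64
        record = {
            "url": url,
            "timestamp": timestamp,
            "content_sha256": hashlib.sha256(content).hexdigest(),
            "prev": prev,
        }
        record["entry_hash"] = hashlib.sha256(
            json.dumps(record, sort_keys=True).encode()
        ).hexdigest()
        chain.append(record)
        return record

    chain = []
    append_entry(chain, "https://example.com/", "20260101000000", b"<html>v1</html>")
    append_entry(chain, "https://example.com/", "20260201000000", b"<html>v2</html>")
    # Publishing chain[-1]["entry_hash"] externally lets anyone detect later rewrites.
    print(chain[-1]["entry_hash"])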

not--felix•4mo ago
I wonder if openai has archived more pages by now
msephton•4mo ago
1 trillion web pages archived is quite an achievement. But... there's no way to search them? You have to know what URL you want to pull from the archive, which reduces the usefulness of the service. I'd like to search through all those trillion pages for, say, the name of an artist, or for a filename, or for image content.
qwertytyyuu•4mo ago
That would be hell to index
citbl•4mo ago
If it was a commercial problem, e.g. from Google, it would be solved.

The reality is that many things don't exist simply because someone isn't paid to do it.

Keyframe•4mo ago
Given how much AI companies have benefited by leeching off of IA and Common Crawl, it's a shame there isn't at least some money flowing back in.
Exuma•4mo ago
I imagine it would be no different than current indexing strategies with a temporal aspect baked in... it would act almost like a different site, and maybe roll up the results after the fact by domain
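(A toy illustration of "indexing with a temporal aspect baked in": postings carry a capture timestamp so queries can be restricted to a date range. This is a hypothetical structure, not how IA or any search engine actually indexes captures.)

    from collections import defaultdict

    index = defaultdict(list)  # term -> [(url, capture_timestamp), ...]

    def add_capture(url, timestamp, text):
        for term in set(text.lower().split()):
            index[term].append((url, timestamp))

    def search(term, start, end):
        # Return only captures whose timestamp falls inside the requested range.
        return [(u, t) for (u, t) in index[term.lower()] if start <= t <= end]

    add_capture("example.com/art", "20090512", "gallery of an obscure artist")
    add_capture("example.com/art", "20210103", "gallery rebuilt with javascript")
    print(search("artist", "20000101", "20101231"))  # only the 2009 capture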
emporas•4mo ago
I use GPT web search, and I usually ask it to find textbooks from IA. It works really well for textbooks, but I'm not sure about web pages.
bluebarbet•4mo ago
Consider the privacy implications of that. It would effectively create a parallel web where `robots.txt` counts for nothing and where it becomes - retroactively - impossible to delete one's site. Yes, there's ultimately no way to prevent it happening, given that the data is public. But to make the existing IA searchable is IMO just a terrible idea.
breakingcups•4mo ago
Actually, I believe the IA respects robots.txt retroactively, e.g. putting something on the disallow list now removes the same page's scrapes from a year ago from public access in the Wayback Machine, but I'd love to be corrected on that.
bluebarbet•4mo ago
It may do. I remember looking into it and not getting a definitive answer. The issue here is that taking a site offline has surely been widely understood as the ultimate robots.txt `Disallow` instruction to search engines. IMO we should respect that.
1gn15•4mo ago
IIRC the IA no longer cares about robots.txt after it kept getting abused [1] to take down older pages. You can still request to take down pages, but it needs a form and a reason. [2]

(Remember, robots.txt is not a privacy measure, it's supposed to be something that prevents crawlers from getting stuck in tar pits!)

[1] https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...

[2] https://help.archive.org/help/how-do-i-request-to-remove-som...
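(As a side note on what robots.txt actually is: it's a crawl directive that clients choose to honor, which Python's standard library can evaluate as below. Whether an archive honors it, and retroactively, is policy, not protocol. The site and user-agent string here are only examples.)

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("https://example.com/robots.txt")  # hypothetical site
    rp.read()
    # "ia_archiver" used as an example crawler user agent.
    print(rp.can_fetch("ia_archiver", "https://example.com/private/page.html"))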

bluebarbet•4mo ago
Useful to know. My more general position, which apparently is not much shared here, is that removing one's site from the internet has historically meant that the site stops being accessible, stops being indexed, and stops being findable with a simple search. If, going forward, we're going to revise that norm, IMO it would be polite at least to respect it retroactively.
fragmede•4mo ago
That seems in conflict with the idea that once something's been released, it can't ever truly be unreleased.
1gn15•4mo ago
Related: https://wiki.archiveteam.org/index.php/Robots.txt

(Also, consider that when you forbid such functionality, the only thing that happens is that its development becomes private. It's like DRM: it only hurts legitimate customers.)

1gn15•4mo ago
I remember this functionality existing on Kagi or something. But I can't find it.
ks2048•4mo ago
I wonder if Internet Archive and Common Crawl have worked together?

How does their scope or infrastructure compare?

I know they serve different purposes, but both are essentially doing similar things.

pabs3•4mo ago
I think IA ingests crawl WARCs from CC, as well as other groups like ArchiveTeam.
BiraIgnacio•4mo ago
Congratulations!
yupyupyups•4mo ago
https://hoarding.support/
totaldude87•4mo ago
So instead of scraping all webpages, one just has to pay the Archive and get all the data?
strickinato•4mo ago
The artist who is playing at the in-person celebration event this week (Sam Reider) is great! That's exciting.
londons_explore•4mo ago
The internet archive should be striking deals with AI companies....

We'll load a truck with a copy of our complete archive if you give us a substantial donation to keep the archive going for a few more years.

If you don't agree to this deal, you can still access the archive, but it's gonna be at sluggish download speeds and take you years to get all the content.

Lapra•4mo ago
This would destroy the goodwill that they've built up as a public good. People generally don't mind you archiving their content, but if you're selling access to that data, they aren't going to stand for it.
vivzkestrel•4mo ago
kinda unrelated and stupid question: if we archived the version of every page on the internet every second for 10 years, would there be 1 decillion pages at the end of a decade?
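(A quick back-of-the-envelope answer, assuming the web held a constant ~1 trillion distinct pages, the milestone figure, for the whole decade: the total comes to roughly 3x10^20 snapshots, which is still about twelve orders of magnitude short of a decillion, 10^33.)

    pages = 1e12                       # assume a constant 1 trillion distinct pages
    seconds = 10 * 365.25 * 24 * 3600  # ~3.16e8 seconds in a decade
    snapshots = pages * seconds
    print(f"{snapshots:.2e}")          # ~3.16e20 snapshots
    print(f"{1e33 / snapshots:.1e}")   # a decillion is still ~3e12 times larger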
philippz•4mo ago
We should probably copy this whole thing to IPFS and put it on chain