
ArchiveTeam has finished archiving all goo.gl short links

https://tracker.archiveteam.org/goo-gl/
425•pentagrama•5mo ago

Comments

do_not_redeem•5mo ago
Does "all" mean all the URLs publicly known, or did they exhaustively iterate the entire URL namespace?
jedberg•5mo ago
They iterated the entire URL namespace by having volunteers run a client so they didn't get IP banned.
barbazoo•5mo ago
Beautiful. I wish I had seen this and could have helped.
brokensegue•5mo ago
They are still archiving other URL shorteners (https://tracker.archiveteam.org:1338/); you can participate in that
Imustaskforhelp•5mo ago
Are we sure that the entire URL namespace has been mapped?

How would that even be done? I mean, did they loop through every single permutation and see the result, or how exactly would that work?

toomuchtodo•5mo ago
The pipeline code is available for review of the mechanics of http requests made if you follow the ArchiveTeam wiki links.
jedberg•5mo ago
> did they loop through every single permutation and see the result, or how exactly would that work?

In short, yes. Since no one can make new links, it's a pre-defined space to search. They just requested every possible key, and recorded the answer, and then uploaded it to a shared database.
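
A rough sketch of what "requesting every possible key" looks like (purely illustrative, not ArchiveTeam's actual pipeline, which is linked from their wiki and also shards the key space across volunteers, rate-limits itself, and writes WARCs):

```python
# Illustrative only: walk a slice of the fixed goo.gl key space, request each
# code without following the redirect, and record the Location header.
import http.client
import itertools
import string

ALPHABET = string.ascii_letters + string.digits  # goo.gl codes are alphanumeric

def resolve(code: str) -> str | None:
    """Return the redirect target for one short code, or None if unmapped."""
    conn = http.client.HTTPSConnection("goo.gl", timeout=10)
    try:
        conn.request("HEAD", "/" + code)
        resp = conn.getresponse()
        if resp.status in (301, 302):
            return resp.getheader("Location")
        return None  # 404, interstitial, rate-limit response, etc.
    finally:
        conn.close()

# Demo: only the first few 2-character codes; the real effort covered the
# full, much longer code space.
for combo in itertools.islice(itertools.product(ALPHABET, repeat=2), 5):
    code = "".join(combo)
    print(code, "->", resolve(code))
```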

ccgreg•5mo ago
The goo.gl URLs that are publicly known are already in the Internet Archive and Common Crawl crawls.
zdimension•5mo ago
Title is imprecise: it's ArchiveTeam.org, not Archive.org. The Internet Archive is providing free hosting, but the archival work was done by ArchiveTeam members.
im3w1l•5mo ago
What exactly is archiveteam's contribution? I don't fully understand.

Edit: Like they kinda seem like an unnecessary middle-man between the archive and archivee, but maybe I'm missing something.

1gn15•5mo ago
ArchiveTeam delegates tasks to volunteers (and themselves) running the Archive Warrior VM, which does the actual archiving. The resulting archives are then centralized by ArchiveTeam and uploaded to the Internet Archive.

(Source: ran a Warrior)

notpushkin•5mo ago
Sidenote, but you can also run a Warrior in Docker, which is sometimes easier to set up (e.g. if you already have a server with other apps in containers).
kalleboo•5mo ago
Yep, I have my archiveteam warrior running in the built-in Docker GUI on my Synology NAS. Just a few clicks to set up and it just runs there silently in the background, helping out with whatever tasks it needs to.
gunalx•5mo ago
I ran Archive Warrior a while back but had to shut it down as I started seeing the VM was compromised, trying to spam SSH and other login attempts on my local network.
mdaniel•5mo ago
This smells like a one-click bringup gone wrong, not like the Warrior software itself was compromised.

Is that the story, or are you saying that the machine was secured correctly but that running Warrior somehow exposed your network to risk?

gunalx•5mo ago
It should have been properly set up, but it was a couple of years ago and I might have left too much open. (It was on a VM behind a consumer NAT/firewall/router solution.)
debesyla•5mo ago
They gathered up the links for processing, because Google doesn't just hand out a list of short links in use. So the links have to be gathered by brute force first.
diggan•5mo ago
> What exactly is archiveteam's contribution? I don't fully understand.

If the Internet Archive is a library, ArchiveTeam is the people who run around collecting stuff and give it to the library for safekeeping. Stuff that is estimated or announced to be disappearing or removed soon tends to be the focus, too.

creatonez•5mo ago
What ArchiveTeam mainly does is provide hand-made scripts to aggressively archive specific websites that are about to die, with a prioritization for things the community deems most endangered and most important. They provide a bot you can run to grab these scripts automatically and run them on your own hardware, to join the volunteer effort.

This is in contrast to the Wayback Machine's builtin crawler, which is just a broad spectrum internet crawler without any specific rules, prioritizations, or supplementary link lists.

For example, one ArchiveTeam project had the goal to save as many obscure Wikis as possible, using the MediaWiki export feature rather than just grabbing page contents directly. This came in handy for thousands of wikis that were affected by Miraheze's disk failure and happened to have backups created by this project. Thanks to the domain-specific technique, the backups were high-fidelity enough that many users could immediately restart their wiki on another provider as if nothing happened.

They also try to "graze the rate limit" when a website announces a shutdown date and there isn't enough time to capture everything. They actively monitor for error responses and adjust the archiving rate accordingly, to get as much as possible as fast as possible, hopefully without crashing the backend or inadvertently archiving a bunch of useless error messages.

dkh•5mo ago
I just made a root comment with my experience seeing their process at work, but yeah it really cannot be overstated how efficient and effective their archiving process is
iamacyborg•5mo ago
Their MediaWiki tool was also invaluable in helping us fork the Path of Exile wiki from Fandom.
wongarsu•5mo ago
> Like they kinda seem like an unnecessary middle-man between the archive and archivee

They are the middlemen who collect the data to be archived.

In this example the archivee (goo.gl/Alphabet) is simply shutting the service down and has no interest in archiving it. Archive.org is willing to host the data, but only if somebody brings it to them. ArchiveTeam writes and organises the crawlers that collect the data and send it to Archive.org.

horseradish7k•5mo ago
liability shield
wlonkly•5mo ago
Archive Team is carrying books in a bucket brigade out of the burning library. Archive.org is giving them a place to put the books they saved.
Ayesh•5mo ago
Recent update from Google: https://blog.google/technology/developers/googl-link-shorten...
OJFord•5mo ago
This leaves me wondering what the point is. What could it possibly cost to keep redirecting existing shortlinks that they already consider unused/low activity anyway?

(In addition to the higher-activity ones the parent link says they'll now continue to redirect.)

toomuchtodo•5mo ago
To save face.
RicoElectrico•5mo ago
In another submission someone speculated the reason might be the unending churn of the Google tech stack that just makes low-maintenance stuff impossible.
immibis•5mo ago
My guess is that plus not having a single person left to maintain it due to the similarly unending people churn.
manquer•5mo ago
For a company also running a hosting service like GCP? Nothing.

They already have plenty of unused compute, older hardware, CDN POPs, a performant distributed data store, and everything else possibly needed.

It would be cheaper than the free credits they give away to get just one startup onto GCP.

I don't think infra costs are a factor in a decision like this.

shaky-carrousel•5mo ago
Yeah, I'll take that "update" as the extremely unreliable info from an extremely unreliable company that it is.
fortran77•5mo ago
I don't really understand this. Is it really that costly to keep the entire database if they're going to keep part of it?
tombert•5mo ago
I built a URL shortener years ago for fun. I don't have the resources that Google has, but I just hacked it together in Erlang using Riak KV and it did horizontally scale across at least three computers (I didn't have more at the time).

Unless I'm just super smart (I'm not), it's pretty easy to write a URL shortener as a key-value system, and pure key-value stuff is pretty easy to scale. I cannot imagine Google isn't doing something at least as efficient as what I did.

wtallis•5mo ago
Google also has the advantages that they now only need a read-only key-value store, and they know the frequency distribution for lookups. This is now the kind of problem many programmers would be happy to spend a weekend optimizing to get an average lookup time down to tens of nanoseconds.
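
As a sketch of that idea (hypothetical data; a real implementation chasing tens of nanoseconds would use a memory-mapped native structure or a perfect hash, not Python): because the mapping is now frozen, it can be packed once into a sorted table and served with nothing but binary search.

```python
import bisect

# Hypothetical sample of the frozen short-code -> target mapping.
mappings = [
    ("2b3c4d", "https://example.org/some/long/path"),
    ("Xy9Zq1", "https://example.com/another/long/url"),
    ("aB12cD", "https://example.net/yet/another"),
]
mappings.sort()                      # build step: sort once, never mutate again
codes = [c for c, _ in mappings]     # parallel array of keys for bisect

def lookup(code: str) -> str | None:
    """Read-only lookup over the frozen table."""
    i = bisect.bisect_left(codes, code)
    if i < len(codes) and codes[i] == code:
        return mappings[i][1]
    return None

print(lookup("Xy9Zq1"))   # https://example.com/another/long/url
print(lookup("zzzzzz"))   # None
```
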
tombert•5mo ago
I don't think it would even cost me very much to host all these links on a GCP or AWS thing; I don't think it would be more than a couple hundred dollars a year.

Obviously raw server costs aren't the only costs associated with something like this, you'd still need to pay software people to keep it on life support, but considering how simple URL shorteners are to implement, I still don't think it would be that expensive.

ETA:

I should point out, even something kind of half-assed could be built with Cloud Functions and BigTable really easily; this wouldn't win any kind of contests for low latency, but it would be exceedingly simple code and have sufficient uptime guarantees and would be much less likely to piss off the community.

If I had any idea how to reach out to higher-ups at Google I would offer to contract and build it myself, but that's certainly not necessary; they have thousands of developers, most of whom could write this themselves in an afternoon.

benoau•5mo ago
I don't understand the data on ArchiveTeam's page, but it seems like they have 35 terabytes of data (286.56 TiB)? It's a lot larger than I'd have thought.
wtallis•5mo ago
FYI, "TiB" means terabytes with a base of 1024, ie. the units you'd typically use for measuring memory rather than the units you'd typically see drive vendors using. The factor of 8 you divided by only applies to units based on bits rather than bytes, and those units use "b" rather than "B", and are only used for capacity measurements when talking about individual memory dies (though they're normal for talking about interconnect speeds).

Either way, we're talking about a dataset that fits easily in a 1U server with at most half of its SSD slots filled.
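
For reference, the unit arithmetic behind that correction:

```python
tib = 286.56
print(tib * 2**40 / 10**12)  # ~315.1 decimal terabytes (TB); 1 TiB = 2**40 bytes
print(tib / 8)               # ~35.8 -- the mistaken bits->bytes division
```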

jdiff•5mo ago
The binary units like GiB, TiB, are technically supposed to be Gibibytes and Tebibytes. Thought it was a bit silly when they first popped up but now I find them adorkably endearing, and a good way to disambiguate something that's often left vague at your expense.
wtallis•5mo ago
In my experience, nobody actually says "Tebibytes" out loud; it's just that silly. In writing, when the precision is necessary, the abbreviation "TiB" does see some actual use.
hobs•5mo ago
If that's the unit, I am saying it, but yes, everyone gives me weird looks every time and just assumes I am mispronouncing "terabytes", yet does not correct me.
nocoiner•5mo ago
I have a question about this.

Per Google, shortened links “won't work after August 25 and we recommend transitioning to another URL shortener if you haven’t already.”

Am I missing something, or doesn’t this basically obviate the entire gesture of keeping some links active? If your shortened link is embedded in a document somewhere and can’t be updated, google is about to break it, no?

OJFord•5mo ago
About to break it if it didn't seem 'actively used' in late 2024, yes. But if your document was being frequently read and the link actively clicked, it'll (now) keep working.

But as I said in a sibling comment to yours, I don't see the point of the distinction: why not just keep them all? Surely the mostly unused ones are even cheaper to serve.

dang•5mo ago
Related. Others?

Enlisting in the Fight Against Link Rot - https://news.ycombinator.com/item?id=44877021 - Aug 2025 (107 comments)

Google shifts goo.gl policy: Inactive links deactivated, active links preserved - https://news.ycombinator.com/item?id=44759918 - Aug 2025 (190 comments)

Google's shortened goo.gl links will stop working next month - https://news.ycombinator.com/item?id=44683481 - July 2025 (222 comments)

Google URL Shortener links will no longer be available - https://news.ycombinator.com/item?id=40998549 - July 2024 (49 comments)

Ask HN: Google is sunsetting goo.gl on 3/30. What will be your URL shortener? - https://news.ycombinator.com/item?id=19385433 - March 2019 (14 comments)

Tell HN: Goo.gl (Google link Shortener) is shutting down - https://news.ycombinator.com/item?id=16902752 - April 2018 (45 comments)

Google is shutting down its goo.gl URL shortening service - https://news.ycombinator.com/item?id=16722817 - March 2018 (56 comments)

Transitioning Google URL Shortener to Firebase Dynamic Links - https://news.ycombinator.com/item?id=16719272 - March 2018 (53 comments)

makeworld•5mo ago
Glad I contributed to this in some small way.
Klathmon•5mo ago
Same, it's nice to see my username on the leaderboards.

Even though all I did was set up the Docker container one day and forget about it.

yreg•5mo ago
I wonder how many of them lead to private YouTube videos, Google documents, etc.
mdaniel•5mo ago
I was going to be cheeky and say "well, now you can download them and search" but it seems it's "Access-restricted-item: true" for some reason, above and beyond being 10G a pop <https://archive.org/details/archiveteam_googl_20250228144231...>
horseradish7k•5mo ago
You'd have to rescrape them all from https://web.archive.org/cdx/search?url=goo.gl/* - they don't publish the whole dataset
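
A hedged sketch of what such a rescrape could look like against the Wayback Machine's CDX server API (field order, parameters, and rate limits may differ; check the CDX server documentation):

```python
import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "url": "goo.gl/*",   # prefix match over the goo.gl namespace
    "output": "json",
    "limit": "10",
})
with urllib.request.urlopen("https://web.archive.org/cdx/search/cdx?" + params) as resp:
    rows = json.load(resp)

header, entries = rows[0], rows[1:]  # first row is the field-name header
for row in entries:
    record = dict(zip(header, row))
    print(record.get("timestamp"), record.get("original"))
```
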
mdaniel•5mo ago
No, I meant the .warc.zst files on archive.org that were the result of ArchiveTeam's work. However, it seems they're under some kind of embargo, which is the first time I've ever seen a private link on archive.org.
rafram•5mo ago
I can see some reasonable arguments for not publishing the full dataset. People undoubtedly shortened lots of links to unlisted videos/documents/pages under the assumption that the short link, like the original link, would be unguessable.
mdaniel•5mo ago
Then why go to the trouble of archiving them, then upload them to a public archive site, only to then keep them secret?

I'm sure pastebin is filled with people's AWS credentials, too, but you don't see them randomly denying access to listings

rafram•5mo ago
Because then you can access the archived destination if you already know the short URL. You just can't get a full list of potentially sensitive short URL/destination pairs.
mdaniel•5mo ago
You are aware of which thread you're discussing this in, right? The one where a bunch of like-minded souls enumerated all the address space in a few weeks?

The sibling link above that queries Wayback's WARC index shows at least the first several are only 6 alphanumeric characters wide, so it's no wonder ArchiveTeam got them in reasonable time.

Picking one at random, it seems the super sekrit deets you're safeguarding include buyrussia21.co.kr which, yes, is for sure very, very secret

brokensegue•5mo ago
I asked them why they did this. The answer, surprisingly, is that they fear that if they release the full dumps they will get blocked, because of the AI scraping wars.
cedws•5mo ago
Feels like a bit of a kick in the teeth that I contributed towards archiving something that I don’t even get access to. What happens if they disappear? The dataset is gone forever.
brokensegue•5mo ago
You get access to it via the Wayback Machine.
qingcharles•5mo ago
This does seem off. Especially as I can navigate to any of those URLs myself. Hell, if I wanted to spin up 50 virtual servers and go crazy I could probably pay a few thousand bucks to re-scrape the thing myself.
globular-toast•5mo ago
Who fears they will get blocked by whom?
brokensegue•5mo ago
ArchiveTeam blocked by hosts wanting to protect their data from AI companies (presumably because they want to extract money from them)
mdaniel•5mo ago
This whole thread is starting to read like some kind of misguided practical joke. I also recognize that it may seem like this is directed toward you, but I'm not shooting the messenger; I'm just anchoring my reply under this new information. Sorry about that.

But, ok, let's continue in good faith

scenario 1: they don't want to uncork the .warc files because it will potentially leak the means and methods of the Archive Warrior or its usages

scenario 2: they don't want to expose the target of the redirects because it will feed the boundaries of the ravenous AI slurp machines

If it's scenario 1, then CSV exists and allows mapping from the 00aa11 codes to the "location:" header, no means and methods necessary

If it's scenario 2, then what the hell were they expecting to happen? Embargo the .warc until the AI hype blows over so their great-grandchildren can read about how the Internet was back in the day? I guess the real question is "archive for whom?", because right now, unless they have a back-channel way to feed the Wayback Machine's boundary using the .warc files, and thus secretly populate the Wayback without wholesale feeding the AI boundary, this whole thing is just mysterious.

brokensegue•5mo ago
I think you're missing some key information. The WARCs do not just contain the Location header information, and their methods are fully public/open source, so scenario 1 makes no sense.

Sure, maybe the WARCs will be unlocked at some point in the future. This is a fairly small volunteer effort; I doubt there is some "unlock in 100 years" feature on IA.

nicolas_17•5mo ago
Yes exactly, Wayback Machine can use the warc files despite them being blocked for direct download.
yreg•5mo ago
Yeah what they did is probably the best way to handle it.
viliml•5mo ago
Tangentially related but I've seen twitter links that used to be on the wayback machine disappear from it at some point, presumably due to personal request from the owner.
corobo•5mo ago
Pretty sure you can nuke all your domain's old content by blocking archive.org in robots.txt.
dkh•5mo ago
Excellent! ArchiveTeam have always been impressive this way. Some years ago, I was working at a video platform that had just announced it would be shutting down fairly soon. I forget how, but one way or another I got connected with someone at ArchiveTeam who expressed their interest in archiving it all before it was too late. Believing this to be a good idea, I gave them a couple of tips about where some of our device-sniffing server endpoints were likely to give them a little trouble, and temporarily "donated" a couple EC2 instances to them to put towards their archiving tasks.

Since the servers were mine, I could see what was happening, and I was very impressed. Within I want to say two minutes, the instances had been fully provisioned and were actively archiving videos as fast as was possible, fully saturating the connection, with each instance knowing to only grab videos the other instances had not already gotten. Basically they have always struck me as not only having a solid mission, but also being ultra-efficient in how they carry it out.

Aardwolf•5mo ago
I don't understand the page; it shows a list of data sets (I think?) up to 91 TiB in size.

The list of short links and their target URLs can't be 91 TiB in size, can it? Does anyone know how this works?

jdiff•5mo ago
I did some ridiculous napkin math. A random URL I pulled from a Google search was 705 bytes. A goo.gl link is 22 bytes, but if you only store the ID, it'd be 6 bytes. Some URLs are going to be shorter, some longer, but just ballparking it all, that lands us in the neighborhood of hundreds of billions of URLs, up to trillions of URLs.
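
Redoing that napkin math (very rough; it ignores WARC and compression overhead):

```python
total_bytes = 91 * 2**40                 # the 91 TiB figure mentioned above
for avg_url_bytes in (705, 77):          # a long search-result URL vs. a typical URL
    per_record = avg_url_bytes + 6       # plus a 6-byte short-code ID
    print(avg_url_bytes, "B/URL ->", f"{total_bytes / per_record:.1e}", "URLs")
# 705 B/URL -> ~1.4e11 (hundreds of billions); 77 B/URL -> ~1.2e12 (about a trillion)
```
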
rafram•5mo ago
> A random URL I pulled from a Google search was 705 bytes.

705 bytes is an extremely long URL. Even if we assume that URLs that get shortened tend to be longer than URLs overall, that’s still an unrealistic average.

jdiff•5mo ago
It is long; it represents the lower bound of hundreds of billions in my awful napkin math.
digitaldragon•5mo ago
The data is saved as a WARC file, which contains the entire HTTP request and response (compressed, of course). So it's much bigger than just a short -> long URL mapping.
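
If the WARCs do become downloadable at some point, pulling the short -> long mapping back out is straightforward; a sketch using the third-party warcio package (the file name here is hypothetical):

```python
from warcio.archiveiterator import ArchiveIterator

with open("googl-sample.warc.gz", "rb") as stream:   # hypothetical local file
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        short_url = record.rec_headers.get_header("WARC-Target-URI")
        status = record.http_headers.get_statuscode()
        target = record.http_headers.get_header("Location")
        if status in ("301", "302") and target:
            print(short_url, "->", target)
```
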
lyu07282•5mo ago
Did they follow the redirect and archive the page content? But why?
lyu07282•5mo ago
3.75 billion URLs; according to this [1], the average URL is 76.97 characters, which would be ~268.8 GiB without the goo.gl IDs/metadata. So I also wonder what's up with that.

https://web.archive.org/web/20250125064617/http://www.superm...

ethan_smith•5mo ago
The 91 TiB includes not just the URL mappings but the actual content of all destination pages, which ArchiveTeam captures to ensure the links remain functional even if original destinations disappear.
account42•5mo ago
OK, but the destination pages are not at risk (or at least not any more than any random page on the web), so why spend any effort crawling them before all the shortcuts have been saved?
immibis•5mo ago
They might be storing in WARC format, which records all the request and response headers and maybe even TLS certificates and things.
SilverElfin•5mo ago
Is there anyone archiving all of reddit? Or twitter? I mean even if their terms have changed to not allow it.
9dev•5mo ago
Ask OpenAI maybe?
DaSHacka•5mo ago
> reddit

There used to be one such project (Pushshift), before the Reddit API change. You can download all the data and see all the info on the-eye, another datahoarder/preservationist group:

https://the-eye.eu/redarcs/

> twitter

Not that I know of, and you haven't even been able to archive tweets on the Wayback machine for YEARS.

stuffoverflow•5mo ago
Academictorrents has monthly dumps of all reddit submissions and comments even after the API restrictions.
pabs3•5mo ago
https://academictorrents.com/browse.php?search=stuck_in_the_...
SilverElfin•5mo ago
Interesting. You don’t have to be an academic to access these I guess?
mkl•5mo ago
They have magnet links and torrent files right there on the pages, so no.
Seattle3503•5mo ago
ArcticShift is a project with that goal. It picks up where PushShift left off when the API changes killed that project.

https://github.com/ArthurHeitmann/arctic_shift

pabs3•5mo ago
Viewer and stats for ArcticShift: https://photon-reddit.com/ https://arctic-shift.photon-reddit.com/
SilverElfin•5mo ago
Thanks. I wonder if anyone does this for hacker news.
mdaniel•5mo ago
I believe there is a dataset in BigQuery, but I haven't tried looking at it to know how up to date it is <https://news.ycombinator.com/item?id=10440502>

Given that Firebase (which powers the API link at the bottom of this page) is a Google property, I cannot possibly imagine why they'd differ

pabs3•5mo ago
ArchiveTeam was doing that, but their stuff no longer works due to changes at Reddit. The wiki page about it links to some other groups doing Reddit archiving.

https://wiki.archiveteam.org/index.php/Reddit

iJohnDoe•5mo ago
Why? Did they ask anyone if it was okay? Anything sensitive at those links? Anything at those links people didn't want or need anymore? Maybe people thought those links were dead? Did Google provide a way to cancel those links first?

It's like when the GPT links that contained sensitive information were archived and made publicly available.

wiredpancake•5mo ago
Sometimes, to preserve history, you just have to do what you gotta do.

After all, these are just short links. They link to other things on the Internet, which is inherently public anyway.

You cannot expect privacy via a simple URL. These short URLs are short, hence it was possible to programmatically scrape them all.

The GPT links situation is nothing like this, IMO. Both, however, do come down to the stupid-human aspect.

anticrymactic•5mo ago
It's a link; what privacy can one expect?

Especially with short links, there's always the possibility of entering ~6 characters and getting a hit. So I believe expecting any secrecy from URLs is silly.

That's like posting your passwords on Twitter because "why would anyone find my account?"
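
The rough numbers back that up (assuming 6-character alphanumeric codes and the 3.75 billion figure quoted upthread):

```python
keyspace = 62 ** 6        # upper/lowercase letters + digits, 6 characters
links_issued = 3.75e9     # estimate quoted elsewhere in this thread
print(f"{keyspace:,} possible codes")                        # 56,800,235,584
print(f"{links_issued / keyspace:.1%} of random guesses hit a live link")
```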

diath•5mo ago
If you want something to remain private, don't post it on the public internet.
NylaTheWolf•5mo ago
Hell yeah!!! Fantastic work, everyone!
m3kw9•5mo ago
Ok how do I access them, or is that not the point?
zahlman•5mo ago
The point is that content previously referred to elsewhere on the Internet (for example, on Stack Overflow) via goo.gl doesn't have to suffer unrecoverable link rot.
pabs3•5mo ago
They are being added to web.archive.org, so you would access them through that.
raldi•5mo ago
Google said they would keep hosting any recently-clicked link; does this mean that all the links are now recently-clicked?
JimDabell•5mo ago
“Recently clicked” wasn’t the criterion; it was “showed activity in late 2024”. So nothing that anybody has done this year – including this archiving – will affect which links Google keeps alive.
raybb•5mo ago
Happy to have contributed a hundred thousand links by running their Docker container!
edg5000•5mo ago
Can we build a blockchain/P2P-based web crawler that can create snapshots of the entire web with high integrity (peer verification)? The already-crawled pages would be exchanged through bulk transfer between peers. This would mean there is an "official" source of all web data. LLM people can use snapshots of this. This would hopefully reduce the amount of ill-behaved crawlers, so we will see less draconian anti-bot measures over time on websites, in turn making it easier to crawl. Does something like this exist? It would be so awesome. It would also allow people to run a search engine at home.
bayindirh•5mo ago
Why would I spend time and resources to feed a machine which wastes more resources to hallucinate fiction from data it ingested?

For digital preservation? We may discuss. For an LLM? Haha, no.

No, thank you.

lyu07282•5mo ago
> This would mean there is an "official" source of all web data. LLM people can use snapshots of this

That already exists; it's called Common Crawl:

https://commoncrawl.org/

patrickhogan1•5mo ago
Common Crawl, while a massive dataset of the web, does not represent the entirety of the web.

It's smaller than Google's index, and Google does not represent the entirety of the web either.

For LLM training purposes this may or may not matter, since it does have a large amount of the web. It's hard to prove scientifically whether the additional data would train a better model, because no one (AFAIK), not Google, not Common Crawl, not Facebook, not the Internet Archive, has a copy that holds the entirety of the currently accessible web (let alone dead links). Using Google-fu, I'm often surprised at how many pages I know exist, even from famous authors, that just don't appear in Google's index, Common Crawl, or IA.

schoen•5mo ago
Is there any way to find patterns in what doesn't make it into Common Crawl, and perhaps help them become more comprehensive?

Hopefully it's not people intentionally allowing the Google crawler and intentionally excluding Common Crawl with robots.txt?

edg5000•5mo ago
Cool! I will check it out
brador•5mo ago
Gamefaqs remains unarchived.
mdaniel•5mo ago
Be the change you want to see in the world. Contributing .warc files to Archive.org isn't a gated club. My understanding is that calling in the Warrior team is for when something is time-sensitive and needs to pseudo-DDoS the site to get the bytes right now. Unless you know something about the demise of Gamefaqs, you have the rest of your life to archive it a page at a time.