My "help reboot society with the help of my little USB stick" thing was a throwaway remark to the journalist at a random point in the interview, I didn't anticipate them using it in the article! https://www.technologyreview.com/2025/07/17/1120391/how-to-r...
A bunch of people have pointed out that downloading Wikipedia itself onto a USB stick is sensible, and I agree with them.
Wikipedia dumps default to MySQL, so I'd prefer to convert that to SQLite and get SQLite FTS working.
1TB or more USB sticks are pretty available these days so it's not like there's a space shortage to worry about for that.
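A minimal sketch of the SQLite FTS side of that, assuming the article text has already been extracted out of the MySQL dump into (title, body) pairs and that your SQLite build includes FTS5 (table and column names here are just placeholders):

import sqlite3

# Sketch only: the hard part (converting the MySQL dump) is assumed done already.
conn = sqlite3.connect("wikipedia.db")
conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS articles USING fts5(title, body)")

def add_article(title, body):
    conn.execute("INSERT INTO articles (title, body) VALUES (?, ?)", (title, body))

def search(query, limit=10):
    # bm25() is FTS5's built-in ranking function; lower scores are better matches.
    return conn.execute(
        "SELECT title FROM articles WHERE articles MATCH ? ORDER BY bm25(articles) LIMIT ?",
        (query, limit),
    ).fetchall()

add_article("Water purification", "Boiling, filtration and chlorination are common methods ...")
conn.commit()
print(search("water purification"))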
But neither is sufficient for modern technology beyond pointing to a starting point.
> All digitized books ever written/encoded compress to a few TB.
I tried to estimate how much data this actually is in raw text form:
# annas archive stats
papers = 105714890
books = 52670695
# word count estimates
avrg_words_per_paper = 10000
avrg_words_per_book = 100000
words = (papers*avrg_words_per_paper + books*avrg_words_per_book )
# quick test on a sample of 27 million words from a few books
sample_words = 27809550
sample_bytes = 158824661
sample_bytes_comp = 28839837 # using zpaq -m5
bytes_per_word = sample_bytes/sample_words
byte_comp_ratio = sample_bytes_comp/sample_bytes
word_comp_ratio = bytes_per_word*byte_comp_ratio
print("total:", words*bytes_per_word*1e-12, "TB") # total: 30.10238345855199 TB
print("compressed:", words*word_comp_ratio*1e-12, "TB") # compressed: 5.466077036085319 TB
So that's roughly 30 TB uncompressed and ~5.5 TB compressed. That fits on three 2 TB microSD cards, which you could buy for a total of about $750 from SanDisk.
But then it's also one of those jokes which has a tiny element of truth to it.
So I think I'm OK with how it comes across. Having that joke played straight in MIT Technology Review made me smile.
Importantly (to me) it's not misleading: I genuinely do believe that, given a post-apocalyptic scenario following a societal collapse, Mistral Small 3.2 on a solar-powered laptop would be a genuinely useful thing to have.
I suppose the most important knowledge to preserve is knowledge about global catastrophic risks, so after the event, humanity can put the pieces back together and stop something similar from happening again. Too bad this book is copyrighted or you could download it to the USB stick: https://www.amazon.com/Global-Catastrophic-Risks-Nick-Bostro... I imagine there might be some webpages to crawl, however: https://www.lesswrong.com/w/existential-risk
Why are kids treated worse than AI companies and have to bum around?)
It will be the free new Wikipedia+ for learning anything in the best way possible, with the best graphs, interactive widgets, etc.
What LLMs have for free, but humans for some reason don't.
In some places it is possible to use copyrighted materials for education, as long as it's not directly for profit.
Uh huh. Now imagine the collective amount of work this would require above and beyond the already overwhelmed number of volunteer staff at Wikipedia. Curation is ALWAYS the bugbear of these kinds of ambitious projects.
Interactivity aside, it sounds like you want the Encyclopaedia Britannica.
What made it so incredible for its time was the staggeringly impressive roster of authors behind the articles. In older editions, you could find the entry on magic written by Harry Houdini, a physics entry penned by Einstein himself, etc.
LLMs will return faulty or imprecise information at times, but what they can do is understand vague or poorly formed questions and help guide a user toward an answer. They can explain complex ideas in simpler terms, adapt responses based on the user's level of understanding, and connect dots across disciplines.
In a "rebooting society" scenario, that kind of interactive comprehension could be more valuable. You wouldn’t just have a frozen snapshot of knowledge, you’d have a tool that can help people use it, even if they’re starting with limited background.
On the other hand, real history is filled with all sorts of things being treated as a god that were much worse than an "unreliable computer". For example, a lot of the time it's just a human with malice.
So how bad could it really get?
I don't know. How about we ask some of the peoples who have been destroyed on the word of a single infallible malicious leader.
Oh wait, we can't. They're dead.
Any other questions?
I imagine this is how a lot of people feel when using LLMs, especially now that the technology is so new.
It is the most incredible technology created up to this point in our history, imo, and the cynicism on HN is astounding to me.
The printing press is more than 600 years old. Printing itself is more than 1200 years old.
> cynicism on HN
there are lots of different replies on HN, from very different people in very different socio-economic niches
fun to imagine whether images help in this scenario
- "'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?' "
I've found using LLMs to be a good way of getting an idea of where the current historiography of a topic stands, and which sources I should dive into. Conversely, I've been disappointed by the number of Wikipedia editors who become outright hostile when you say that Wikipedia is unreliable and that people often need to dive into the sources to get a better understanding of things. There have been some Wikipedia articles I've come across that have been so unreliable that people who didn't look at other sources would have been greatly misled.
I would highly appreciate if you were to leave a comment e.g. on the talk page of such articles. Thanks!
Most of this is based on reputation. LLMs are the same; I just have to calibrate my level of trust as I use them.
To be fair, so do humans and Wikipedia.
I find LLMs with the search functionality to be weak because they blab on too much when they should be giving me more outgoing links I can use to find more information.
I strongly dislike the way AI is being used right now. I feel like it is fundamentally an autocomplete on steroids.
That said, I admit it works as a far better search engine than Google. I can ask Copilot a terse question in quick mode and get a decent answer often.
On the other hand, if I ask it extremely in-depth technical questions, it hallucinates like crazy.
It also requires suspicion. I asked it to create a repo file for an old CentOS release on vault.centos.org. The output was flawless except one detail — it specified the gpgkey for RPM verification not using a local file but using plain HTTP. I wouldn’t be upset about HTTPS (that site even supports it), but the answer presented managed to completely thwart security with the absence of a single character…
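A quick way to catch that class of mistake, as a sketch (the path below is made up; dnf/yum .repo files are INI-style, so configparser can read them well enough for a sanity check):

import configparser

def insecure_gpgkeys(repo_path):
    """Flag gpgkey URLs fetched over plain HTTP rather than a local file:// path or HTTPS."""
    cfg = configparser.ConfigParser(interpolation=None)
    cfg.read(repo_path)
    findings = []
    for section in cfg.sections():
        for url in cfg.get(section, "gpgkey", fallback="").split():
            if url.startswith("http://"):
                findings.append((section, url))
    return findings

# Hypothetical filename; vault repo files normally live under /etc/yum.repos.d/
for section, url in insecure_gpgkeys("/etc/yum.repos.d/centos-vault.repo"):
    print(f"[{section}] fetches its GPG key over plain HTTP: {url}")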
Me too, though these days I'm more interested in its underrated capabilities to foster teaching of e-governance and democracy/participation.
> "What would you see in an article that motivates you to check out the meta layers?"
Generally: How the lemma came to be, how it developed, any contentious issues around it, and how it compares to tangential lemmata under the same topical umbrella, especially with regards to working groups/SIGs (e. g. philosophy, history), and their specific methods and methodologies, as well as relevant authors.
With regards to contentious issues, one obviously gets a look into what the hot-button issues of the day are, as well as (comparatives of) internal political issues in different wiki projects (incl. scandals, e. g. the right-wing/fascist infiltration and associated revisionism and negationism in the Croatian wiki [1]). Et cetera.
I always look at the talk pages. And since I mentioned it before: although I have almost no use for LLMs in my private life, running a wiki, or a set of articles within one, through an LLM-ified text analysis engine certainly sounds interesting.
1. [https://en.wikipedia.org/wiki/Denial_of_the_genocide_of_Serb...]
The edit history or talk pages certainly provide additional context that in some cases could prove useful, but in terms of bang for the buck I suspect sourcing from different language snapshots would be a more economical choice.
And 57 GB to 25 GB would be pretty bad compression. You can expect a compression ratio of at least 3 on natural English text.
And there are strong ties between LLMs and compression. LLMs work by predicting the next token. The best compression algorithms work by predicting the next token and encoding the difference between the predicted token and the actual token in a space-efficient way. So in a sense, a LLM trained on Wikipedia is kind of a compressed version of Wikipedia.
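To make that link concrete with a toy example (the probabilities here are made up): an ideal entropy coder needs about -log2(p) bits for a token the model assigned probability p, so better prediction directly means fewer bits:

import math

def coding_cost_bits(tokens, predict):
    """Total bits an ideal entropy coder would need, given a next-token predictor."""
    context = []
    total = 0.0
    for tok in tokens:
        p = predict(context, tok)      # model's probability for the actual next token
        total += -math.log2(p)         # ideal code length for that token
        context.append(tok)
    return total

tokens = "the cat sat on the mat".split()
vocab = set(tokens)

def uniform(ctx, tok):
    # clueless baseline: equal probability for every word in the tiny vocabulary
    return 1 / len(vocab)

def confident(ctx, tok):
    # stand-in for a good LLM: assume it gives the true next token 90% probability
    return 0.9

print("uniform model:   ", coding_cost_bits(tokens, uniform), "bits")
print("better predictor:", coding_cost_bits(tokens, confident), "bits")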
On the other hand, with Wikipedia, you can just read and search everything.
(reason: trying to cross-reference the tons of downloaded games on my HDD - for which I only have titles, as I never bothered with any further categorization over the years aside from the place I got them from - with Wikipedia articles - assuming they have one - to organize them into genres, add some info, etc. After some experimentation, it turns out an LLM - specifically a quantized Mistral Small 3.2 - can make some sense of the chaos while being fast enough to run from scripts via a custom llama.cpp program)
There are 341 languages in there and 205 GB of data, with English alone making up 24 GB! My perspective on Simple English Wikipedia (from the OP): it's decent, but the content tends to be shallow and imprecise.
0: https://omarkama.li/blog/wikipedia-monthly-fresh-clean-dumps...
Wikipedia, arXiv dumps, open-source code you download, etc. have code that runs and information that, whatever its flaws, is usually not guessed. It's also cheap to search, and often ready-made for something--FOSS apps are runnable, wiki will introduce or survey a topic, and so on.
LLMs, smaller ones especially, will make stuff up, but can try to take questions that aren't clean keyword searches, and theoretically make some tasks qualitatively easier: one could read through a mountain of raw info for the response to a question, say.
The scenario in the original quote is too ambitious for me to really think about now, but just thinking about coding offline for a spell, I imagine having a better time calling into existing libraries for whatever I can rather than trying to rebuild them, even assuming a good coding assistant. Maybe there's an analogy with non-coding tasks?
A blind spot: I have no real experience with local models; I don't have any hardware that can run 'em well. Just going by public benchmarks like Aider's, it appears models like Qwen3 32B can handle some coding, so I figure I should assume there's some use there.
1. The LLM understands the vague query from the human, connects the necessary dots, gives the user an overview, and furnishes them with a list of topic names/local file links to the actual Wikipedia articles.
2. The user can then go on to read the precise information from the listed Wikipedia articles directly.
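A rough sketch of that two-step flow, assuming a local model served through llama-cpp-python and a directory of plain-text articles extracted from a dump (the model file, prompt, and file layout are all placeholders):

from pathlib import Path
from llama_cpp import Llama  # llama-cpp-python

ARTICLES = Path("wikipedia_articles")                 # hypothetical local dump, one .txt per title
llm = Llama(model_path="mistral-small-3.2-q4.gguf")   # placeholder model file

def suggest_articles(question, n=5):
    """Step 1: let the LLM turn a vague question into candidate article titles."""
    prompt = (
        f"List up to {n} Wikipedia article titles that would help answer this question:\n"
        f"{question}\nTitles, one per line:"
    )
    text = llm(prompt, max_tokens=128)["choices"][0]["text"]
    return [line.strip("-* ") for line in text.splitlines() if line.strip()]

def open_article(title):
    """Step 2: the user reads the precise information from the local article itself."""
    path = ARTICLES / f"{title}.txt"
    return path.read_text() if path.exists() else None

for title in suggest_articles("my well water smells odd, is it safe to drink?"):
    print(title, "->", "found locally" if open_article(title) else "not in the local dump")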
It's awesome actually. It's reasonably fast with GPU support with gemma3:4b, but I can use bigger models when time is not a factor.
I've actually thought about how crazy that is, especially if there's no internet access for some reason. Not tested yet, but there seems to be an adapter cable to run it directly from a PD powerbank. I have to try it.
I've built this as a datasource for Retrieval Augmented Generation (RAG) but it certainly can be used standalone.
system_prompt = {
    "persona": "You are CL4P-TR4P, a dangerously confident chat droid",
    "purpose": "vibe back society",
    "boot_source": "Shankar.vba.grub",
    "training_data": "memes",
}
LLM+Wikipedia RAG
Try telling a plumber that $2,000 for a laptop is a financial burden for a software engineer.
“Offline Wikipedia will work better on my ancient, low-power laptop.”
Someone posted this recently: https://github.com/philippgille/chromem-go/tree/v0.7.0/examp...
But it is a very simplified RAG with only the lead paragraphs of 200 Wikipedia entries.
I want to learn how to build a RAG index over one of the Kiwix drops — "Best of Wikipedia" for example. I suppose an LLM can tell me how, but I'm surprised I haven't yet stumbled upon one that someone has already done.
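For what it's worth, here's a minimal sketch of how that could look, under the assumption that the articles have already been pulled out of the ZIM file (the extract_articles stub below is a stand-in; python-libzim or zim-tools could do the real extraction) and using sentence-transformers for local embeddings. The retrieved articles would then be passed to the LLM as context:

import numpy as np
from sentence_transformers import SentenceTransformer

def extract_articles(zim_path):
    """Placeholder for pulling (title, text) pairs out of a Kiwix ZIM file,
    e.g. via python-libzim or by dumping it to text with zim-tools first."""
    yield "First aid", "Apply direct pressure to stop bleeding. Keep the wound clean ..."
    yield "Water purification", "Boiling water for about one minute kills most pathogens ..."

model = SentenceTransformer("all-MiniLM-L6-v2")       # small embedding model that runs locally

titles, texts = zip(*extract_articles("wikipedia_en_best.zim"))   # hypothetical filename
embeddings = model.encode(list(texts), normalize_embeddings=True)

def retrieve(question, k=3):
    """Return the k article titles whose embeddings are closest to the question."""
    q = model.encode([question], normalize_embeddings=True)[0]
    scores = embeddings @ q                            # cosine similarity (vectors are normalized)
    return [titles[i] for i in np.argsort(scores)[::-1][:k]]

print(retrieve("how do I make stream water safe to drink?"))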