Tokasaurus: An LLM Inference Engine for High-Throughput Workloads

https://scalingintelligence.stanford.edu/blogs/tokasaurus/
55•rsehrlich•1h ago•5 comments

APL Interpreter – An implementation of APL, written in Haskell (2024)

https://scharenbroch.dev/projects/apl-interpreter/
32•ofalkaed•1h ago•9 comments

The impossible predicament of the death newts

https://crookedtimber.org/2025/06/05/occasional-paper-the-impossible-predicament-of-the-death-newts/
372•bdr•9h ago•123 comments

Show HN: Claude Composer

https://github.com/possibilities/claude-composer
11•mikebannister•28m ago•7 comments

Seven Days at the Bin Store

https://defector.com/seven-days-at-the-bin-store
118•zdw•7h ago•49 comments

SkyRoof: New Ham Satellite Tracking and SDR Receiver Software

https://www.rtl-sdr.com/skyroof-new-ham-satellite-tracking-and-sdr-receiver-software/
31•rmason•4h ago•1 comment

The Universal Tech Tree

https://asteriskmag.com/issues/10/the-universal-tech-tree
41•mitchbob•3d ago•27 comments

Show HN: ClickStack – Open-source Datadog alternative by ClickHouse and HyperDX

https://github.com/hyperdxio/hyperdx
139•mikeshi42•5h ago•31 comments

Programming language Dino and its implementation

https://github.com/dino-lang/dino
27•90s_dev•5h ago•8 comments

Show HN: String Flux – Simplify everyday string transformations for developers

https://stringflux.io
6•eaglepeak•1h ago•1 comment

Converge (YC S23): Well-capitalized New York startup seeks product developers

https://www.runconverge.com/careers
1•thomashlvt•2h ago

Eleven v3

https://elevenlabs.io/v3
119•robertvc•4h ago•75 comments

Understanding the PURL Specification (Package URL)

https://fossa.com/blog/understanding-purl-specification-package-url/
65•todsacerdoti•7h ago•50 comments

A proposal to restrict sites from accessing a user's local network

https://github.com/explainers-by-googlers/local-network-access
588•doener•1d ago•334 comments

Aurora, a foundation model for the Earth system

https://www.nytimes.com/2025/05/21/climate/ai-weather-models-aurora-microsoft.html
68•rmason•4h ago•16 comments

ICANN fee price hike of 11% [pdf]

https://itp.cdn.icann.org/en/files/contracted-parties-communications/attn-planned-variable-accreditation-fee-adjustment-24oct24-en.pdf
26•NoahZuniga•5h ago•7 comments

Apple Notes Will Gain Markdown Export at WWDC, and, I Have Thoughts

https://daringfireball.net/linked/2025/06/04/apple-notes-markdown
242•robenkleene•9h ago•144 comments

Phptop: Simple PHP resource profiler, safe and useful for production sites

https://github.com/bearstech/phptop
93•kadrek•14h ago•13 comments

Air Lab – A portable and open air quality measuring device

https://networkedartifacts.com/airlab/simulator
310•256dpi•15h ago•146 comments

Twitter's new encrypted DMs aren't better than the old ones

https://mjg59.dreamwidth.org/71646.html
177•tabletcorry•9h ago•167 comments

Rare black iceberg spotted off Labrador coast could be 100k years old

https://www.cbc.ca/news/canada/newfoundland-labrador/black-iceberg-labrador-coast-1.7551078
103•pseudolus•7h ago•44 comments

Show HN: iOS Screen Time from a REST API

https://www.thescreentimenetwork.com/api/
71•anteloper•5h ago•40 comments

Autonomous drone defeats human champions in racing first

https://www.tudelft.nl/en/2025/lr/autonomous-drone-from-tu-delft-defeats-human-champions-in-historic-racing-first
297•picture•1d ago•240 comments

OpenAI slams court order to save all ChatGPT logs, including deleted chats

https://arstechnica.com/tech-policy/2025/06/openai-says-court-forcing-it-to-save-all-chatgpt-logs-is-a-privacy-nightmare/
1064•ColinWright•1d ago•861 comments

LLMs and Elixir: Windfall or Deathblow?

https://www.zachdaniel.dev/p/llms-and-elixir-windfall-or-deathblow
224•uxcolumbo•1d ago•111 comments

From tokens to thoughts: How LLMs and humans trade compression for meaning

https://arxiv.org/abs/2505.17117
101•ggirelli•15h ago•21 comments

Show HN: Container Use for Agents

https://github.com/dagger/container-use
20•aluzzardi•5h ago•3 comments

parrot.live

https://github.com/hugomd/parrot.live
211•jasonthorsness•1d ago•51 comments

End of an Era: Landsat 7 Decommissioned After 25 Years of Earth Observation

https://www.usgs.gov/news/national-news-release/end-era-landsat-7-decommissioned-after-25-years-earth-observation
102•keepamovin•19h ago•39 comments

Gemini-2.5-pro-preview-06-05

https://deepmind.google/models/gemini/pro/
296•jcuenod•6h ago•177 comments

Ask HN: Has anybody built search on top of Anna's Archive?

274•neonate•1d ago
Wouldn't this basically give us Google Books and searchable Scihub at the same time?

What would it cost?

Comments

ggm•1d ago
You must mean free text search and page level return, because it already has full metadata indexing.

The thing is, AA doesn't hold the texts. They're disputed IPR, and even a derived work would be a legal target.

carlosjobim•1d ago
> a derived work would be a legal target.

Why would it? Google isn't prosecuted for indexing the web.

1970-01-01•1d ago
Oh it certainly is. https://www.reuters.com/sustainability/boards-policy-regulat...
trollbridge•1d ago
That's not prosecution for indexing the web. Google is being treated the same way AT&T was for telephones.
1970-01-01•1d ago
https://harvardlawreview.org/print/vol-138/united-states-v-g...
bhaney•1d ago
Honestly I don't think it would be that costly, but it would take a pretty long time to put together. I have a (few years old) copy of Library Genesis converted to plaintext and it's around 1TB. I think libgen proper was 50-100TB at the time, so we can probably assume that AA (~1PB) would be around 10-20TB when converted to plaintext. You'd probably spend several weeks torrenting a chunk of the archive, converting everything in it to plaintext, deleting the originals, then repeating with a new chunk until you have plaintext versions of everything in the archive. Then indexing all that for full text search would take even more storage and even more time, but still perfectly doable on commodity hardware.

The main barriers are going to be reliably extracting plaintext from the myriad of formats in the archive, cleaning up the data, and selecting a decent full text search database (god help you if you pick wrong and decide you want to switch and re-index everything later).
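A minimal sketch of that indexing step, assuming the documents have already been converted to plaintext files: SQLite's FTS5 is one commodity-hardware option (Tantivy and Lucene, mentioned elsewhere in the thread, are others):

    import sqlite3
    from pathlib import Path

    # Index a directory of already-extracted .txt files into a SQLite
    # FTS5 table. Assumes SQLite was compiled with FTS5 enabled
    # (true for most modern builds).

    def build_index(txt_dir: str, db_path: str = "books.db") -> None:
        con = sqlite3.connect(db_path)
        con.execute("CREATE VIRTUAL TABLE IF NOT EXISTS books USING fts5(title, body)")
        for path in Path(txt_dir).glob("*.txt"):
            text = path.read_text(encoding="utf-8", errors="replace")
            con.execute("INSERT INTO books (title, body) VALUES (?, ?)", (path.stem, text))
        con.commit()
        con.close()

    def search(query: str, db_path: str = "books.db"):
        con = sqlite3.connect(db_path)
        # bm25() ranks matches (lower is better); snippet() returns
        # highlighted context around the hit.
        rows = con.execute(
            "SELECT title, snippet(books, 1, '[', ']', '...', 10) "
            "FROM books WHERE books MATCH ? ORDER BY bm25(books) LIMIT 10",
            (query,),
        ).fetchall()
        con.close()
        return rows

Whether a single FTS5 file holds up at the 10-20TB-of-plaintext scale estimated above is exactly the "pick wrong and re-index everything later" risk; the sketch only shows the shape of the pipeline.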

notpushkin•1d ago
I think there are a couple of ways to improve it:

1. There are a lot of variants of the same book. We only need one for the index. Perhaps for each ISBN, select the format that's easiest to parse.

2. We can download, convert and index top 100K books first, launch with these, and then continue indexing and adding other books.

palmfacehn•1d ago
There should be a way to leverage compression when storing multiple editions of the same book.
bawolff•1d ago
From a good search perspective, though, you probably don't want 500 different versions of the same book popping up for a query.
palmfacehn•1d ago
Agreed. I would prefer to see a single result for a single title. The option of pursuing different editions should follow from there.
qingcharles•1d ago
And without some sort of weighting system, it wouldn't even know which one is the best one to show the user.
notpushkin•13h ago
We'll also need to consider that some versions might be easier to index even though the user would prefer another version. E.g. if we have a TXT and an EPUB, we might want to index the TXT (if it's clean enough), but present the user with the EPUB (with formatting and stuff).

But it’s not a huge problem actually: just link to the search page instead and let the user decide what they want to download.

throwup238•1d ago
How are you going to download the top 100k? The only reasonable way to download that many books from AA or Libgen is to use the torrents, which are sorted sequentially by upload date.

I tried to automate downloading just a thousand books and it was unbearably slow, from both IPFS and the mirrors. I ended up picking the individual files out of the torrents. Even just identifying or deduping the top 100k would be a significant task.

notpushkin•13h ago
For each book they store its exact location in the torrent files. You can see on the book page, e.g.:

collection “ia” → torrent “annas-archive-ia-acsm-n.tar.torrent” → file “annas-archive-ia-acsm-n.tar” (extract) → file “notesonsynthesis0000unse.pdf”

But probably you should get it from the database dumps they provide instead of hammering the website.

So you come up with a list of books you want to prioritize, search the DB for torrent name and file to download, download only the files you need, and extract them. You’ll probably end up with quite a few more books, which you may index or skip for now, but it is certainly doable.
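A hypothetical sketch of that lookup, assuming a newline-delimited JSON metadata dump; the field names here are made up for illustration, so check the actual dump schema:

    import json

    # Given a hypothetical newline-delimited JSON metadata dump and a set
    # of wanted titles, find which torrent (and inner path) holds each
    # one. Field names are illustrative, not the real schema.

    def locate(dump_path: str, wanted: set[str]) -> dict[str, tuple[str, str]]:
        found = {}
        with open(dump_path, encoding="utf-8") as f:
            for line in f:
                rec = json.loads(line)
                if rec.get("title") in wanted:
                    found[rec["title"]] = (rec["torrent_name"], rec["path_in_torrent"])
        return found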

WillAdams•1d ago
The thing is, for an ISBN, that is one edition, by one publisher and one can easily have the same text under 3 different ISBNs from one publisher (hardcover, trade paperback, mass-market paperback).

I count 80+ editions of J.R.R. Tolkien's _The Hobbit_ at:

https://tolkienlibrary.com/booksbytolkien/hobbit/editions.ph...

granted some predate ISBNs, one is the 3D pop-up version, so not a traditional text, and so forth, but filtering by ISBN will _not_ filter out duplicates.

There is also the problem of the same work being published under multiple titles (and also ISBNs) --- Hal Clement's _Small Changes_ was re-published as _Space Lash_ and that short story collection is now collected in:

https://www.goodreads.com/book/show/939760.Music_of_Many_Sph...

along with others.

notpushkin•13h ago
Hmmm, yeah, ISBN isn’t great for this. Is there a good way to deduplicate the books by their contents?
WillAdams•12h ago
LoC or Dewey Decimal with author and title (and edition?) should work.

I wish there was some better book cataloging/organizing scheme --- the Online Books Page uses LoC:

https://onlinebooks.library.upenn.edu/subjects.html

and is the most workable of the indices I've used.
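On the dedupe-by-contents question above: one standard technique is MinHash over word shingles. A minimal sketch, assuming plaintext has already been extracted (an illustration of the general method, not anything AA itself does):

    import hashlib
    import re

    # Two editions of the same text share most word 5-grams, so their
    # MinHash signatures agree in most positions even when front matter
    # and pagination differ.

    NUM_HASHES = 128

    def minhash_signature(text: str) -> list[int]:
        words = re.findall(r"\w+", text.lower())
        shingles = {" ".join(words[i:i + 5]) for i in range(len(words) - 4)}
        if not shingles:
            return [0] * NUM_HASHES
        sig = []
        for seed in range(NUM_HASHES):
            salt = seed.to_bytes(8, "little")
            sig.append(min(
                int.from_bytes(
                    hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(),
                    "little",
                )
                for s in shingles
            ))
        return sig

    def similarity(a: list[int], b: list[int]) -> float:
        # The fraction of matching positions estimates Jaccard similarity.
        return sum(x == y for x, y in zip(a, b)) / len(a)

Pairs scoring above roughly 0.8 are near-certain duplicates; author/title (or LoC, as suggested above) can then pick which edition to keep.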

serial_dev•1d ago
The main barriers for me would be:

1. Why? Who would use that? What’s the problem with the other search engines? How will it be paid for?

2. Potential legal issues.

The technical barriers are at least challenging and interesting.

Providing a service with significant upfront investment needs and no product or service vision, one that I'm likely to be sued over a couple of times a year, probably losing, with who knows what kind of punishment… I'll have to pass, unfortunately.

namlem•1d ago
It would be incredible for LLMs. Searching it, using it as training data, etc. Would probably have to be done in Russia or some other country that doesn't respect international copyright though.
sam_lowry_•1d ago
LLMs already use it, dude )
exe34•1d ago
I think one use would be to search for information directly from a book, rather than get a garbled/half-hallucinated version of it.
jdironman•1d ago
You don't need AI for that. I get the optimistic spirit of what you mean though.
mdp2021•1d ago
Optimized information retrieval of complex text is AI.
echollama•1d ago
Garbled/half-hallucinated is probably what you would've gotten 8-12 months ago, but nowadays I'm sure that with good prompting you can pull value from any book.
jxjnskkzxxhx•1d ago
Do you have a reason to believe this ain't already being done? I would assume that the big guys like openai are already training on basically all text in existence.
IlikeKitties•1d ago
In fact, Facebook torrented Anna's Archive and got busted for it, because of course they did:

https://torrentfreak.com/meta-torrented-over-81-tb-of-data-t...

HDThoreaun•1d ago
Every LLM maker probably did the same. Facebook just has disgruntled employees who leaked it
gpm•1d ago
Google goes around legally scanning every book they can get their hands on with books.google.com. Legally scanning every paper they can get their hands on with scholar.google.com.

I doubt they'd resort to piracy for what is basically the same information as what they've already legally acquired...

lcnPylGDnU4H9OF•1d ago
That is a good reason to think they did not, but it doesn't necessarily override reasons for them to do so. Perhaps it's dubious that the subset of data they could not legally get their hands on is an advantage for training, but I really don't know, and maybe nobody does. Given that, Google's execs may have been in favor of operations similar to Facebook's, and their lawyers may have been willing to approve them with similar justifications.
sneak•23h ago
Downloading a torrent isn't piracy if you are a license holder for the information that you are downloading.
gpm•22h ago
*If the license you have authorizes you to make a copy in that fashion.

But here, Google isn't a license holder. Google doesn't license the text in Google Books (unless something has changed since the lawsuits). Google simply legally acquires (buys, borrows, etc) a copy of the book and does things with it that the US courts have found are fair use and require no license.

Incidentally I believe the French courts disagreed and fined them half a million dollars or so and ordered them to stop in France.

ar_lan•1d ago
Wasn't this confirmed what Meta does?

https://www.forbes.com/sites/danpontefract/2025/03/25/author...

executesorder66•1d ago
> or some other country that doesn't respect international copyright though.

Like the US? OpenAI et al. don't give a shit.

TeMPOraL•1d ago
There's a difference between feeding massive amounts of copyrighted material to a training process that blends them thoroughly and irreversibly, and doing all that in-house, vs. offering people a service that indexes (and possibly partially rehosts) that material, enabling and encouraging users to engage directly in pirating concrete copyrighted works.
corgi912•1d ago
There's this famous phrase in Russian that was born out of a short interview with a woman, a strong Putin supporter, that's often been used as a sarcastic remark for pointing out someone's double standards and/or hypocrisy.

It can be roughly translated to "you don't understand, it's a completely different situation". That's what's constantly on my mind when I'm reading discussions like this one.

Everybody and their dog torrenting petabytes of data and getting away with it (Meta is the only one that got caught and they've still gotten away with doing it)?

The very same data poor American students were forced to commit suicide over? The same data that average American housewives were sued over for millions of dollars of "damages"? The same data that often gets random German plumbers or steelworkers to pay thousands of euros of "fines" to the copyright mafia so they won't get sued and have their lives ruined?

Yet when giant corporations are doing the exact same thing on a massive scale, it's fine? It's not even the same thing, an American student torrenting books isn't making any money off it, while Meta very much is.

Of course it's not the same, a simple-minded and poorly educated person like me isn't capable of understanding the difference. You keep believing in your moral superiority, the rest of the world has finally woken up.

Exoristos•1d ago
There are those who are in charge and those who aren't.
TeMPOraL•1d ago
Is there also a famous Russian phrase that translates to "details are irrelevant, it kinda looks similar to me therefore it's the same"? If not, there definitely should be.

The details are the entire point. Arguing that a corporation can get away doing something, while an individual can't, isn't useful, because there are great many of such somethings, and in most cases it turns out perfectly reasonable, once you dig into details.

sneak•23h ago
> The very same data poor American students were forced to commit suicide over

Leaving the rest of your argument aside, precisely nobody forced aaronsw to commit suicide.

TeMPOraL•14h ago
There's also the matter of 'aaronsw being one student, not the many "poor American students" GP implies. As far as I know, this was the only case of this type[0][1].

Honestly was too tired to point that out in my earlier reply, but that's exactly the kind of argument you get when people are not willing (or purposefully refusing) to consider details. Intentionally or not, you get bogus and highly manipulative statements.

A single case of a student activist fighting for freedom of communication and access to public goods for citizens, who ended up breaking under pressure over copyright from public/non-profit institutions (MIT, JSTOR) and the FBI, is not the same as what GP implied - many students, regular folks just like you and me, being forced to take their own lives due to legal consequences of pirating books in bulk. Nothing like the latter ever happened anyway.

We can do better than this.

(And even if we can't, I trust the courts can.)

--

[0] - Curiously, while doing some search now to be sure I didn't miss any similar case, I learned that JSTOR incident wasn't the first for 'aaronsw - apparently, he did the same thing a few years earlier with public court documents[1]; FBI investigated this too, and concluded he was legally in the clear. It's probably well-known to everyone here, but I somehow missed it, so #TodayILearned.

[1] - https://en.wikipedia.org/wiki/Aaron_Swartz#PACER

[2] - https://en.wikipedia.org/wiki/Edwin_Howard_Armstrong was the only one I could find that was even remotely related - an engineer and inventor who, in big part due to prolonged fighting over patents consuming all his time and money, suffered from a mental breakdown and committed suicide at 63.

southernplaces7•6h ago
>The same data that average American housewives were sued over for millions of dollars of "damages"? The same data that often gets random German plumbers or steelworkers to pay thousands of euros of "fines" to the copyright mafia so they won't get sued and have their lives ruined?

Honestly curious. Could you share any examples of these cases?

gosub100•1d ago
> that blends them thoroughly and irreversibly

It's okay, you can say 'laundering'

TeMPOraL•1d ago
I can, but I don't, because that's at best an unintended side effect.
r14c•1d ago
That's Uber's Gambit. Nothing is illegal for large enough corporations with strong network effects and deep pockets.
TeMPOraL•1d ago
That's not Uber's Gambit.

Uber was blatantly ignoring the local laws in order to break into the market and quickly defeat local competition. They used their infinite VC money supply to interfere with and delay investigations and enforcement, betting that if they do it fast enough, they'll have the general population on their side.

LLM vendors found and exploited[0] a legal uncertainty - correct me if I'm wrong, but AFAIK it still isn't settled whether or not their actions were actually illegal. Unlike Uber, LLM vendors aren't breaking into markets by ignoring the laws to outcompete incumbents, and burning stupid amounts of money just to get away with it. On the contrary, LLM vendors are simply providing an actually useful product, and charging a reasonable price for it, while reinvesting it into improving the product. Effects it has on other markets aside[1], their business model is just providing actual value in exchange for money. That's much more direct and honest than most of the tech industry.

The product itself is also different. Uber is selling a mirage, a "miracle" improvement that quickly turns not so, and is destined to eventually destroy the markets it disrupted. LLM vendors are developing and serving systems that provide actual value to users, directly and obviously so.

--

[0] - Probably walked into this without initially realizing it. No one complained 5-10 years ago, when the datasets were smaller and the resulting models had no real-world utility. It's only when the models became useful that some people started looking for ways to make them go away.

[1] - That's an unfortunate effect of it being a general AI tool, and would be the same regardless of how it was created.

sellmesoap•1d ago
Ironically, the low-tech infringing proposal would lead to more reliable results grounded in the raw contents of the data, using less compute/power and without the confidently incorrect sycophancy we see from the LLMs.
TeMPOraL•1d ago
Nah. It would just lead to more of classical search. Which is okay, as it always has been.

LLMs are not retrieval engines, and thinking them as such is missing most of their value. LLMs are understanding engines. Much like for humans, evaluating and incorporating knowledge is necessary to build understanding - however, perfect recall is not.

Another, arguably equivalent way of framing it: the job of an LLM isn't to provide you with the facts; its main job is to understand what you mean. The "WIM" in "DWIM". Making it do that does require stupid amounts of data and tons of compute in training. Currently, there's no better way, and the only alternative systems with similar capabilities are... humans.

IOW, it's not even an apples to oranges comparison, it's apples to gourmet chef.

freedomben•1d ago
> > or some other country that doesn't respect international copyright though.

> Like the US? OpenAI et al. don't give a shit.

OpenAI is not a country and therefore cannot make laws that don't respect international (or domestic) copyright. Also the US is a lot bigger than OpenAI and the big tech corps, and the law is very much on the side of copyright holders in the US.

diggan•1d ago
> the law is very much on the side of copyright holders in the US.

Remind me again what the status of the case is with Meta/Facebook using pirated material to train their proprietary LLMs, and even seeding the data back to the community while downloading it?

SR2Z•1d ago
In progress. Nobody is expecting the original protections afforded by copyright to apply here, but the fact that the material is pirated is less relevant than whether or not an LLM is a transformative use of the material.

We will almost certainly see copyright law weakened by the case, but I do not believe that FB will get off with no penalties.

gosub100•1d ago
The money is definitely on the side of big tech vs. book publishers. There may be a nominal settlement to end the matter, perhaps after a decade of litigation.
andrepd•1d ago
> Would probably have to be done in Russia or some other country that doesn't respect international copyright though.

Incredible: several years of major American AI companies showing that flouting copyright only matters if it's college kids torrenting shows or enthusiasts archiving bootlegs on whatcd, but if it's big corpos doing it, it's necessary for innovation.

Yet some people still believe "it would have to be done in evil Russia".

DataDaoDe•1d ago
OP does make an exaggerated statement - it's not like there aren't laws in Russia or something - and I largely agree with your sentiment. I think there are levels to this though, and it's pretty clear that Russia is much riskier than the USA when it comes to IP - just look up anything to do with insuring IP risk in Russia (here's one such example: https://baa.no/en/articles/i-have-ip-in-russia-is-my-ip-at-r...)

Also, according to the Office of the US Trade Representative, Russia is on the priority watch list of countries that do not respect IP [1], and post-2022, largely due to the war, Russia implemented measures negatively affecting IP rights. [2,3]

If you think it isn't the case and Russia is just as risky as the US when it comes to copyright and IP, I would be interested to know why.

1. https://ustr.gov/about/policy-offices/press-office/press-rel... 2. https://www.papula-nevinpat.com/executive-summary-the-ip-sit... 3. https://www.taftlaw.com/news-events/law-bulletins/russia-iss...

mdp2021•1d ago
> evil

In this case and context, a label like "evil" is a twisted interpretation.

carlosjobim•1d ago
> 1. Why? Who would use that?

Rather who would use a traditional search engine instead of a book search engine, when the quality of the results from the latter will be much superior?

People who need or want the highest quality information available will pay for it. I'd easily pay for it.

bbor•1d ago
1. It'd be for the scientific community (broadly-construed). Converting media that is currently completely un-indexed into plaintext and offering a suite of search features for finding content within it would be a game-changer, IMO! If you've ever done a lit review for any field other than ML, I'm guessing you know how reliant many fields are on relatively-old books and articles (read: PDFs at best, paper-only at worst) that you can basically only encounter via a) citation chains, b) following an author, or c) encyclopedias/textbooks.

2. I really don't see how this could ever lead to any kind of legal issue. You're not hosting any of the content itself, just offering a search feature for it. GoodReads doesn't need legal permission to index popular books, for example.

In general I get the sense that your comment is written from the perspective of an entrepreneur/startup mindset. I'm sure that's brought you meaning and maybe even some wealth, but it's not a universal one! Some of us are more interested in making something to advance humanity than something likely to make a profit, even if we might look silly in the process.

Aachen•1d ago
> I really don't see how this could ever lead to any kind of legal issue. You're not hosting any of the content itself, just offering a search feature for it.

You don't need to host copyrighted material. It's all about intent. The Pirate Bay is (imo correctly, even if I disagree with other aspects of copyright law and its enforcement) seen as a place where people go to find ways to not pay authors for their content. They never hosted a copyrighted byte, but they're banned in some form (DNS, IP, domain seizures) in many countries. Proxies of TPB too, so being like an ISP for such a site is already enough, whereas nobody is ordering Comcast's IP addresses blocked for providing access to websites with copyrighted material, because Comcast has no somewhat-provable intent to facilitate copyright infringement.

When I read the OP, I imagine this would link from the search results directly to Anna's archive and sci-hub, but I think you'd have to spin it as a general purpose search page and ideally not even mention AA was one of the sources, much less have links

(Don't get me wrong: everyone wants this except the lobby of journals that presently own the rights)

It would be a real shame if an anonymous third party that's definitely not the website operator made a Firefox add-on that illegitimately inserts these links into the search results page, though.

DaSHacka•1d ago
> When I read the OP, I imagine this would link from the search results directly to Anna's archive and sci-hub

You could just give users ISBNs or link to the book's metadata on openlibrary[0], both of which AA's native search already does.

[0] https://openlibrary.org/

carlosjobim•10h ago
Exactly.

1. The ISBN in cleartext

2. An isbn://123123123 link

3. A link to the book on a legal library borrowing service

4. A link to buy the book on Amazon

coolThingsFirst•1d ago
Yeah, but how does the search work - does it show a portion of the text? If it's a portion of the text, isn't that also part of the book?
1vuio0pswjnm7•1d ago
But he did not mention anything about creating a "service"

It could be his own copy for personal use

What if computers continue to become faster and storage continues to become cheaper; what if "large" amounts of data continue to become more manageable?

The data might seem large today, but it might not seem so large or unmanageable in the future.

tomthe•1d ago
I wonder if you could implement it with only static hosting?

We would need to split the index into a lot of smaller files that can be practically downloaded by browsers, maybe 20 MB each. The user types in a search query, the browser hashes the query and downloads the corresponding index file which contains only results for that hashed query. Then the browser sifts quickly through that file and gives you the result.

Hosting this would be cheap, but the main barriers remain.
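A minimal build-side sketch of that scheme: the shard a term lands in is simply a hash of the term modulo the shard count, so a browser client can compute the same shard name and fetch one small static file (parameters here are illustrative):

    import hashlib
    import json
    import os
    from collections import defaultdict

    NUM_SHARDS = 4096  # tune so each shard lands near the ~20 MB target

    def shard_of(term: str) -> int:
        digest = hashlib.sha256(term.encode()).digest()
        return int.from_bytes(digest[:4], "big") % NUM_SHARDS

    def build_shards(postings: dict[str, list[str]], out_dir: str = "shards") -> None:
        # postings maps each term to the ids of documents containing it.
        shards: dict[int, dict[str, list[str]]] = defaultdict(dict)
        for term, doc_ids in postings.items():
            shards[shard_of(term)][term] = doc_ids
        os.makedirs(out_dir, exist_ok=True)
        for shard_id, terms in shards.items():
            with open(f"{out_dir}/{shard_id}.json", "w") as f:
                json.dump(terms, f)

    # The client hashes the query term the same way and fetches only
    # shards/<shard_id>.json from static hosting.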

ThatPlayer•1d ago
I've done something similar with a statically hosted site I'm working on. I opted not to reinvent the wheel and just use WASM SQLite in the browser. SQLite already splits the database into fixed-size pages, so the driver, using HTTP range requests, can download only the required pages. Just have to make good indexes.

I can even use SQLite's full-text search capabilities!
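One build-side detail of that setup (a sketch under my own assumptions, not necessarily how ThatPlayer configured it): the driver fetches whole pages over HTTP, so the database's page size bounds the smallest possible request, and on an existing file a new page size only takes effect after a VACUUM:

    import sqlite3

    # Rewrite an existing SQLite file with a smaller page size so a
    # range-request driver fetches less per page.
    con = sqlite3.connect("books.db")
    con.execute("PRAGMA page_size = 4096")
    con.execute("VACUUM")  # rebuilds the file using the new page size
    con.close()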

showerst•1d ago
How would that scale to 10TB+ of plain text though? Presumably the indexes would be many gigabytes, especially with full text search.
wolfgang42•1d ago
The client only needs to get indexes for the specific search; if the index is just a list of TF-IDF term scores per document (which gets you a very reasonable start on search relevance) some extremely back-of-the-envelope math leads me to guess at an upper bound in the low tens of megabytes per (non-stopword) term, which seems doable for a client to download on demand.
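A rough sanity check of that envelope, under assumed numbers of my own: with ~40 million documents, a posting of a 4-byte document id plus a 4-byte TF-IDF score is 8 bytes, so a term appearing in 1% of documents needs 400,000 × 8 bytes ≈ 3.2 MB, and one appearing in 10% needs about 32 MB: low tens of megabytes per common term, before compression.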
qcic•1d ago
Super interesting.
Aachen•1d ago
I wonder if you could take this one step further and have opaque queries using homomorphic encryption on the index and then somehow extracting ranges around the document(s) you're interested in

Inspired by: "Show HN: Read Wikipedia privately using homomorphic encryption" https://news.ycombinator.com/item?id=31668814

greggsy•1d ago
It's trivial to normalise the various formats, and there are a few libraries and ML models to help parse PDFs. I was tinkering with something like this for academic papers in Zotero, and the main issues I ran into were words spilling over to the next page, and footnotes. I totally gave up on that endeavour several years ago, but the tooling has probably matured considerably since then.

As an example, all the academic paper hubs have been using this technology for decades.

I'd wager that all of the big Gen AI companies have planned to use this exact dataset, and many of them probably have already.

fake-name•1d ago
> It's trivial to normalise the various formats,

Ha. Ha. ha ha ha.

As someone who has pretty broadly tried to normalize a pile of books and documents I have legitimate access to: no, it is not.

You can get good results 80% of the time, usable but messy results 18% of the time, and complete garbage the remaining 2%. More effort seems to only result in marginal improvements.

bawolff•1d ago
98% sounds good enough for the use case suggested here.
pastage•1d ago
Writing good validators for data is hard. You can be 100% sure that there will be bad data in those 98%. From my own experience: I thought I had 50% of the books converted correctly, and then I found I still had junk data and gave up. It is not an impossible problem; I just was not motivated to fix it on my own. Working with your own copies is fine, but when you try to share that you get into legal issues that I just do not feel are that interesting to solve.

Edit: my point is that I would like to share my work but that is hard to do in a legal way. That is the main reason I gave up.

landl0rd•1d ago
2% garbage, if some of that garbage falls out the right way, is more than enough to seriously degrade search result quality.
carlosjobim•1d ago
It's better than nothing, and nothing is what we currently have.
trollbridge•1d ago
Decent storage is $10/TB, so for $10,000 you could just keep the entire 1PB of data.

A rather obvious question is whether someone has trained an LLM on this archive yet.

moffkalast•1d ago
A rather obvious answer is that Meta is currently being sued for training Llama on Anna's Archive.

You can be practically certain that every notable LLM has been trained on it.

rthnbgrredf•1d ago
> You can be practically certain that every notable LLM has been trained on it.

But only Meta was kind of not so smart and publicly admitted it.

nextos•1d ago
AFAIK, Z-Library already does this, to some extent. Basic full-text queries do search inside the body of books and articles.

It's a bit smaller than Anna's Archive, as they do host their own collections. From some locations, it's only easy to access through Tor.

bravesoul2•1d ago
This works in various search engines

site:annas-archive.org avocado

qingcharles•1d ago
It's not exactly clear, but OP is asking about indexing the content of all the documents, not the metadata (e.g. titles etc)
imdavidsantiago•1d ago
As far as I know, no one has fully implemented full-text search directly over Anna's Archive. Technically it’s feasible with tools like Meilisearch, Elasticsearch, or Lucene, but the main challenges are:

    Converting all documents (PDFs, EPUBs, etc.) to clean plaintext.

    Indexing at scale efficiently.

    Managing potential legal issues.
Z-Library does something similar, but it’s smaller in scope and doesn't integrate AA’s full catalog.
bendangelo•1d ago
I've done something like this before. Meilisearch will not be viable, because it indexes very slowly and takes up a lot of space.

In my experience only Tantivy can index this much data. Check out Lnx.

sam_lowry_•1d ago
Lucene would do fine as well, I guess. As much as I like the author of Tantivy, it is a toy compared to Lucene.
_ache_•1d ago
To manage the legal issues, you just have to put AI on the search. "AI search".
DaSexiestAlive•1d ago
Mebbe easier to just search Amazon or Goodreads, like site:amazon.ca <query words>, as someone has mentioned below.

Every book has a 10- or 13-digit ISBN to identify it. Unless it's some self-pub/amateur-hour situation by some paranoid prepper living in a Faraday-cage-protected cabin in Arkansas or Florida, it's likely a publication with a title, an author and an ISBN.

pigeons•1d ago
What about pre-1970 books?
trollbridge•1d ago
A self-pub amateur-hour book printed by a paranoid prepper living in a faraday cage is exactly the type of book I'd probably enjoy reading, but I doubt these exist anymore.
pigeons•17h ago
I know, remember Loompanics?
renegat0x0•1d ago
I have found some search engines, but I do not think they're for Anna's.

https://searchthearxiv.com/

https://refseek.com/

https://arxivxplorer.com/

simgt•1d ago
Related question: has Anna's Archive been thoroughly filtered for non-copyright-related illegal material? Pedo, terrorism, etc. I've considered downloading a few chunks of it, but I'm worried about ending up with content I really don't want to be anywhere near.
niux•1d ago
How might you inadvertently download illegal content while searching for legal content?
lukan•1d ago
He said he wants to download lots of it in general, not specific things. Legit question whether you'd end up with dark material.

I would assume pedo stuff is not really there, but the anarchist cookbook and the like likely will be.

oguz-ismail•1d ago
>I would assume pedo stuff is not really there

Search for "lolicon"

lukan•1d ago
Well, I won't. But does it contain just text or real pictures? That would make a big legal difference I assume.
qingcharles•23h ago
It makes very little legal difference in a lot of jurisdictions. They are considered the same.
jxjnskkzxxhx•1d ago
I thought that was anime pictures...?
areyourllySorry•1d ago
A subset of that, yes. But that label implies more than just that.
qingcharles•23h ago
That doesn't matter. They are still illegal in a great number of jurisdictions, including large portions of the USA.
jxjnskkzxxhx•16h ago
Citation needed.
DocTomoe•1d ago
Considering the anarchist cookbook is just a rebranded selection of freely-available US Army Field Manuals, ... I don't see the problem.
lukan•1d ago
I don't either, but many states have laws regarding books on how to build bombs and they might get enforced more than copyright.
srum•1d ago
You can get in trouble for having it in the UK (though not necessarily convicted)

https://news.sky.com/story/anarchist-cookbook-case-student-j...

https://www.bbc.co.uk/news/uk-england-oxfordshire-45841291

gosub100•1d ago
Not saying you're deceiving but can you show me where a state has made a book about bombs illegal? It seems like that would be a slam dunk 1A violation. And yes I'm aware that states willfully violate 2A but I don't want to discuss it here.
lukan•1d ago
Not saying you cannot read, but if you did, you'd see the other answer to my comment literally has such an example.

Germany has been like this for a few years as well.

Not all states are within the US.

gosub100•1d ago
You're not saying I cannot read, but that type of inflection is uncalled for. You have been reported to the mods.
lukan•15h ago
Uncalled for?

"Not saying you're deceiving"

Right next to your answer, where you implied I might be deceiving, there was already an answer showing that apparently I was not. So yes, the mocking of my comment was not up to HN standards, but you don't see how you started it?

tomhow•10h ago
I think you're each setting the other off and being a bit overreactive to each other's comments, and I think there may be a misunderstanding of the other's intent. We still need to make an effort to observe the guidelines even if a reply to one of our comments seems hostile. Sometimes it's best just to stop.
lukan•9h ago
Indeed. Actually, I did not intend to attack. My intention was to show mild irony toward something I perceived as an attack.

Those nuances easily get lost in text, I know, but that my post got flagged and his initial one did not is something I really did not like, and it angered me a bit. But I can live with that without making a drama out of it. Thanks for trying to mediate.

tomhow•9h ago
Thanks for being good about it. I've turned off the flags on that comment.
bilekas•1d ago
I'm still not sure the question makes much sense. If it's a general "I want to support the project, so I want to seed a large chunk", okay, I guess it's your due diligence to check, but there is a reporting feature built in; if something is found, report it.

Aside from that, if you're searching for specific content, the question is moot I guess.

I guess my confusion is what distinguishes this from any other torrent? That is, if the submitted content is submitted at all.

lukan•1d ago
I understood it as: he or she wants to download large chunks of potentially interesting books for offline use, or for when Anna goes down. So a broad filter, not for seeding.

But thanks for the explanation that there is a reporting feature built in.

qingcharles•23h ago
There is illegal stuff in there, because some of the books they've swept up into the archive have been retconned to be illegal. A lot of obscene material wasn't illegal until into the 80s, and these shadow libraries are scooping up everything without checking. I don't know what they do if you report it to them.

For instance, there are multiple issues of Playboy with underage models. All the archives of British tabloid newspapers had to be purged in 2003 after the laws changed there, etc.

bordercases•1d ago
Seeding torrent blocks.
bilekas•1d ago
This is a really strange question, to be honest; you could ask this about literally any download, let alone simple torrents of documents.
gosub100•1d ago
It's the textbook example of the "chilling effect" created by mass surveillance.
dns_snek•1d ago
Download everything, we know that laws don't apply when you do it on a large enough scale. Not legal advice.
lukan•1d ago
I think you got that wrong. Laws only don't apply if you are large enough. (Like Meta)
gosub100•1d ago
The team that curates it is very dedicated and wouldn't do such a thing, not least because they don't want the heat from it.

I'm not sure what other forms of information are illegal beyond CP. In the US, bomb-making instructions are not illegal. In other dictatorships or zealous religious regimes, information about democracy or works that insult Islam might be illegal.

allenleein•1d ago
Has anyone explored a different angle — like mapping out the 1,000 most frequently mentioned or cited books (across HN, Substack, Twitter, etc.), then turning their raw content into clean, structured data optimized for LLMs? Imagine curating these into thematic shelves — say, “Bill Gates’ Bookshelf” or “HN Canon” — and building an indie portal where anyone can semantically search across these high-signal texts. Kind of like an AI-searchable personal library of the internet’s favorite books.
DocTomoe•1d ago
Well, there's this: https://hacker-recommended-books.vercel.app/category/0/all-t...
laserstrahl•1d ago
There's an Android app called Openlib. [1]

Description:

Openlib is an open-source app to download and read books from a shadow library (Anna's Archive). The app has a built-in reader to read books.

As Anna's Archive doesn't have an API, the app works by sending requests to Anna's Archive and parsing the responses into objects. The app extracts the mirrors from the responses, downloads the book and stores it in the application's document directory.

Note: The app requires a VPN to function properly. Without a VPN the app might show the captcha-required page even after completing the captcha.

Main Features:

Trending Books

Download And Read Books With In-Built Viewer

Supports Epub And Pdf Formats

Open Books With Your Favourite Ebooks Reader

Filter Books

Sort Books

[1]: https://f-droid.org/de/packages/com.app.openlib/

petra•1d ago
Z-Library has a keyword search. Personally I didn't find it too useful, especially given Google Books exists. It's not easy to create a quality book search engine.
podgorniy•1d ago
There is a search solution for zipped fb2 files. Not exactly what you need, but it has potential.

The project has a similar story to Anna's Archive. There is 0.5 TB of archived books, and the project creates an index of all the books with text, title and author search capabilities, and provides an HTML UI for search and reading. On a weak machine it takes about 2 hours to build that index.

So if you have zipped archives of fb2 files, you can use the project to create a web UI with search for those files, without needing enough space to unpack all the files.

You'll have to translate some Russian though to get the setup instructions.

https://gitlab.com/opennota/fb2index/-/blob/master/README.ru...

tangus•1d ago
But fb2 files are marked-up text, which is (relatively) trivial to index. The bulk of Anna's Archive's books are made from scanned images.
jmb99•1d ago
Worth mentioning that 0.5TB is tiny compared to Anna’s, which currently sits around 1.1PB.
Quin-tus•1d ago
https://book-finder.tiiny.site/

More: https://rentry.co/StellaOctangulaIsCool

HelloUsername•1d ago
> https://book-finder.tiiny.site/

That just redirects to https://yandex.com/search

Quin-tus•1d ago
Well, yeah, but with a specific query with which you can search multiple libraries.
xbmcuser•1d ago
Facebook did it; its AI is trained on it, so you can use that.
teekert•1d ago
Probably this was already done at Google, Meta, X and OpenAI, before training their LLMs.
maartin0•1d ago
There's actually a section on the Wikipedia page that explicitly says DeepSeek was trained on it.
net01•1d ago
They did! They conducted a competition, https://annas-archive.org/blog/all-isbns-winners.html, in which a few submissions exceeded the minimum requirements and implemented a good search tool & visualiser.
carlosjobim•1d ago
How is this a text search of the books?
outside1234•1d ago
The original question the poster asked was not clear, so this is also an answer to it. It depends on what they meant by "search".
perdomon•1d ago
I think OP was more interested in the ability to text search through the contents. This competition was great and some of the entries were really informative, but none of them included a full text search of the contents of all books.
carlosjobim•1d ago
A functional full text search of the shadow libraries would be massive. It would have a comparable impact on humanity to the impact AI will have. And it's probably not difficult technically. Let's start a project to get this done!

Edit: I have had this exact project as my dream for a couple of years, and even experimented a little bit. But I'm not a programmer, so I can only understand theoretically what would be needed for this to work.

Anybody with the same dream, send me an e-mail to booksearch@fastmail.com and let's see what we can do to get the ball rolling!

1970-01-01•1d ago
Don't do it. Just because you can, doesn't mean you should. Do you know if they have anywhere near the legal muscle to push back the flood of legal notices if you did this? Assume it survives because it doesn't have a wide open barn door to the public.
bethekidyouwant•1d ago
It wouldn't be called full-text search of AA; it would be called full-text search of every book in the world.
1970-01-01•1d ago
You are asking a judge to consider that a book is ok to scrape because it's part of a much larger collection of books, perhaps the biggest and best collection, and therefore it's all OK because at scale means good.
calibas•1d ago
Google already successfully argued in court that creating an online search index of books constitutes fair use: https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....
1970-01-01•1d ago
The index isn't the full content. The OP's search is about indexing the entire contents of the books for verbatim information retrieval.
hbartab•1d ago
Seeing as OpenAI & Co were trained on torrented books from similar places, I'm sure that ChatGPT provides an adequate search layer on top of Anna's Archive, though it is not as free from confabulations as one might hope for in a search engine.

Edit: grammar

underlines•1d ago
Yes, every major LLM company did it:

Illegally using Anna's Archive, The Pile, Common Crawl, their own crawls, books2, libgen, etc., embedding it all into high-dimensional space and doing next-token prediction on it.

whimsicalism•1d ago
There's only a small number of people willing to put in significant engineering hours for something that would be illegal and non-monetizable.
coolThingsFirst•1d ago
No, because you can't avert the legal issues of doing that.
jimmydoe•1d ago
Facebook said they leeched it, and Anna once mentioned that a few companies, most of them from China, paid for it, so I assume the answer is yes: someone has the data and very likely built the search, but no one will open it given the legal and reputational risk.
fazlerocks•21h ago
The indexing costs would be nuts - Anna's Archive is like 200TB+ and growing fast. Even with decent search infra you're looking at serious compute/storage costs. Plus there's the obvious legal stuff that would make this a no-go for most companies with anything to lose. The decentralized thing they're doing probably makes way more sense.
carlosjobim•9h ago
How serious a compute/storage cost? $10,000 per month? $100,000 per month?