What would it cost?
The main barriers are going to be reliably extracting plaintext from the myriad of formats in the archive, cleaning up the data, and selecting a decent full text search database (god help you if you pick wrong and decide you want to switch and re-index everything later).
1. There are a lot of variants of the same book. We only need one for the index. Perhaps for each ISBN, select the format easiest to parse.
2. We can download, convert and index the top 100K books first, launch with these, and then continue indexing and adding other books.
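A minimal sketch of the "one format per ISBN" idea from the list above. The format ranking and the `(isbn, fmt, path)` record shape are assumptions made for illustration, not anything the archive defines.

```python
# Assumed preference order: formats that are usually easier to turn into
# plaintext come first. This ranking is a guess, tune it to your own parser.
FORMAT_PREFERENCE = ["txt", "epub", "fb2", "mobi", "pdf", "djvu"]


def pick_one_per_isbn(records):
    """Keep a single file per ISBN, preferring the easiest format to parse.

    `records` is an iterable of (isbn, fmt, path) tuples (hypothetical shape).
    Returns a dict mapping isbn -> (fmt, path).
    """
    rank = {fmt: i for i, fmt in enumerate(FORMAT_PREFERENCE)}
    best = {}
    for isbn, fmt, path in records:
        # Unknown formats sort after every known one.
        score = rank.get(fmt.lower(), len(FORMAT_PREFERENCE))
        if isbn not in best or score < best[isbn][0]:
            best[isbn] = (score, fmt, path)
    return {isbn: (fmt, path) for isbn, (_, fmt, path) in best.items()}
```

This only collapses files that share an ISBN; different editions of the same work carry different ISBNs and would still show up separately.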
But it’s not a huge problem actually: just link to the search page instead and let the user decide what they want to download.
I tried to automate downloading just a thousand books and it was unbearably slow, from both IPFS and the mirrors. I ended up picking the individual files out of the torrents. Even just identifying or deduping the top 100k would be a significant task.
collection “ia” → torrent “annas-archive-ia-acsm-n.tar.torrent” → file “annas-archive-ia-acsm-n.tar” (extract) → file “notesonsynthesis0000unse.pdf”
But probably you should get it from the database dumps they provide instead of hammering the website.
So you come up with a list of books you want to prioritize, search the DB for torrent name and file to download, download only the files you need, and extract them. You’ll probably end up with quite a few more books, which you may index or skip for now, but it is certainly doable.
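A rough sketch of that planning step, assuming the database dump has already been loaded into rows with `book_id`, `torrent` and `file` keys. Those field names are hypothetical; the real dump schema may look quite different.

```python
from collections import defaultdict


def plan_downloads(wanted_ids, catalog):
    """Group the files we want by the torrent that contains them.

    `catalog` is an iterable of dicts with the hypothetical keys
    "book_id", "torrent" and "file". Returns {torrent: [file, ...]},
    so each torrent only needs to be opened once.
    """
    wanted = set(wanted_ids)
    plan = defaultdict(list)
    for row in catalog:
        if row["book_id"] in wanted:
            plan[row["torrent"]].append(row["file"])
    return dict(plan)
```

The resulting per-torrent file lists could then be handed to a torrent client that supports selective file download.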
I count 80+ editions of J.R.R. Tolkien's _The Hobbit_ at:
https://tolkienlibrary.com/booksbytolkien/hobbit/editions.ph...
granted some predate ISBNs, one is the 3D pop-up version, so not a traditional text, and so forth, but filtering by ISBN will _not_ filter out duplicates.
There is also the problem of the same work being published under multiple titles (and also ISBNs) --- Hal Clement's _Small Changes_ was re-published as _Space Lash_ and that short story collection is now collected in:
https://www.goodreads.com/book/show/939760.Music_of_Many_Sph...
along with others.
I wish there was some better book cataloging/organizing scheme --- the Online Books Page uses LoC:
https://onlinebooks.library.upenn.edu/subjects.html
and is the most workable of the indices I've used.
1. Why? Who would use that? What’s the problem with the other search engines? How will it be paid for?
2. Potential legal issues.
The technical barriers are at least challenging and interesting.
Providing a service with significant upfront investment needs and no product or service vision, one that I’ll likely be sued over a couple of times a year, probably losing with who knows what kind of punishment… I’ll have to pass, unfortunately.
https://torrentfreak.com/meta-torrented-over-81-tb-of-data-t...
I doubt they'd resort to piracy for what is basically the same information as what they've already legally acquired...
But here, Google isn't a license holder. Google doesn't license the text in Google Books (unless something has changed since the lawsuits). Google simply legally acquires (buys, borrows, etc) a copy of the book and does things with it that the US courts have found are fair use and require no license.
Incidentally I believe the French courts disagreed and fined them half a million dollars or so and ordered them to stop in France.
https://www.forbes.com/sites/danpontefract/2025/03/25/author...
Like the US? OpenAI et al. don't give a shit.
It can be roughly translated to "you don't understand, it's a completely different situation". That's what's constantly on my mind when I'm reading discussions like this one.
Everybody and their dog torrenting petabytes of data and getting away with it (Meta is the only one that got caught and they've still gotten away with doing it)?
The very same data poor American students were forced to commit suicide over? The same data that average American housewives were sued over for millions of dollars of "damages"? The same data that often gets random German plumbers or steelworkers to pay thousands of euros of "fines" to the copyright mafia so they won't get sued and have their lives ruined?
Yet when giant corporations are doing the exact same thing on a massive scale, it's fine? It's not even the same thing, an American student torrenting books isn't making any money off it, while Meta very much is.
Of course it's not the same, a simple-minded and poorly educated person like me isn't capable of understanding the difference. You keep believing in your moral superiority, the rest of the world has finally woken up.
The details are the entire point. Arguing that a corporation can get away with doing something, while an individual can't, isn't useful, because there are a great many such somethings, and in most cases it turns out to be perfectly reasonable once you dig into the details.
Leaving the rest of your argument aside, precisely nobody forced aaronsw to commit suicide.
Honestly was too tired to point that out in my earlier reply, but that's exactly the kind of argument you get when people are not willing (or purposefully refusing) to consider details. Intentionally or not, you get bogus and highly manipulative statements.
A single case of a student activist fighting for freedom of communication and access to public goods for citizens, ending up breaking under pressure from public/non-profit institutions MIT, JSTOR, FBI over copyright, is not the same as what GP implied - many students, regular folks just like you and me, being forced to take their own lives due to legal consequences of pirating books in bulk. Nothing like the latter ever happened anyway.
We can do better than this.
(And even if we can't, I trust the courts can.)
--
[0] - Curiously, while doing some search now to be sure I didn't miss any similar case, I learned that JSTOR incident wasn't the first for 'aaronsw - apparently, he did the same thing a few years earlier with public court documents[1]; FBI investigated this too, and concluded he was legally in the clear. It's probably well-known to everyone here, but I somehow missed it, so #TodayILearned.
[1] - https://en.wikipedia.org/wiki/Aaron_Swartz#PACER
[2] - https://en.wikipedia.org/wiki/Edwin_Howard_Armstrong was the only one I could find that was even remotely related - an engineer and inventor who, in big part due to prolonged fighting over patents consuming all his time and money, suffered from a mental breakdown and committed suicide at 63.
Honestly curious. Could you share any examples of these cases?
It's okay, you can say 'laundering'
Uber was blatantly ignoring the local laws in order to break into the market and quickly defeat local competition. They used their infinite VC money supply to interfere with and delay investigations and enforcement, betting that if they do it fast enough, they'll have the general population on their side.
LLM vendors found and exploited[0] a legal uncertainty - correct me if I'm wrong, but AFAIK it still isn't settled whether or not their actions were actually illegal. Unlike Uber, LLM vendors aren't breaking into markets by ignoring the laws to outcompete incumbents, and burning stupid amounts of money just to get away with it. On the contrary, LLM vendors are simply providing an actually useful product, and charging a reasonable price for it, while reinvesting it into improving the product. Effects it has on other markets aside[1], their business model is just providing actual value in exchange for money. That's much more direct and honest than most of the tech industry.
The product itself is also different. Uber is selling a mirage, a "miracle" improvement that quickly turns not so, and is destined to eventually destroy the markets it disrupted. LLM vendors are developing and serving systems that provide actual value to users, directly and obviously so.
--
[0] - Probably walked into this without initially realizing it. No one complained 5-10 years ago, when the datasets were smaller and the resulting models had no real-world utility. It's only when the models became useful that some people started looking for ways to make them go away.
[1] - That's an unfortunate effect of it being a general AI tool, and would be the same regardless of how it was created.
LLMs are not retrieval engines, and thinking of them as such is missing most of their value. LLMs are understanding engines. Much like for humans, evaluating and incorporating knowledge is necessary to build understanding - however, perfect recall is not.
Another, arguably equivalent way of framing it: the job of an LLM isn't to provide you with the facts; its main job is to understand what you mean. The "WIM" in "DWIM". Making it do that does require stupid amounts of data and tons of compute in training. Currently, there's no better way, and the only alternative systems with similar capabilities are... humans.
IOW, it's not even an apples to oranges comparison, it's apples to gourmet chef.
> Like the US? OpenAI et al. don't give a shit.
OpenAI is not a country and therefore cannot make laws that don't respect international (or domestic) copyright. Also the US is a lot bigger than OpenAI and the big tech corps, and the law is very much on the side of copyright holders in the US.
Remind me again what the status of the case is with Meta/Facebook using pirated material to train their proprietary LLMs, and even seeding the data back to the community while downloading it?
We will almost certainly see copyright law weakened by the case, but I do not believe that FB will get off with no penalties.
Incredible, several years of major American AI companies showing that flouting copyright only matters if it's college kids torrenting shows or enthusiasts archiving bootlegs on whatcd, but if it's big corpos doing it it's necessary for innovation.
Yet some people still believe "it would have to be done in evil Russia".
Also, according to the Office of the US Trade Representative, Russia is on the priority watch list of countries that do not respect IP [1], and post-2022, largely due to the war, Russia implemented measures negatively affecting IP rights. [2,3]
If you think it isn't the case and Russia is just as risky as the US when it comes to copyright and IP, I would be interested to know why.
1. https://ustr.gov/about/policy-offices/press-office/press-rel...
2. https://www.papula-nevinpat.com/executive-summary-the-ip-sit...
3. https://www.taftlaw.com/news-events/law-bulletins/russia-iss...
In this case and context, a label like "evil" is a twisted interpretation.
Rather who would use a traditional search engine instead of a book search engine, when the quality of the results from the latter will be much superior?
People who need or want the highest quality information available will pay for it. I'd easily pay for it.
2. I really don't see how this could ever lead to any kind of legal issue. You're not hosting any of the content itself, just offering a search feature for it. GoodReads doesn't need legal permission to index popular books, for example.
In general I get the sense that your comment is written from the perspective of an entrepreneur/startup mindset. I'm sure that's brought you meaning and maybe even some wealth, but it's not a universal one! Some of us are more interested in making something to advance humanity than something likely to make a profit, even if we might look silly in the process.
You don't need to host copyrighted material. It's all about intent. The Pirate Bay is (imo correctly, even if I disagree with other aspects about copyright law and its enforcement) seen as a place where people go to find ways to not pay authors for their content. They never hosted a copyrighted byte but they're banned in some form (DNS, IP, domain seizures) in many countries. Proxies of TPB also, so being like an ISP for such a site is already enough, whereas nobody is ordering blocks of Comcast's IP addresses for providing access to websites with copyrighted material because they didn't have a somewhat-provable intent to provide copyright infringement
When I read the OP, I imagine this would link from the search results directly to Anna's archive and sci-hub, but I think you'd have to spin it as a general purpose search page and ideally not even mention AA was one of the sources, much less have links
(Don't get me wrong: everyone wants this except the lobby of journals that presently own the rights)
It would be a real shame if an anonymous third party that's definitely not the website operator made a Firefox add-on that illegitimately inserts these links to search results page though
You could just give users ISBNs or link to the book's metadata on openlibrary[0], both of which AA's native search already does.
1. The ISBN in cleartext
2. An isbn://123123123 link
3. A link to the book on a legal library borrowing service
4. A link to buy the book on Amazon
It could be his own copy for personal use
What if computers continue to become faster and storage continues to become cheaper; what if "large" amounts of data continue to become more manageable?
The data might seem large today, but it might not seem large or unmanageable in the future
We would need to split the index into a lot of smaller files that can be practically downloaded by browsers, maybe 20 MB each. The user types in a search query, the browser hashes the query and downloads the corresponding index file which contains only results for that hashed query. Then the browser sifts quickly through that file and gives you the result.
Hosting this would be cheap, but the main barriers remain.
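The static, hash-sharded index described above could look roughly like this. It is a sketch under assumptions: it hashes each search term rather than the whole query, `NUM_SHARDS` is an arbitrary number, and `fetch_shard` stands in for whatever HTTP download the browser would do.

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 1024  # assumption: chosen so each shard file stays small


def shard_for(term: str) -> int:
    """Map a search term to a shard number via a stable hash."""
    digest = hashlib.sha256(term.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def build_shards(postings):
    """Split {term: [book_id, ...]} into {shard_id: {term: [book_id, ...]}}.

    Each shard would be written out as one static file on a cheap host.
    """
    shards = defaultdict(dict)
    for term, book_ids in postings.items():
        shards[shard_for(term)][term.lower()] = book_ids
    return dict(shards)


def lookup(term, fetch_shard):
    """Client side: fetch only the shard for this term, then scan it."""
    shard = fetch_shard(shard_for(term))
    return shard.get(term.lower(), [])
```

The client never sends the query itself to a search backend; it only requests one static file whose number is derived from the hash.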
I can even use Sqlite's full-text search capabilities!
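For reference, a small sketch of SQLite full-text search from Python. It assumes the SQLite library bundled with your Python was compiled with the FTS5 extension (common, but not guaranteed); the table and column names are made up for the example.

```python
import sqlite3


def build_index(rows):
    """Create an in-memory FTS5 table and load (title, body) rows into it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE books USING fts5(title, body)")
    conn.executemany("INSERT INTO books (title, body) VALUES (?, ?)", rows)
    conn.commit()
    return conn


def search(conn, query, limit=10):
    """Return (title, snippet) pairs for the best matches, best first."""
    cur = conn.execute(
        "SELECT title, snippet(books, 1, '[', ']', '...', 8) "
        "FROM books WHERE books MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return cur.fetchall()
```

Whether this holds up at the scale of the whole archive is a separate question; the sketch only shows the query mechanics.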
Inspired by: "Show HN: Read Wikipedia privately using homomorphic encryption" https://news.ycombinator.com/item?id=31668814
As an example, all the academic paper hubs have been using this technology for decades.
I'd wager that all of the big Gen AI companies have planned to use this exact dataset, and many of them probably have already.
Ha. Ha. ha ha ha.
As someone who has pretty broadly tried to normalize a pile of books and documents I have legitimate access to, no it is not.
You can get good results 80% of the time, usable but messy results 18% of the time, and complete garbage the remaining 2%. More effort seems to only result in marginal improvements.
Edit: my point is that I would like to share my work but that is hard to do in a legal way. That is the main reason I gave up.
A rather obvious question is if someone has trained an LLM on this archive yet.
You can be practically certain that every notable LLM has been trained on it.
But only Meta was careless enough to publicly admit it.
It's a bit smaller than Anna's Archive, as they do host their own collections. From some locations, it's only easy to access through Tor.
site:annas-archive.org avacado
Converting all documents (PDFs, EPUBs, etc.) to clean plaintext.
Indexing at scale efficiently.
Managing potential legal issues.
Z-Library does something similar, but it’s smaller in scope and doesn't integrate AA’s full catalog.
In my experience only Tantivy can index this much data. Check out Lnx.
Every book has a 10- or 13-digit ISBN to identify it. Unless it's some self-pub/amateur-hour situation by some paranoid prepper living in a Faraday cage in Arkansas or Florida, it's likely a publication with a title, an author and an ISBN.
I would assume pedo stuff is not really there, but the Anarchist Cookbook and the like likely will be.
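Since both 10- and 13-digit ISBNs turn up in metadata, a sketch of check-digit validation and ISBN-10 to ISBN-13 conversion may help when using the ISBN as a key. The function names are mine; the check-digit rules are the standard ones.

```python
DIGITS = set("0123456789")


def normalize(isbn: str) -> str:
    """Strip hyphens, spaces and other separators; keep digits and X."""
    return "".join(c for c in isbn.upper() if c in DIGITS or c == "X")


def is_valid_isbn10(s: str) -> bool:
    """ISBN-10: weights 10..1, sum divisible by 11; final 'X' means 10."""
    if len(s) != 10 or not set(s[:9]) <= DIGITS:
        return False
    if s[9] not in DIGITS and s[9] != "X":
        return False
    total = sum((10 - i) * (10 if ch == "X" else int(ch)) for i, ch in enumerate(s))
    return total % 11 == 0


def is_valid_isbn13(s: str) -> bool:
    """ISBN-13: alternating weights 1 and 3, sum divisible by 10."""
    if len(s) != 13 or not set(s) <= DIGITS:
        return False
    total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(s))
    return total % 10 == 0


def isbn10_to_isbn13(s: str) -> str:
    """Prefix 978, drop the old check digit, compute the new one."""
    core = "978" + s[:9]
    total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(core))
    return core + str((10 - total % 10) % 10)
```

For example, `isbn10_to_isbn13(normalize("0-306-40615-2"))` gives `"9780306406157"`. This merges the two spellings of one edition's ISBN; it does nothing for separate editions of the same work.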
Search for "lolicon"
https://news.sky.com/story/anarchist-cookbook-case-student-j...
Germany has been like this as well for a few years.
Not all states are within the US.
"Not saying you're deceiving"
Right next to your answer where you implied I might be deceiving, there was already an answer saying that apparently I was not. So yes, the mocking of my comment was not up to HN standards, but you don't see how you started it?
Those nuances easily get lost in text, I know, but that my post got flagged and his initial one did not, I really did not like, and that angered me a bit. But I can live with that, without making a drama out of it. Thanks for trying to mediate.
Aside from that, if you're searching for specific content, the question is moot I guess.
I guess my confusion is what distinguishes this from any other torrent? That is, if the submitted content is submitted at all.
But thanks for the explanation that there is a report feature built in.
For instance, there are multiple issues of Playboy with underage models. All the archives of British tabloid newspapers had to be purged in 2003 after the laws changed there, etc.
I'm not sure what other forms of information are illegal beyond CP. In the US, bomb making instructions are not illegal. In other dictatorships or zealous religious regimes, information about democracy or works that insult Islam might be illegal.
Description:
Openlib is an open source app to download and read books from a shadow library (Anna’s Archive). The app has a built-in reader for reading books.
As Anna’s Archive doesn't have an API, the app works by sending requests to Anna’s Archive and parsing the responses into objects. The app extracts the mirrors from the responses, downloads the book and stores it in the application's document directory.
Note: The app requires a VPN to function properly. Without a VPN, the app might show the captcha-required page even after the captcha has been completed.
Main Features:
Trending Books
Download And Read Books With In-Built Viewer
Supports Epub And Pdf Formats
Open Books With Your Favourite Ebooks Reader
Filter Books
Sort Books
The project has a similar story to Anna's Archive. There is 0.5 TB of archived books, and the project creates an index of all the books with text, title and author search capabilities, and provides an HTML UI for search and reading. On a weak machine it takes about 2 hours to build that index.
So if you have zipped archives of fb2 files, you can use the project to create a web UI with search for those files, without needing enough space to unpack them all.
You'll have to translate some Russian though to get instructions on how to set it up.
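To illustrate reading fb2 files straight out of a zip without unpacking it to disk, here is a small sketch using the standard library. It is not how the linked project works internally (I have not checked); it only assumes the usual FictionBook 2.0 XML namespace and title location.

```python
import zipfile
import xml.etree.ElementTree as ET

# Standard FictionBook 2.0 namespace.
FB2_NS = {"fb": "http://www.gribuser.ru/xml/fictionbook/2.0"}


def iter_fb2_titles(zip_source):
    """Yield (member_name, book_title) for every .fb2 file in a zip.

    `zip_source` can be a path or a file-like object. Members are parsed
    one at a time from the archive, so nothing is extracted to disk.
    """
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".fb2"):
                continue
            with archive.open(name) as member:
                try:
                    root = ET.parse(member).getroot()
                except ET.ParseError:
                    continue  # skip malformed books instead of aborting
            title = root.findtext(
                "fb:description/fb:title-info/fb:book-title",
                default="",
                namespaces=FB2_NS,
            )
            yield name, title.strip()
```

The same loop could feed the full text of each book into an indexer instead of just pulling the title.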
https://gitlab.com/opennota/fb2index/-/blob/master/README.ru...
That just redirects to https://yandex.com/search
Edit: I have had this exact project as my dream for a couple of years, and even experimented a little bit. But I'm not a programmer, so I can only understand theoretically what would be needed for this to work.
Anybody with the same dream, send me an e-mail to booksearch@fastmail.com and let's see what we can do to get the ball rolling!
Edit: grammar
illegally using Anna's Archive, The Pile, Common Crawl, their own crawl, Books2, LibGen etc., embedding it into high-dimensional space and doing next-token prediction on it.
ggm•1d ago
The thing is AA doesn't hold the texts. They're disputable IPR and even a derived work would be a legal target.
carlosjobim•1d ago
Why would it? Google isn't prosecuted for indexing the web.