Show HN: I scraped 3B Goodreads reviews to train a better recommendation model

133•costco•1d ago

Hi everyone,

For the past couple months I've been working on a website with two main features:

- https://book.sv - put in a list of books and get recommendations on what to read next from a model trained on over a billion reviews

- https://book.sv/intersect - put in a list of books and find the users on Goodreads who have read them all (if you don't want to be included in these results, you can opt-out here: https://book.sv/remove-my-data)

Technical info available here: https://book.sv/how-it-works

Note 1: If you only provide one or two books, the model doesn't have a lot to work with and may include a handful of somewhat unrelated popular books in the results. If you want recommendations based on just one book, click the "Similar" button next to the book after adding it to the input book list on the recommendations page.

Note 2: This is uncommon, but if you get an unexpected non-English titled book in the results, it is probably not a mistake and it very likely has an English edition. The "canonical" edition of a book I use for display is whatever one is the most popular, which is usually the English version, but this is not the case for all books, especially those by famous French or Russian authors.

Comments

thinkcontext•1d ago

I'm impressed! It didn't take many books for it to start suggesting other books that I liked and it showed me several solid choices I'm adding to my queue.

aj_hackman•1h ago

Thank you! Because of this, "The Making of Prince of Persia: Journals 1985–1993" by Jordan Mechner is on its way to my house.

qingcharles•1h ago

You definitely will not regret that purchase. It's a very enjoyable read.

jamesponddotco•1h ago

The recommendations are pretty good; even though I only input six books, it was enough for it to recommend books I have on my wish list. Definitely going to play around some more. Plus, the website is super fast, very impressive.

Any chance we could get an API going at some point? Are you planning to open source the work?

I'm interested in the scraping of Goodreads too. I'm building a book metadata aggregation API and plan on building a scraper for Goodreads, but I imagine using a data center IP address will be a problem very fast. Were you scraping from your home network?

costco•1h ago

Thank you for the compliments :) I used 50-100 datacenter proxies. I just logged requests made by the iOS app with Charles and then recreated the headers to the best of my ability though the server did not seem to be very strict at all. Worth noting though that static residential proxies are not too expensive these days anyways.

Re the API: The model does actually run fairly well on CPU so it probably wouldn't be too expensive to serve. I guess if there is demand for it I could do it. I think most social book sites would probably like to own their recommendation system though.

goatsi•1h ago

Speaking of sustained scraping for AI services, I found a strange file on your site: https://book.sv/robots.txt. Would you be able to explain the intent behind it?

costco•42m ago

I didn't want an agent to get stuck on an infinite loop invoking endpoints that cost GPU resources. Those fears are probably unfounded, so if people really cared I could remove those. /similar is blocked by default because I don't want 500000 "similar books for" pages to pollute the search results for my website but I do not mind if people scrape those pages.
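A minimal sketch of how a well-behaved crawler would interpret a rule like the /similar block described above, using Python's standard `urllib.robotparser`. The rules below are hypothetical and written only for illustration; the real file's contents are not quoted in the thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; not the actual book.sv robots.txt.
HYPOTHETICAL_ROBOTS_TXT = """\
User-agent: *
Disallow: /similar
"""

parser = RobotFileParser()
parser.parse(HYPOTHETICAL_ROBOTS_TXT.splitlines())


def is_allowed(url, agent="*"):
    """Return True if the parsed rules let `agent` fetch `url`."""
    return parser.can_fetch(agent, url)


print(is_allowed("https://book.sv/similar?id=13566692"))
print(is_allowed("https://book.sv/how-it-works"))
```

Note that robots.txt is advisory: it keeps compliant crawlers and search indexers away from those paths but does not technically prevent anyone from requesting them.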

dbl000•47m ago

I would love an API or the dataset if you could share it somehow! Just to play around with my own book lists.

esafak•1h ago

It is interesting that you chose a contextual recommender when you would think book affinity is not very susceptible to context. Did you try other models too?

skerit•1h ago

Please make this for tv series too!

vessenes•1h ago

OK, I just added books until you told me I had too many. Fun idea! I have a couple of suggestions:

* UI - once someone clicks "Add" you really should remove that item from the suggested list - it's very confusing to still see it.

* Beam search / diversification -- Your system threw like 100 books at me of which I'd read 95 and heard of 2 of the other 3, so it worked for me as a predictor of what I'd read, but not so well for discovery.

I'd be interested in recommendations that pushed me into a new area, or gave me a surprising read. This is easier to do if you have a fairly complete list of what someone's read, I know. But off the top of my head, I'm imagining finding my eigenfriends, then finding books that are either controversial (very wide rating differences amongst my fellow readers) or possibly ghettoized, that is, some portion of similar readers also read this X or Y subject, but not all.

Anyway, thanks, this is fun! Hook up a VLM and let people take pictures of their bookshelf next.
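One way to read the "controversial among similar readers" idea is to rank books by the variance of the ratings given by a reader's nearest neighbours. The sketch below assumes those neighbours have already been found; the function name, the `min_ratings` threshold, and the sample ratings are all hypothetical.

```python
from statistics import pvariance


def controversy_scores(ratings_by_book, min_ratings=3):
    """Rank books by rating variance among similar readers, highest first.

    ratings_by_book maps a title to the star ratings given by similar readers.
    Books with fewer than min_ratings ratings are skipped as too noisy.
    """
    scores = {
        book: pvariance(ratings)
        for book, ratings in ratings_by_book.items()
        if len(ratings) >= min_ratings
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical ratings from a reader's nearest neighbours.
neighbour_ratings = {
    "Book A": [5, 1, 5, 1],  # polarising
    "Book B": [4, 4, 4, 4],  # consensus
    "Book C": [5, 1],        # too few ratings to judge
}
ranked = controversy_scores(neighbour_ratings)
```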

comrade1234•1h ago

I gave up on goodreads reviews. I've been burned too many times by highly rated books that weren't that good. If you're into (horny) ya romance fantasy then goodreads is great, but it's not for me. I haven't really found a substitute.

jamesponddotco•1h ago

I'm not into the social aspect, so Goodreads was never an option, but Hardcover[1] seems like a pretty good alternative.

[1]: https://hardcover.app

owenversteeg•1h ago

Any broadly used ratings system is total garbage. Goodreads ratings, Google Maps ratings, Amazon reviews, Vivino for wine, et cetera. Even assuming the reviews are real and genuine, most people just aren’t good at writing reviews, and the handful that are often have wildly different criteria than you. Someone already commented with one enthusiast site - and sure, enthusiast sites are often better than the mainstream option (see also: CellarTracker for wine) but honestly my advice is to get good at determining the quality of the thing yourself. For books there are a ton of hints about what you’ll be getting. “NYT Bestseller”, “xyz book club”, certain publishers, who’s quoted on the back, when was it published, who wrote it? All of those things can help you rapidly identify books. I personally dislike most modern books and prefer the “classics”, so a lot of this is only useful as a negative signal, but even then there are positive signals, for example a reference to a much older book.

HeinzStuckeIt•56m ago

GR is also great if you are into academic nonfiction, Classics, poetry, etc. The site does, after all, let you track and review any publication with an ISBN. What my peers and I use it for is worlds apart from the romance novel or LGBT young-adult book reviewing community that often puts GR in the news, and far away from all the drama that rages around genre fiction.

noir_lord•1h ago

It has a tendency to recommend books in the same series as are input (putting aside that if I like a book in a series I've likely already read the series).

It did suggest Murderbot Diaries (not on the input but a series I have read and did like) and an Adrian Tchaikovsky I hadn't read :).

bananaflag•1h ago

Yeah the hardest problem for recommendation systems is to find non-Star Wars books which are like some specific Star Wars books and unlike some other Star Wars books. I would say it's AGI-complete ;)

noir_lord•18m ago

Ironically that is one of the few uses where I've found an LLM to actually be useful.

ChatGPT does a fairly good job at letting you negate/refine whatever it was you were looking for.

costco•1h ago

It's explicitly trained to predict the next book read in a sequence, which is why you get that behavior. There's probably a better way for me to handle it rather than having 5 books from the same series tend towards the top though.
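To make the series effect concrete, here is a rough sketch of how next-item training pairs can be built from one reader's ordered history. It is not the site's actual pipeline; the function name is hypothetical, and `max_len=64` simply mirrors the 64-book input limit mentioned elsewhere in the thread.

```python
def next_item_pairs(history, max_len=64):
    """Turn one reader's ordered history into (context, target) pairs.

    Each target is the book read immediately after the context, so sequels
    appear as targets for earlier books in the same series very often.
    """
    pairs = []
    for i in range(1, len(history)):
        context = history[max(0, i - max_len):i]  # keep at most max_len books
        pairs.append((context, history[i]))
    return pairs


history = ["Dune", "Dune Messiah", "Children of Dune", "Hyperion"]
pairs = next_item_pairs(history)
```

Two of the three targets here are sequels to a book in the context, which is the kind of signal that would push same-series books to the top.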

noir_lord•29m ago

If you have the data to know the other books in a series maybe split the results so you have "books in series" in one column and "books not in a series mentioned" in the other but other than that it did a better job than Kindle recommendations which are often hilariously off the mark.

walthamstow•1h ago

Works pretty well with cookbooks. Very cool work.

One suggestion would be to make the search less strict on diacritics. Searching for popular cook J. Kenji López Alt was only successful if I entered the correct O.
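A common way to make matching ignore diacritics is to normalize both the query and the indexed text before comparing. This is a generic sketch using Python's `unicodedata`, not a description of how the site's search is configured; the function name is hypothetical.

```python
import unicodedata


def fold_diacritics(text):
    """Lowercase text and strip combining accents (e.g. 'ó' becomes 'o')."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


query = fold_diacritics("Lopez")
indexed = fold_diacritics("J. Kenji López-Alt")
matches = query in indexed
```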

NitpickLawyer•1h ago

Interesting. I tested it with sci-fi, and it definitely recommends good books, but not sure how accurate it is at surfacing the sub genres / themes. For example for [aurora -ksr, seveneves, project hail mary, ender's game] it gave me dune. Which is a great book, but not in the "first-ish contact" style I hoped it would be.

Another thing I noticed is that it tends to recommend 2nd and 3rd books in a series, which is a bit so-so. If I add the first book in a series, I probably already read the whole series...

28304283409234•1h ago

Came here to say this (recommending book 2 and 3 in a trilogy). Great app otherwise!

qingcharles•1h ago

I put in a bunch of books and hit recommendations and... I'd already read 95% of them, so at least we know it works well! (checking out the other 5% now)

p.s. one idea: when you click [Add] on the recommended books list, it should remove it from that list

p.p.s. if there is a way to filter out the spam "Summary of ____" books, that would be good too
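A first pass at the "Summary of ____" filter could be a simple title pattern. This sketch assumes the spam titles really do start with "Summary of"; the pattern and function name are hypothetical, and a real filter would likely need more signals, such as author or page count.

```python
import re

# Assumption: spam summary editions have titles beginning with "Summary of".
SUMMARY_PATTERN = re.compile(r"^\s*summary\s+of\b", re.IGNORECASE)


def drop_summaries(titles):
    """Return titles that do not look like 'Summary of ...' editions."""
    return [title for title in titles if not SUMMARY_PATTERN.match(title)]


titles = ["Summary of Sapiens", "Sapiens", "A Summary View", "SUMMARY OF Dune"]
kept = drop_summaries(titles)
```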

jacquesm•43m ago

I have a hard time remembering titles of books I've read if they are not directly related to the subject matter. No problem remembering the content though. With movies I remember both.

yoz-y•1h ago

It works pretty well in the sense that after inputting only a few quite diverse books it gave me recommendations for a lot of books that I’ve already also read and enjoyed.

I would also really like a possibility to add negative signal. It did also recommend books that seemed interesting to me but I ultimately didn’t like.

Overall quite impressive.

momocowcow•1h ago

Whatever I put in, it wants me to read Sapiens :_(

skayvr•1h ago

I've worked in recommender systems for a while, and it's great to see them publicized.

SASRec was released in 2018 just after the transformer paper, and uses the same attention mechanism but different losses than LLMs. Any plans to upgrade to other item/user prediction models?

costco•1h ago

I'm not an expert by any means but as far as sequential recommendations go, aren't SASRec and its derivatives pretty much the name of the game? I probably should have looked into HSTUs more. Also this / sparse transformers in general: https://arxiv.org/pdf/2212.04120

bigskydog•1h ago

Recommend OneRec which is an improvement of HSTU and it recently became open source

skayvr•55m ago

There's a few alternatives, but SASRec is a good baseline for next-item recommendation. I'd look at BERT4Rec too. HSTU is definitely a strong step forward, but stays in the domain of ID models. HSTU also seems to rely heavily on some extra item information that SASRec does not (timestamps).

Other models include Google's TIGER model which uses a VAE to encode more information about items. Similar to how modern text-to-voice operates.

costco•10m ago

Thank you for the recommendations. I didn't try BERT4Rec because I assumed it would perform the same as or worse than what I already had after having read https://dl.acm.org/doi/pdf/10.1145/3699521. The TIGER paper seems interesting - I definitely want to explore semantic IDs in general and also because I think it could allow including more long-tail items.

varenc•1h ago

I love this site, and the approach! Great seeing someone making good use of Goodreads data.

Sadly my experience with the book recommender isn't too great because of the 64 book limit. If I import either the most recent or least recent 64 books, 95% of the books it recommends to me are books I've read. Though it was helpful for spotting a few books I've read that I didn't log on Goodreads. Guess I'm pretty consistent.

costco•1h ago

I think I will expand the input books limit (sadly requires retraining) and/or the output books limit of 30.

nsypteras•1h ago

I'm impressed it recommended so many books I've already read and liked! I have a big reading backlog but once it's whittled down I will likely come back to this. One feature request would be to also show a "why this is recommended" for each recommendation so I can further narrow down the list for what I'm looking for.

mcbrit•1h ago

I don't know. I entered, trying to be popular but at least slightly? opinionated:

Tigana, Hyperion, A Fire Upon the Deep, Blindsight, Moby Dick

and I got a list. Sure, read all that or wasn't interested for reasons, I added (only Neuromancer on initial recommendations):

Neuromancer, VALIS, Quantum Thief, Towing Jehovah.

List did not get more interesting.

Book recommendations are still kind of difficult.

mcbrit•1h ago

If I provide that list, a (real) person doesn't ask me if I've read the Hobbit.

teaearlgraycold•1h ago

I don’t think past liked books are nearly enough information to provide a good book for you today. You need a lot more information about the state of someone’s mind.

mcbrit•1h ago

You're talking to a dude. (in my case.) I mentioned 8 books.

I won't tell you exactly what to do, but one way to do it is to measure your surprise with me choosing each of those 8 books when you provide a recommendation back to me of what I should read next. I think I get kind of that experience talking to someone about books.

The algorithm didn't do that.
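One possible reading of "measure your surprise" is information-theoretic surprisal: a book almost everyone has shelved says little about a reader, while a rare pick says a lot. The formula is standard, but applying it here is an interpretation of the comment, and the shelf shares below are hypothetical.

```python
import math


def surprisal_bits(probability):
    """Information content, in bits, of an event with the given probability."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


# Hypothetical share of readers who have shelved each book.
shelf_share = {"The Hobbit": 0.5, "Towing Jehovah": 0.001}

# Rarer picks carry more bits, so they sort first.
informative_first = sorted(
    shelf_share, key=lambda book: surprisal_bits(shelf_share[book]), reverse=True
)
```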

teaearlgraycold•26m ago

Talking to someone about books gives you so much more information than a book list. Their expressions, their accent, their energy level, their clothes, and many other things help to provide supplemental information.

submeta•1h ago

Like the idea! Wondering: Weren’t the early LLMs trained on data in Goodreads as well? I can upload and ask ChatGPT as well, and it will give me similar recommendations, no?

djoldman•1h ago

Can you share the details about the Meilisearch instance? How big is the box and database size?

costco•58m ago

Everything (namely Meilisearch, Postgres and the web server in Go) besides the model inference is running on a Hetzner server with a large SSD and an "AMD Ryzen 7 3700X 8-Core Processor." The data.ms directory is about 40GB. Once the HN traffic dies down I will probably move the model back to the Hetzner server so I don't have to pay $0.15/hour for an A4000.

__alexander•1h ago

Care to share the scraped data? I would love to play around with it.

demaga•1h ago

I am not sure about the legal side of things here, but a Kaggle dataset would be really cool.

costco•1h ago

Not sure if I can. At the very least book descriptions most likely could not be distributed. There is an academic dataset with around 200M reviews though: https://cseweb.ucsd.edu/~jmcauley/datasets/goodreads.html

guelo•52m ago

I'm surprised he got that much data. Goodreads uses several tricks to try to stop scrapers, for example pagination only works up to a few pages.

jacquesm•44m ago

They might send him a bill for use of resources.

MattGrommes•1h ago

This is cool but I'd love the option to filter out the author of the book you entered. I put in Shroud by Adrian Tchaikovsky and almost all the books are others by him, which is fine but doesn't really mix up the stuff I'm reading.

nwhnwh•59m ago

I entered "Alone Together: Why We Expect More from Technology and Less from Each Other" and I received books about Steve Jobs, Harry Potter and "The Subtle Art of Not Giving a F*ck". Like how???

costco•55m ago

If you want recommendations solely based on one book, please try the similar page: https://book.sv/similar?id=13566692

These seem to fit the description you are going for better. The model is trained to predict the next book in the sequence. Those other books you listed happen to be very popular, so in the absence of information about you (only having 1 book), the model will tend to recommend those.

BeetleB•52m ago

> Provide 3+ books for best results.

jauntywundrkind•54m ago

Where do nice scrapes like this end up? Are there BitTorrents out there for scrapes like this?

Honestly this would finally be the web2.0 we all wanted & hoped for. It's against majesty that it's all captured owned user content that is legally captured by essentially public message boards/sites.

jimmoores•53m ago

I unexpectedly liked this. I thought the recommendations were actually useful.

parkersweb•7m ago

I sadly didn’t share that experience - I fed it my goodreads most recent - but it largely picked up on 2 or 3 series I’ve been slowly working my way through so that most of the recommendation list was ALL the other books in the series (and the spin-off series) so I didn’t really get anything useful…

dbl000•50m ago

Echoing what everyone else has said here - awesome site, love how fast it was.

I did notice that when I put in a single book in a series (in my case Going Postal, Discworld #33) that tended to dominate the rest of the selection. That does make sense, but I don't want recommendations for a series I'm already well into.

Also noticed that a few books (Spycraft by Nadine Akkerman and Pete Langman, Tribalism is Dumb by Andrew Heaton) that I know are in goodreads and reviewed didn't show up in the search. I tried both the authors' names and the titles of the books. Maybe they aren't in the dataset.

It did stumble with some more niche books (The Complete Yes Minister). Trying the "Similar" button gave me more books that were _technically_ similar because they were novelizations of British comedy shows, but not what I was looking for.

For more common books though it lined up very well with books already on my wishlist!

costco•28m ago

Yes, I would say the handling of series is probably the biggest problem. Once my test metrics got to a point I was happy with and my quality spot checks passed (can I follow the model's recommendations from one generic history book to Steven Runciman, making sure popular books don't always dominate the results), I was ready to release because I had been working on this project for so long. The solution is probably using the transformer model to generate 100-200 candidates and then having a reranker on top.
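As a rough illustration of the candidates-then-rerank idea, the sketch below applies a simple rule-based second stage that caps how many books per series survive. It is a heuristic stand-in, not a learned reranker; the series labels, scores, and function name are hypothetical, and `top_k=30` mirrors the output limit mentioned in the thread.

```python
def rerank(candidates, max_per_series=1, top_k=30):
    """Keep the best-scoring candidates while capping books per series.

    candidates: iterable of (title, series_or_None, score) in any order.
    Standalone books (series is None) are never capped.
    """
    kept = []
    per_series = {}
    ordered = sorted(candidates, key=lambda c: c[2], reverse=True)
    for title, series, _score in ordered:
        if series is not None:
            if per_series.get(series, 0) >= max_per_series:
                continue  # this series already has its quota
            per_series[series] = per_series.get(series, 0) + 1
        kept.append(title)
        if len(kept) == top_k:
            break
    return kept


# Hypothetical first-stage output from the sequence model.
candidates = [
    ("Making Money", "Discworld", 0.80),
    ("Going Postal", "Discworld", 0.90),
    ("Good Omens", None, 0.70),
    ("Raising Steam", "Discworld", 0.60),
]
reranked = rerank(candidates)
```

Setting `max_per_series=0` would drop series books entirely, which is closer to what several commenters asked for when they are already deep into a series.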

xkbarkar•42m ago

Have nothing to add that hasn’t already been commented. Like the entries in the add list stay. Other than that, my recommendation list keeps coming up with books I have already read and loved and I am hitting the limit :(.

So filtering would be great.

I have seen a few versions of the same books listed more than once.

Loved this. Hope you get to tune it a little.

Also, thank you for not ruining the site with a single popup, email subscription list offer, chatbot, wheelspin from hell anywhere.

Blessings from the popup hating part of the interwebs.

_virtu•39m ago

Hey OP I’m building a bookclub app. Do you happen to have an api I could plug into? I’d love to add this to our member suggestions section.

androng•27m ago

I tried to import my book list with "Import goodreads" button and inputting https://www.goodreads.com/user/show/68515148-andrew but it said "import failed, see console"

costco•25m ago

Worked for me, could be due to server being overwhelmed

Here is the URL with your books: https://book.sv/#52752877,46049530,18437030,52480873,3260654...

blehn•21m ago

You should filter out authors from the input books in the output. If I liked a book by an author, surely I'd read more of their work if I wanted to — recommending them isn't helpful. Along the same lines, I think interesting recommendations tend to be the ones that (1) I like and (2) I didn't expect. The more similar the recommendations are to the input, the more likely I already know them, and the more likely to create a recommendation echo chamber.
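The author filter suggested here could be a small post-processing step on the model's output. A minimal sketch, assuming each recommendation carries an author name; the data shape and function name are hypothetical.

```python
def exclude_input_authors(recommendations, input_authors):
    """Drop recommendations written by any author of the input books.

    recommendations: list of (title, author) tuples.
    Author names are compared case-insensitively.
    """
    blocked = {author.casefold() for author in input_authors}
    return [
        (title, author)
        for title, author in recommendations
        if author.casefold() not in blocked
    ]


recommendations = [
    ("Children of Time", "Adrian Tchaikovsky"),
    ("All Systems Red", "Martha Wells"),
    ("Elder Race", "adrian tchaikovsky"),
]
filtered = exclude_input_authors(recommendations, ["Adrian Tchaikovsky"])
```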

sodality2•16m ago

This is fantastic!!! I've added many results to my want-to-read list, they're very on-point from very few inputs. It would be really cool to import from a user ID, where you can choose some subset of your read list to inspire new suggestions, while excluding all books in your want-to-read and already-read lists. But that's an ongoing scrape to maintain, it's a cat and mouse game you probably don't want to start. I wonder what the legal status of scraped training data is... if you don't reproduce any of the review data I presume you're fine?

costco•8m ago

You can import the first or last 64 books of your read, to-read, or currently-reading shelves if you press the "Import Goodreads" button and provide your Goodreads ID.
