*The problem*
We got into specialty coffee gradually. Whenever we tried something we liked — a washed Colombian, a natural Ethiopian — we'd save the bag. At some point we had drawers of empty coffee bags we couldn't bring ourselves to throw away.
Our flow was simple: go to a cafe we liked, drink coffee, buy a bag of whatever they were roasting or stocking. Over time we started noticing patterns — we kept reaching for naturals, for East African origins, for anything with fruity notes. We'd try to seek out similar beans next time. Occasionally we'd fall in love with something and start reordering it online.
But discovery was limited to the few roasters we already knew. There was no easy way to find out that a roaster across town — or in another country — had something we were going to love. We knew great coffee existed out there. We just had no map.
So we built one. RoastDB currently indexes 3,800+ beans from 420+ roasters, and the index grows every week. Search by origin, process, variety, or tasting notes. Save beans you want to try. When you find something, you buy directly from the roaster; we're a discovery engine, not a store.
*How it works*
The hardest part isn't the scraping — it's finding roasters worth indexing. We spend a lot of time hunting for quality third-wave roasters: browsing coffee forums, following competition results, exploring roasters in new cities. The selection is the real work.
Once we've found a roaster, the pipeline runs on a €5/month Hetzner VPS:
1. Scrapers fetch product pages from roaster websites
2. LLMs extract structured data (origin, variety, processing, price, tasting notes)
3. Normalization cleans up inconsistencies ("Äthiopien" → "Ethiopia", "84,25" → 84.25; see the sketch after this list)
4. Non-English descriptions get translated
5. Deduplication scores beans and merges duplicates
6. Human review via an admin dashboard before publishing
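For concreteness, here's a minimal sketch of the step-3 normalization. The function names and alias table are illustrative, not our actual pipeline code, and real price strings need more care than this (currency symbols, thousands separators):

    // A sketch of the step-3 cleanup (illustrative names, not our real code).
    const ORIGIN_ALIASES: Record<string, string> = {
      "Äthiopien": "Ethiopia", // the example from the list above
      "Colombie": "Colombia",  // hypothetical extra entry
    };

    function normalizeOrigin(raw: string): string {
      const trimmed = raw.trim();
      return ORIGIN_ALIASES[trimmed] ?? trimmed;
    }

    // "84,25" (comma decimal) and "84.25" both become 84.25.
    function normalizePrice(raw: string): number | null {
      const cleaned = raw.replace(/[^\d.,]/g, "").replace(",", ".");
      const value = Number.parseFloat(cleaned);
      return Number.isFinite(value) ? value : null;
    }

    console.log(normalizeOrigin("Äthiopien")); // "Ethiopia"
    console.log(normalizePrice("84,25"));      // 84.25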
The scrapers rerun weekly with content hashing: we only re-extract pages whose content actually changed, which keeps the data fresh without running up API costs.
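A minimal sketch of that idea, assuming Node's built-in fetch and crypto (in the real pipeline the hashes persist between weekly runs rather than living in a Map):

    import { createHash } from "node:crypto";

    // The Map stands in for wherever the hashes persist between runs.
    const lastHash = new Map<string, string>();

    async function fetchIfChanged(url: string): Promise<string | null> {
      const html = await (await fetch(url)).text();
      const hash = createHash("sha256").update(html).digest("hex");
      if (lastHash.get(url) === hash) return null; // unchanged: skip the LLM call
      lastHash.set(url, hash);
      return html; // new or changed: send on to extraction
    }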
We built an internal tool that gamifies the review process, making it easier to keep up with new beans. And we control the whole pipeline through a Telegram bot — kick off scrapes, approve costs, get notified of failures, all from our phones.
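The notification half needs nothing more than the Bot API's sendMessage method over HTTP. A minimal sketch, with assumed env var names and a placeholder runScrape:

    // Assumed env var names; sendMessage is the standard Bot API method.
    async function notify(text: string): Promise<void> {
      const token = process.env.TELEGRAM_BOT_TOKEN;
      const chatId = process.env.TELEGRAM_CHAT_ID;
      await fetch(`https://api.telegram.org/bot${token}/sendMessage`, {
        method: "POST",
        headers: { "content-type": "application/json" },
        body: JSON.stringify({ chat_id: chatId, text }),
      });
    }

    // Usage (runScrape is a placeholder for a real pipeline entry point):
    // runScrape(roaster).catch((err) => notify(`Scrape failed: ${err}`));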
The web app is Next.js + SQLite. The database file is ~15MB and is read straight from disk; no database server, no complexity.
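For the curious, a read path under that setup can be as small as this, sketched with better-sqlite3 in an App Router route handler (the table and columns are placeholders, not our real schema):

    // app/api/beans/route.ts (table/column names are placeholders)
    import Database from "better-sqlite3";

    // Opened once per server process, read-only: the pipeline writes the
    // file elsewhere and we only ever read it here.
    const db = new Database("data/roastdb.sqlite", { readonly: true });

    export function GET(request: Request): Response {
      const origin = new URL(request.url).searchParams.get("origin");
      const rows = origin
        ? db.prepare("SELECT name, roaster, process FROM beans WHERE origin = ?").all(origin)
        : db.prepare("SELECT name, roaster, process FROM beans LIMIT 50").all();
      return Response.json(rows);
    }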
*Feedback welcome*
- Roasters we should add (especially outside Europe)
- Filter combinations that would be useful
- Anything broken or confusing