One’s for browsing HN at work, the other’s for home, and the third one has a username I'm not too fond of.
I’ll stick to this one :) I might have some karma on the older ones, but honestly, HN is just as fun from anywhere.
I intended this to be an easy on-ramp for folks who want to get a feel for how FTS engines work under the hood :)
I appreciate the technical depth of the readme, but I’m not sure it fits your easy on-ramp framing.
Keep going and keep sharing.
I see you are using a positional index rather than doing bi-word matching to support positional queries.
Positional indexes can be a lot larger than non-positional ones. What is the ratio of the size of all documents to the size of the positional inverted index?
For example the phrase query "United States of America" doesn't occur in the document "The United States is named after states of the North American continent. The capital of America is Washington DC". But "United States", "states of" and "of America" all appear in it.
There's a tradeoff because we still have to fetch the full document text (or some positional structure) for the filtered-down candidate documents containing all of the bi-word pairs, so it requires a second stage of disk I/O. But as I understand it, most practitioners assume you can get away with fewer IOPS than with a positional index, since that info only has to be fetched for a much smaller filtered-down candidate set rather than for the whole posting list.
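To make the two stages concrete, here is a rough Python sketch of what I mean. It is only an illustration under my own assumptions (a naive lowercase tokenizer, in-memory dicts instead of on-disk postings, and made-up helper names like `biword_candidates` and `verify_phrase`), not how the OP's engine works. Document 2 is a hypothetical extra document added so that there is one true match.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Naive tokenizer (assumption): lowercase, split on non-alphanumerics.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_biword_index(docs):
    # Map each adjacent word pair to the set of doc ids that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        for pair in zip(tokens, tokens[1:]):
            index[pair].add(doc_id)
    return index


def biword_candidates(index, phrase):
    # Stage 1: docs containing every bi-word of the phrase.
    # This can include false positives, since the pairs need not be adjacent.
    tokens = tokenize(phrase)
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        return set()  # single-word queries are out of scope for this sketch
    result = set(index.get(pairs[0], set()))
    for pair in pairs[1:]:
        result &= index.get(pair, set())
    return result


def verify_phrase(text, phrase):
    # Stage 2: re-read the candidate document and check for a contiguous match.
    doc, query = tokenize(text), tokenize(phrase)
    return any(
        doc[i:i + len(query)] == query
        for i in range(len(doc) - len(query) + 1)
    )


docs = {
    1: "The United States is named after states of the North American "
       "continent. The capital of America is Washington DC",
    2: "The United States of America is a country",  # hypothetical true match
}
phrase = "United States of America"
index = build_biword_index(docs)
candidates = biword_candidates(index, phrase)
matches = {d for d in candidates if verify_phrase(docs[d], phrase)}
print(candidates, matches)
```

Stage 1 lets document 1 through because all three bi-words appear somewhere in it; only stage 2 rejects it. In a real engine stage 2 is the extra disk read, but it only runs over the candidate set.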
But that's why I was curious about the storage ratio of your positional index.
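For a back-of-the-envelope version of that ratio, something like the following Python sketch would do. It counts index entries rather than bytes on disk, so it ignores compression and encoding entirely, and the tokenizer and the `index_sizes` helper are my own assumptions rather than anything from the project.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Naive tokenizer (assumption): lowercase, split on non-alphanumerics.
    return re.findall(r"[a-z0-9]+", text.lower())


def index_sizes(docs):
    # Returns (raw document bytes, non-positional entries, positional entries).
    postings = defaultdict(set)                          # term -> doc ids
    positions = defaultdict(lambda: defaultdict(list))   # term -> doc id -> offsets
    total_bytes = 0
    for doc_id, text in docs.items():
        total_bytes += len(text.encode("utf-8"))
        for pos, term in enumerate(tokenize(text)):
            postings[term].add(doc_id)
            positions[term][doc_id].append(pos)
    # One entry per distinct (term, doc) pair.
    doc_entries = sum(len(ids) for ids in postings.values())
    # One entry per token occurrence.
    position_entries = sum(
        len(offsets)
        for per_doc in positions.values()
        for offsets in per_doc.values()
    )
    return total_bytes, doc_entries, position_entries


sample = {1: "a b a", 2: "b c"}
total_bytes, doc_entries, position_entries = index_sizes(sample)
print(total_bytes, doc_entries, position_entries)
```

The positional count grows with every token occurrence, while the non-positional count grows only with distinct terms per document, which is where the size gap comes from.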
When you think OP vibe-coded the project but can’t prove it yet
If you're interested in the idea of writing a database, I recommend you check out https://github.com/thomasjungblut/go-sstables which includes sstables, a skiplist, a recordio format and other database building blocks like a write-ahead log.
Also https://github.com/BurntSushi/fst, which has a great blog post explaining its compression (and has been ported to Go). It is really helpful for autocomplete/typeahead when recommending searches to users or doing spelling correction for search inputs.
Provides a more advanced collection of components to build your own database.
kdawkins•2h ago
What was the motivation to kick this project off? Was it for learning, or are you using it somehow?
novocayn•2h ago
It ended up being a clean, reusable component, so I decided to carve it out into a standalone project.
The README is mostly notes from my Notion pages, glad you found it interesting!
n_u•1h ago
novocayn•1h ago