My LSM tree was slower than a B-tree. Then I profiled it

https://aasheesh.vercel.app/blog/lsm-tree
15•aasheeshrathour•3d ago

Comments

jmalicki•1h ago
Writing to disk for every write is required, otherwise you're not durable.

Sure it's faster to never write to disk, then you reboot and you've lost data.

/dev/null is a webscale database that is even faster!

FarmerPotato•52m ago
read the whole article. WAL is the transaction log and the author tested correctness after a crash.
jmalicki•47m ago
"Every batch of writes called file.Write on the write-ahead log"

You don't write to the WAL on a batch.

> the author tested correctness after a crash.

You mean the LLM?

bawolff•34m ago
They tested SIGKILLing the process, they didn't test a power loss situation.
teraflop•31m ago
Well, the thing about reliability is that you can't really guarantee it by testing one particular scenario.

It seems to me that neither the old nor the new version of the code is really "durable" as I would understand the word. The old version made a write syscall per batch, but the article doesn't say whether it also did an fsync per batch. The new version writes data to an mmap'ed file, and calls fsync in the background.

So both versions are "durable" in the sense that written data is preserved even if the process gets killed, because it's in the OS page cache. But in both versions, a write can be completed before the data actually makes it to disk, so a power failure will lose acknowledged writes.
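The difference between the two guarantees can be sketched in a few lines. This is a minimal illustration in Python, not the author's Go code; `append_record` and its `durable` flag are hypothetical names. Without the `os.fsync` call, the bytes are only in the OS page cache when the function returns, which survives a killed process but not a power failure.

```python
import os


def append_record(path, payload, durable):
    """Append payload to a log file; optionally force it to stable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        view = memoryview(payload)
        while len(view) > 0:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        if durable:
            # Ask the OS to flush this file's data to the storage device
            # before the write is acknowledged to the caller.
            os.fsync(fd)
    finally:
        os.close(fd)
```

On many POSIX filesystems, a newly created file would also need an fsync on its parent directory to be fully durable; that step is left out of this sketch.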

Retr0id•31m ago
There are a lot of use cases where you only truly need consistency, and durability can take a back seat. RocksDB for example does not fsync its WAL writes in the default configuration.

https://github.com/facebook/rocksdb/wiki/WAL-Performance#non...

jmalicki•18m ago
If you can't at least guarantee write ordering you don't even have consistency.

Fsync is often used when the data doesn't truly need to be on disk, because there aren't very good write ordering APIs exposed, even if that's all you truly need.

dj_axl•49m ago
> A 100-bit bloom filter holding 100,000 keys is saturated instantly. Every bit is set. It returns “maybe present” for every key you ask about — which means it filters nothing, and every read falls through to a full file scan.

Hahaha. (Seems like the bloom filter library isn't set for maximum false positive rate and/or to autoexpand.)

Edit: Actually there's a BloomFalsePositive setting, maybe it never gets used? Also maybe it's not a library and it's a custom implementation.

FarmerPotato•40m ago
I guess you've never made a silly mistake, found it, and admitted it.

The author wrote this as a learning exercise. And is sharing the process.

AtlasBarfed•46m ago
Your writes should go into a queue and get compacted later on?

That's what Cassandra does iirc

Retr0id•45m ago
> A 100-bit bloom filter holding 100,000 keys is saturated instantly

> This is the kind of bug you only find by building the thing and measuring it.

No? I mean, maybe if you're vibecoding it's the only way, but in the prehistoric days you could reason about what code would do before you ran it.

bawolff•36m ago
Mistakes are always easy to recognize in retrospect, so hopefully this comment isn't too unfair, but one thing that caught me about this is that logically it makes no sense. You would never use a bloom filter for just 10 entries. If you have only 10 entries, it is almost certainly faster to skip the bloom filter. So I feel like that is the part that should have instantly stood out.

[Obviously, I've made my own silly mistakes over the years, many much sillier than this; it's just weird to describe this one as only detectable by profiling.]

FarmerPotato•27m ago
Sure, it logically makes no sense. But while learning a new subject, have you never made a silly mistake like:

  bool getSchemaSizes(size_t *expectedBatchSize, size_t *expectedEntriesPerBlock) { ... }

  size_t expectedEntriesPerBlock, expectedBatchSize;
  /* arguments passed in the wrong order */
  getSchemaSizes(&expectedEntriesPerBlock, &expectedBatchSize);
  initBloomFilter(expectedEntriesPerBlock);

bawolff•25m ago
I said as much in my comment.
FarmerPotato•34m ago
Do you think the author is somehow capable of writing the entire codebase, but not able to reason about code???

I'm sure you've never made a silly mistake where you passed the wrong integer parameter to a function, stared at your screen, and failed to notice it. Or, forgot the order of arguments to calloc().

If you're saying that profiling is for those too lazy to reason about their code, you're distorting the whole lesson: profiling is more powerful than guessing.

teraflop•34m ago
The article doesn't link to it but this appears to be the repo in question: https://github.com/AasheeshLikePanner/lsm-tree-go

I'm very amused by this obviously AI-generated "benchmark program": https://github.com/AasheeshLikePanner/lsm-tree-go/blob/main/...

sheepcow•24m ago
> A few weeks ago I wanted to understand how the storage engine inside RocksDB actually works. Not read about it. Build it.

Immediate tell that this was written by AI. Another thing I've noticed lately - AI's overuse of "every":

> Every batch of writes called `file.Write` on the write-ahead log.

> Every read was scanning entire SSTable files.

> Every bit is set.

> Every value matches.

FarmerPotato•13m ago
Well, this is sad.
achierius•32m ago
No, that's not the point. This isn't a situation where you need to "guess"; bloom filters should be sized according to their capacity. This is akin to having a fixed 10-arg buffer for your program, getting a crash when someone passes 11, and saying "this is the kind of bug you only find by building the thing and measuring it". Yeah it happens and we all make silly mistakes, but it's just not true that this couldn't have been foreseen.
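As a rough sketch of what sizing by capacity looks like, the textbook formulas give the bit count and hash count from the expected number of keys and a target false positive rate. The function name `bloom_parameters` is hypothetical, and the 1% target is an assumption; the thread does not say what value the project's `BloomFalsePositive` setting holds.

```python
import math


def bloom_parameters(expected_keys, target_fp_rate):
    """Return (bits, hash_count) from the standard Bloom filter sizing formulas.

    bits       = -n * ln(p) / (ln 2)^2
    hash_count = (bits / n) * ln 2
    """
    bits = math.ceil(-expected_keys * math.log(target_fp_rate) / (math.log(2) ** 2))
    hash_count = max(1, round(bits / expected_keys * math.log(2)))
    return bits, hash_count


# 100,000 keys comes from the quoted article; 0.01 is an assumed target rate.
bits, hashes = bloom_parameters(100_000, 0.01)
print(f"{bits} bits ({bits / 100_000:.1f} per key), {hashes} hash functions")
```

Under these assumptions the filter needs on the order of a million bits, roughly four orders of magnitude more than the 100 bits described in the article.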
shermantanktop•28m ago
I'm called in to consult on a performance problem on a scaled service. Team was load testing their code and seeing low throughput:

Me: so you have an in-memory cache, right?

Them: yes!

Me: what is the TTL?

Them: Oh, it's not set, oops. Here, let's set it to 1 minute. Hey look, the performance went way up!

Me: okay, great. When you say 1 minute, do you mean 60 seconds?

Them: uh...wait...uh....oh, the unit is seconds. Wait, why is the performance so good with a 1 second TTL?

Me: What's your load test?

Them: We crank 1M TPS fetching the same 30 items over and over.

Me: ....

I totally agree about the power of profiling but profiling without understanding would not have helped this team.
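The arithmetic behind that result is short. A minimal sketch, assuming a single cache, evenly spread load, and exactly one miss per key each time its TTL expires (the function name is hypothetical):

```python
def steady_state_hit_rate(requests_per_second, distinct_keys, ttl_seconds):
    """Approximate hit rate when each key misses once per TTL window."""
    requests_per_window = requests_per_second * ttl_seconds
    # At most one miss per key per window, and never more misses than requests.
    misses = min(distinct_keys, requests_per_window)
    return 1.0 - misses / requests_per_window


one_second = steady_state_hit_rate(1_000_000, 30, 1)
one_minute = steady_state_hit_rate(1_000_000, 30, 60)
print(f"TTL 1s: {one_second:.5%}  TTL 60s: {one_minute:.5%}")
```

With only 30 distinct keys, both TTLs leave the hit rate above 99.99%, so this load test cannot tell them apart.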

FarmerPotato•23m ago
So the author is doing a self-learning exercise about profiling pre-production code, and you're disagreeing with them by comparing it to a commercial contract. I'm sure you've never, ever made a dumb mistake while getting paid.
FarmerPotato•24m ago
Cool! My first downvote!
Retr0id•24m ago
I make all sorts of silly mistakes, but I'd rarely say that running the code is the only way to detect issues.

I also don't think the author wrote much of their codebase, or much of their blog post, but that's the brave new world we're living in.

ignoreusernames•31m ago
Yeah, especially a bloomfilter which has a pretty easy formula for its false positive rate.
jasonwatkinspdx•24m ago
A lot of people know the basic rule of thumb that a byte per element gives you a bit more than a 1% false positive rate.

But even just thinking about it for half a second from a balls and bins perspective, 100k items into 100 binary bins is obviously gonna saturate.
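Both points can be checked with the standard approximations. This is a sketch, not the project's code: the hash counts (1 for the undersized filter, 6 for the byte-per-key case) are assumptions, since the thread does not state how many hash functions the implementation uses.

```python
import math


def expected_fill(m_bits, n_keys, k_hashes):
    """Expected fraction of bits set after inserting n keys (balls into bins)."""
    return 1.0 - (1.0 - 1.0 / m_bits) ** (k_hashes * n_keys)


def false_positive_rate(m_bits, n_keys, k_hashes):
    """Textbook approximation: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-k_hashes * n_keys / m_bits)) ** k_hashes


n = 100_000
tiny_fill = expected_fill(100, n, 1)               # the 100-bit filter
tiny_fp = false_positive_rate(100, n, 1)
byte_per_key_fp = false_positive_rate(8 * n, n, 6)  # 8 bits per key

print(f"100 bits: fill={tiny_fill:.6f}, false positives={tiny_fp:.6f}")
print(f"8 bits/key: false positives={byte_per_key_fp:.4f}")
```

Under these assumptions the 100-bit filter is effectively all ones and answers "maybe present" for every lookup, while a byte per key brings the false positive rate down to the low single-digit percent range.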

tensegrist•12m ago
i don't know why you're trying to analyze the meaningfulness of sentences that are not the results of a human thought process but are clearly rhetorical flourishes from an llm that "feels" compelled to fill its prose with them
Retr0id•9m ago
Comments that explicitly call out an article as slop tend to get downvoted (or disagreed with), it's best to guide the reader towards their own conclusions.
paulb73•5m ago
Isn't this what unit tests are for?
