
Bloom filters are good for search that does not scale

https://notpeerreviewed.com/blog/bloom-filters/
208•birdculture•3mo ago

Comments

sanskarix•3mo ago
the beautiful thing about bloom filters is they let you say "definitely not here" without checking everything. that asymmetry is weirdly powerful for specific problems.

I've seen them save startups real money in caching layers - checking "did we already process this event" before hitting the database. false positives are fine because you just check the database anyway, but true negatives save you thousands of queries.

the trick is recognizing when false positives don't hurt you. most engineers learn about them in theory but never find that practical use case where they're actually the right tool. same with skip lists and a bunch of other algorithms - brilliant for 2% of problems, overkill for the rest.
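
That dedup pattern is small enough to sketch end to end. A minimal illustrative version in Python (the double-hashing trick, the sizes, and check_database are all my own stand-ins, not anything from the thread):

    import hashlib
    import struct

    class BloomFilter:
        """Minimal bloom filter: no false negatives, tunable false positives."""

        def __init__(self, m_bits: int, k_hashes: int):
            self.m, self.k = m_bits, k_hashes
            self.bits = bytearray((m_bits + 7) // 8)

        def _positions(self, item: str):
            # k probe positions derived from one digest (double hashing)
            h1, h2 = struct.unpack_from("<QQ", hashlib.sha256(item.encode()).digest())
            return ((h1 + i * h2) % self.m for i in range(self.k))

        def add(self, item: str) -> None:
            for p in self._positions(item):
                self.bits[p // 8] |= 1 << (p % 8)

        def might_contain(self, item: str) -> bool:
            return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

    def check_database(event_id: str) -> bool:
        return False  # hypothetical stand-in for the real (expensive) lookup

    seen = BloomFilter(m_bits=8 * 1024 * 1024, k_hashes=4)

    def already_processed(event_id: str) -> bool:
        if not seen.might_contain(event_id):
            return False  # "definitely not here": the database is skipped
        return check_database(event_id)  # possible false positive: verify for real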

adamzwasserman•3mo ago
Exactly. A 1.2% false positive rate means unnecessary reads 1.2% of the time vs 100% without the filter. Even at 10% FP rate, you skip 90% of I/O.

This asymmetry works great for I/O-bound workloads (skip-indexes) but fails for TFA's approach where every document needs its own filter.

In practice, you combine both: inverted index for the dictionary (amortizes across documents), then bloom filters per chunk of the index (amortizes across chunks). This two-level approach handles scale much better than TFA's one-filter-per-document design. It's bloom filters as an optimization layer, not a replacement.
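
A sketch of that two-level shape (my own assumptions: chunking by document count, four probes per term; illustrative, not the commenter's actual design):

    import hashlib
    import struct
    from collections import defaultdict

    def bloom_positions(term: str, m: int, k: int = 4):
        # k probe positions from one digest (double hashing)
        h1, h2 = struct.unpack_from("<QQ", hashlib.sha256(term.encode()).digest())
        return [(h1 + i * h2) % m for i in range(k)]

    class ChunkedIndex:
        """Inverted index split into chunks, one small bloom filter per chunk.
        The dictionary is shared across documents; the filters only decide
        which chunks are worth probing."""

        def __init__(self, docs_per_chunk: int = 1000, m: int = 1 << 16):
            self.docs_per_chunk, self.m = docs_per_chunk, m
            self.chunks = [defaultdict(list)]  # term -> [doc_ids]
            self.blooms = [0]                  # big int used as a bit array
            self.docs_in_chunk = 0

        def add(self, doc_id: int, terms: list[str]) -> None:
            if self.docs_in_chunk >= self.docs_per_chunk:
                self.chunks.append(defaultdict(list))
                self.blooms.append(0)
                self.docs_in_chunk = 0
            self.docs_in_chunk += 1
            for term in set(terms):
                self.chunks[-1][term].append(doc_id)
                for p in bloom_positions(term, self.m):
                    self.blooms[-1] |= 1 << p

        def lookup(self, term: str) -> list[int]:
            probes = bloom_positions(term, self.m)
            hits = []
            for chunk, bloom in zip(self.chunks, self.blooms):
                # "definitely not in this chunk" skips the dictionary probe
                if all(bloom >> p & 1 for p in probes):
                    hits.extend(chunk.get(term, []))
            return hits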

munchbunny•3mo ago
> that asymmetry is weirdly powerful for specific problems.

100% agree when it works in your favor. We use them for exactly that situation, where a non-zero false positive rate is fine and you can choose how much memory to devote to getting it closer to zero.

There have been a couple of times, though, where we've needed a "keep everything not on this list" operation, and unfortunately bloom filters don't work well for that situation. There are alternatives, but none as elegantly compact as bloom filters.

MattPalmer1086•3mo ago
May be true for offline full text search, but not true for online string search.

I invented a very fast string search algorithm based on bloom filters. Our paper [1] was accepted to the Symposium on Experimental Algorithms 2024 [2]. Code can be found here [3].

[1] https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.S...

[2] https://sea2024.univie.ac.at/accepted-papers/

[3] https://github.com/nishihatapalmer/HashChain

lll-o-lll•3mo ago
Looks really interesting! If I wanted to try one of these, which of the family would you recommend to start with?

MattPalmer1086•3mo ago
HashChain is the easiest to implement, but like many search algorithms it can suffer from quadratic performance on really bad data and patterns (e.g. searching for a long sequence of zero bytes in a text of zero bytes). In practice this is very rare.

If you want guaranteed linear performance in the worst case too, then LinearHashchain is the one to use. It is slightly more complex to implement, as it builds in a KMP-style verifier to make it linear (so it effectively combines two search algorithms). It is about as fast as HashChain in the average case, so you don't lose out.

The others are either experimental or quite niche and not suitable for most purposes. SentinelHashchain is actually the fastest, but relies on being able to add a copy of the pattern at the end of the search text. That won't be possible in most search contexts, unless you control all memory allocations.

So I'd start with HashChain, and maybe play with the linear version later - most of it is the same, it just needs a bit more added.

majke•3mo ago
Bing uses bloom filters for the most-recent index:

https://dl.acm.org/doi/pdf/10.1145/3077136.3080789

https://bitfunnel.org/strangeloop/

susam•3mo ago
When I worked at RSA over a decade ago, we developed Bloom filter-based indexing to speed up querying on a proprietary database that was specialised for storing petabytes of network events and packet data. I implemented the core Bloom filter-based indexer using MurmurHash2 functions, and I was quite proud of the work I did back then. The resulting improvement in query performance looked impressive to our customers. I remember the querying speed went up from roughly 49,000 records per second to roughly 1,490,000 records per second, so about a 30-fold increase.

However, the performance gain is not surprising at all since Bloom filters allow the querying engine to skip large blocks of data with certainty when the blocks do not contain the target data. False negatives are impossible. False positives occur but the rate of false positives can be made very small with well-chosen parameters and trade-offs.

With 4 hash functions (k = 4), 10007 bits per Bloom filter (m = 10007), and a new Bloom filter for every 1000 records (n = 1000), we achieved a theoretical false-positive rate of only 1.18% ((1 - e^(-k * n / m))^k = 0.0118). In practice, over a period of 5 years, we found that the actual false positive rate varied between 1.13% and 1.29%.

The only downside of a false positive is that it makes the query engine read a data block unnecessarily to verify whether the target data is present. This affects performance but not correctness; much like how CPU branch misprediction affects performance but not correctness.

A 30-fold increase in querying speed with just 1.25 kB of overhead per data block of 1000 records (each block roughly 1 MB to 2 MB in size) was, in my view, an excellent trade-off. It made a lot of difference to the customer experience, turning what used to be a 2-minute wait for query results into a wait of just about 5 seconds, or, for larger queries, reducing a 30-minute wait to about 1 minute.
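
Those numbers are easy to reproduce from the formula above (plain Python):

    import math

    k, m, n = 4, 10007, 1000  # hash functions, bits per filter, records per filter

    fp_rate = (1 - math.exp(-k * n / m)) ** k
    print(f"theoretical false-positive rate: {fp_rate:.4f}")  # 0.0118, i.e. 1.18%

    print(f"overhead per 1000-record block: {m / 8 / 1000:.2f} kB")  # 1.25 kB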

susam•3mo ago
I just noticed the m = 10007 value in my comment above and I thought I should clarify it. The number of bits per bloom filter does not need to be a prime number if the hash functions have uniform distribution. Murmur2 hash functions do have uniform distribution, so m was not chosen to be prime in order to reduce collisions in the Bloom filter's bit positions. The reason for using a prime value was more mundane.

This was a fairly large project, with roughly 3 million lines of C and C++ code which had numerous constants with special values defined throughout the code. So instead of using a less interesting number like 10000, I chose 10007 so that if we ever came across the value 10007 (decimal) or 0x2717 (hexadecimal) while inspecting a core dump in a debugger, we would immediately recognise it as the constant defining the number of bits per Bloom filter.

anonymars•3mo ago
Ha, collision avoidance all the way down

gkfasdfasdf•3mo ago
Interesting trick about the constant value, and thank you for the detailed write-up!

6510•3mo ago
I'm not an expert at all; I learned they existed after making something similar to search a few thousand blog posts.

Rather than one hash per file, I made a file for each 2-letter combination, like aa.raw, ab.raw, etc., where each bit in the file represents a record (bit 0 is file 0, etc.). You could of course do 3 or 4 letters too.

A query is split into 2-letter combinations (1st + 2nd, 2nd + 3rd, etc.), the files are loaded, and a bitwise AND is done on them.

A search for "bloom" would only load bl.raw, lo.raw, oo.raw, om.raw.

The index is really hard to update, but adding new records is easy. New records are first added as false positives until we have enough bits to push a byte to the end of each raw file.

I then got lost pondering which creative letter combinations would yield the best results. Things like xx.raw and an.raw are pretty useless. Words could be treated as if they were unique characters.

Characters (or other bytes) could be combined, like s=z, k=c, x=y or i=e.

Calculating which combination is best for the static data set was too hard for me. One could look at the ratio to see if it is worth having a letter combination.

But it works, and loading a handful of files or doing an AND is amazingly fast.
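
In code, the scheme reads roughly like this (a sketch, with in-memory bitmasks standing in for the aa.raw/ab.raw files):

    from collections import defaultdict

    class BigramBitsetIndex:
        """One bitset per 2-letter combination; bit i set means record i
        contains that bigram somewhere."""

        def __init__(self):
            self.bitsets = defaultdict(int)  # "bl" -> int used as a bit array
            self.records = []

        def add(self, text: str) -> int:
            rec_id = len(self.records)
            self.records.append(text)
            for a, b in zip(text, text[1:]):
                self.bitsets[a + b] |= 1 << rec_id
            return rec_id

        def search(self, query: str) -> list[int]:
            # AND the bitsets of every bigram in the query; surviving bits
            # are candidate records (false positives possible, so verify)
            mask = ~0
            for a, b in zip(query, query[1:]):
                mask &= self.bitsets.get(a + b, 0)
            return [i for i, text in enumerate(self.records)
                    if mask >> i & 1 and query in text]

    idx = BigramBitsetIndex()
    idx.add("bloom filters are neat")
    idx.add("inverted indexes are neat")
    print(idx.search("bloom"))  # consults only bl, lo, oo, om -> [0]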

susam•3mo ago
What you've described here is an n-gram inverted index (with n = 2) represented as a bitset. We could call it a bigram bitset inverted index. Glad to know you designed and implemented all of this from first principles, and that it serves you well!
hinkley•3mo ago
I had a problem where we needed to compare large data sets between machines for keys that existed in both, and the bandwidth cost just wasn’t mathing for the median result set size. I was trying to figure out how to send a fingerprint from machine A to B, then have machine B send the hits back. Or how many round trips I could do based on set size to minimize bandwidth + latency. I ended up with a calculus problem nobody could help me solve because of an n^5 term.

My boss was generally pretty good with obscure data structures but neither of us had encountered Bloom filters. This was about three years after Google published their paper on how they were using Bloom filters, but that company would be bankrupt before I figured it out.

UltraSane•3mo ago
Splunk uses bloom filters to make searching for rare events fast. Rare events are usually the most interesting.

hinkley•3mo ago
I’ve only used Splunk with one set of devs and maybe we were doing it wrong, but it didn’t feel fast to me.

Several of us were working hard to move everything into Prometheus that made any sense to have in Prometheus instead of Splunk.

Notably, any time we had a production issue where it was unclear which team was responsible, Splunk became the bottleneck, because we started exceeding quotas immediately.

UltraSane•3mo ago
Splunk is some of the best software I've ever used, but it HAS to be used with very fast storage to be effective. I've only used it on enterprise-grade storage arrays and servers with lots of RAM for caches. On modern PCIe 5.0 NVMe drives it is stupid fast.

I'm not sure what you mean by exceeding quotas, because Splunk is normally licensed on GB ingested per day. This can lead to bitter fights between teams over how this is allocated.

The good thing about this license model is that you can use as much hardware as you want for no extra license cost.

hinkley•3mo ago
> used with very fast storage

That sounds like self-hosting, which is not the only product they offer. But you still have hardware that can only run so many queries at once, and then it starts queuing any additional requests, yeah? Once we had a dozen people on a call, it went to shit. We only occasionally ran into problems like this with Graphite, but you need a lot of people looking at a very large dashboard to start feeling refresh delays.

NooneAtAll3•3mo ago
what's RSA?

kristianp•3mo ago
I assume it's RSA Security: https://en.wikipedia.org/wiki/RSA_Security

twotwotwo•3mo ago
There is a largish category of tools now where, unlike in OLTP systems, a big focus is scanning data quickly (O(n), but with a good constant): Redshift, Trino/Athena, ClickHouse, and DuckDB, among others.

Bloom filter indexing seems like a great fit if you ever need to do substring searches in a context like that, and for log searching in general. I haven't dug into which packages have it, but it looks like at least ClickHouse does: https://clickhouse.com/docs/optimize/skipping-indexes#bloom-...

hijinks•3mo ago
is there a better way than bloom filters to handle needle-in-the-haystack type searches where the haystack might be terabytes of data and you only want a few lines?

philipkglass•3mo ago
There are a lot of "better than Bloom" filters that work similarly in some aspects. I have used Cuckoo [1] and Ribbon [2] filters for Bloom-type applications. If you have an application where you do a lot of one kind of searching, it may also be worth implementing a specialized variant of a data structure. I needed a Cuckoo-type filter on the JVM, but only for 64-bit integers, and I was able to make a smaller, faster code base that was specialized to this data type instead of handling generic objects.

You need to know up front whether you need to be able to dynamically add entries to the filter or if your application can tolerate rebuilding the filter entirely whenever the underlying data changes. In the latter case you have more freedom to choose data structures; many of the modern "better than Bloom" filters are more compact but don't support dynamic updates.

[1] https://en.wikipedia.org/wiki/Cuckoo_filter

[2] https://engineering.fb.com/2021/07/09/core-infra/ribbon-filt...
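
For a feel of how the cuckoo variant works, a compact sketch (bucket size 4, 16-bit fingerprints; all parameters here are illustrative, not taken from either link):

    import hashlib
    import random
    import struct

    def h64(data: bytes) -> int:
        return struct.unpack_from("<Q", hashlib.sha256(data).digest())[0]

    class CuckooFilter:
        """Each item leaves a small fingerprint in one of two buckets.
        Unlike a bloom filter, deletion is possible (remove the fingerprint)."""

        def __init__(self, num_buckets: int, bucket_size: int = 4, max_kicks: int = 500):
            assert num_buckets & (num_buckets - 1) == 0  # power of two keeps i ^ h involutive
            self.mask = num_buckets - 1
            self.bucket_size, self.max_kicks = bucket_size, max_kicks
            self.buckets = [[] for _ in range(num_buckets)]

        def _fingerprint(self, item: str) -> int:
            return h64(item.encode()) % 0xFFFF + 1  # 16 bits, never zero

        def _alt(self, i: int, fp: int) -> int:
            # partial-key cuckoo hashing: the alternate bucket depends only on
            # the fingerprint, so evicted entries can move without the item
            return (i ^ h64(fp.to_bytes(2, "little"))) & self.mask

        def add(self, item: str) -> bool:
            fp = self._fingerprint(item)
            i1 = h64(b"idx:" + item.encode()) & self.mask
            for i in (i1, self._alt(i1, fp)):
                if len(self.buckets[i]) < self.bucket_size:
                    self.buckets[i].append(fp)
                    return True
            i = random.choice((i1, self._alt(i1, fp)))
            for _ in range(self.max_kicks):  # evict a resident fingerprint, relocate it
                j = random.randrange(len(self.buckets[i]))
                fp, self.buckets[i][j] = self.buckets[i][j], fp
                i = self._alt(i, fp)
                if len(self.buckets[i]) < self.bucket_size:
                    self.buckets[i].append(fp)
                    return True
            return False  # filter is too full

        def might_contain(self, item: str) -> bool:
            fp = self._fingerprint(item)
            i1 = h64(b"idx:" + item.encode()) & self.mask
            return fp in self.buckets[i1] or fp in self.buckets[self._alt(i1, fp)]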

hijinks•3mo ago
thanks.. i'll read up on these.. always amazes me that companies like Datadog somehow made log search quick

hinkley•3mo ago
I wonder how often in the wild people are tuning for a 1% false positive rate versus a much lower one, like .1%. You do quickly reach data set sizes where even 1% introduces some strain on resources or responsiveness.

Cuckoo claims 70% of the size of bloom for the same error rate, and the space is logarithmic in the error rate. Looks like about 6.6 bits per record versus 9.56 bits for bloom at 1%. But at a .5% error rate a cuckoo is 7.6 bpr. In fact you can get to about a .13% error rate for a cuckoo only a hair larger than the equivalent bloom filter (2^9.567 ≈ 758.5, i.e. roughly a 1-in-758 error rate).
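
The bloom half of that arithmetic is a one-liner; the cuckoo figures above come from the paper and aren't recomputed here:

    import math

    def bloom_bits_per_record(fp_rate: float) -> float:
        # optimal bloom filter sizing: m/n = -ln(eps) / (ln 2)^2
        return -math.log(fp_rate) / math.log(2) ** 2

    for eps in (0.01, 0.005, 0.0013):
        print(f"eps = {eps:.4f}: {bloom_bits_per_record(eps):.2f} bits per record")
    # eps = 0.0100: 9.59  (the ~9.56 quoted above)
    # eps = 0.0050: 11.03
    # eps = 0.0013: 13.83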

FreakLegion•3mo ago
Cuckoo filters can do even better with the small adjustment of using windows instead of buckets. See "3.5-Way Cuckoo Hashing for the Price of 2-and-a-Bit": https://scispace.com/pdf/3-5-way-cuckoo-hashing-for-the-pric.... (This significantly improves load factors rather than changing anything else about the filter, and ends up smaller than the semi-sorted variant for typical configurations, without the rigmarole.)

My fairly niche use case for these kinds of data structures was hardware firewalls running mostly on SRAM, which needed a sub one-in-a-billion false positive rate.

adamzwasserman•3mo ago
The "no sharing between filters" insight clicked for me on a different problem.

I needed to filter items by tags. Bloom filter per item seemed clever - quick membership checks. But with thousands of items sharing dozens of tags, each filter re-encodes the same vocabulary. Pure waste.

Switched to an inverted index (tag → item list) with bloom filters per chunk of the index. Now the tag vocabulary is shared, and bloom filters just speed up chunk-skipping when the index grows large.

TFA's mistake is using bloom filters *instead* of an inverted index rather than on top of one. The amortization patterns stack; they don't compete.

hinkley•3mo ago
Why do these “inverted indexes” just look like indexes to me? Too much time with databases perhaps?

adamzwasserman•3mo ago
A non-unique index, yes.

hinkley•3mo ago
Which is most indexes.

adamzwasserman•3mo ago
again, I agree

farsa•3mo ago
The distinction is clearer when indexing actual text and applying tokenization. A "typical" index on a database column goes like "column(value => rows)". When people mention inverted indexes, it's usually in the context of full-text search, where the column value goes through tokenization and you build an index for all N tokens of a column: "column:(token 1 => rows)", "column:(token 2 => rows)", ... "column:(token N => rows)".
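
The difference in miniature (hypothetical toy data; Python dicts standing in for both kinds of index):

    from collections import defaultdict

    rows = {1: "bloom filters are neat", 2: "neat inverted indexes"}

    # "typical" column index: whole value -> rows
    column_index = defaultdict(set)
    for row_id, value in rows.items():
        column_index[value].add(row_id)

    # inverted index: each token of the value -> rows
    inverted_index = defaultdict(set)
    for row_id, value in rows.items():
        for token in value.split():  # the tokenization step
            inverted_index[token].add(row_id)

    print(column_index.get("neat", set()))  # set(): only exact whole values match
    print(inverted_index["neat"])           # {1, 2}: token-level lookup
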
pi_22by7•3mo ago
The key insight about bloom filters lacking synergy is excellent. The ~7K document crossover point makes sense, because inverted indexes amortize dictionary storage across all documents while bloom filters must encode it linearly per document.

hinkley•3mo ago
But doesn’t that depend on the cardinality of the indexes versus the document count? I’ve seen systems with a stupid number of tag values.

pauldix•3mo ago
I believe you could do this effectively with COBS (COmpact Bit Sliced signature index): https://panthema.net/2019/1008-COBS-A-Compact-Bit-Sliced-Sig...

It's a pretty neat algorithm from a paper in 2019 for the application "to index k-mers of DNA samples or q-grams from text documents". You can take a collection of bloom filters built for documents and combine them into a single filter that will tell you which docs a query maps to. Like an inverted index meets a bloom filter.

I'm using it in a totally different domain for an upcoming release in InfluxDB (time series database).

There's also code online here: https://github.com/bingmann/cobs
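
The core bit-sliced layout is compact to sketch. This is the textbook signature-index idea that COBS builds on, not COBS itself (which adds compaction for differently sized documents):

    import hashlib
    import struct

    M, K = 4096, 3  # signature bits per document, probes per term

    def positions(term: str):
        h1, h2 = struct.unpack_from("<QQ", hashlib.sha256(term.encode()).digest())
        return [(h1 + i * h2) % M for i in range(K)]

    class BitSlicedIndex:
        """Store per-document bloom filters column-wise: one bitset of
        documents per bit position. A lookup ANDs K document-bitsets."""

        def __init__(self):
            self.slices = [0] * M  # slices[p]: bit d set => doc d has bit p set
            self.num_docs = 0

        def add(self, terms: list[str]) -> int:
            doc_id = self.num_docs
            self.num_docs += 1
            for t in set(terms):
                for p in positions(t):
                    self.slices[p] |= 1 << doc_id
            return doc_id

        def candidates(self, term: str) -> list[int]:
            mask = ~0
            for p in positions(term):
                mask &= self.slices[p]
            return [d for d in range(self.num_docs) if mask >> d & 1]

    idx = BitSlicedIndex()
    idx.add("bloom filters are neat".split())
    idx.add("time series data".split())
    print(idx.candidates("series"))  # [1], with false positives possible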

KevBurnsJr•3mo ago
Reminds me of @danthegoodman's project: bloomsearch [1]

[1] https://github.com/danthegoodman1/bloomsearch

taeric•3mo ago
I will forever think of Bloom filters as "bouncer filters." Could go with "concierge filter." Basically, it is the equivalent of every movie where the detective is asking the front desk about various attributes of whoever they are looking for.

It is not hard to see how you could start asking the front desk to track every obscure attribute, and should then expect it to fall over for various reasons.

hinkley•3mo ago
> Fun fact: There is a nice implementation of this exact algorithm that is still used in the wild.

I thought that was going to be a link to Google.com

pncnmnp•3mo ago
When my friends and I were undergrads (3rd year, I believe), we had an absolute blast exploring this exact topic - the intersection of Bloom filters and client-side searching. So much so that it became part of our undergrad thesis.

It all started when Stavros's blog was circulated on Hacker News! The way we approached the search part was by using "Spectral Bloom Filters" - https://theory.stanford.edu/~matias/papers/sbf-sigmod-03.pdf - which is based on a paper by Saar Cohen and Yossi Matias from the early 2000s; it's basically an iteration on counting bloom filters. We used the minimal selection and minimal increase algorithms from the paper for insertion and ranking of results.

I wrote a blog post on it too - https://pncnmnp.github.io/blogs/spectral-bloom-filters.html

Some slides - https://pncnmnp.github.io/blogs/sthir-talk-2020.pdf
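
The two rules named above are small enough to sketch. My reading of the paper's basic idea (not the code from the thesis):

    import hashlib
    import struct

    M, K = 1 << 16, 4  # counters, probes per term

    def positions(term: str):
        h1, h2 = struct.unpack_from("<QQ", hashlib.sha256(term.encode()).digest())
        return [(h1 + i * h2) % M for i in range(K)]

    class SpectralBloomFilter:
        """Counting bloom filter variant: estimates term frequency, which
        is what makes ranking search results possible."""

        def __init__(self):
            self.counters = [0] * M

        def add(self, term: str) -> None:
            pos = positions(term)
            low = min(self.counters[p] for p in pos)
            # minimal increase: only bump counters currently at the minimum,
            # which keeps overestimation (and ranking noise) down
            for p in pos:
                if self.counters[p] == low:
                    self.counters[p] += 1

        def estimate(self, term: str) -> int:
            # minimal selection: the smallest counter is the tightest estimate
            return min(self.counters[p] for p in positions(term))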

cristaloleg•3mo ago
Curious why no one has mentioned XOR filters yet

https://github.com/FastFilter

cristaloleg•3mo ago
Quick intro: https://lemire.me/blog/2019/12/19/xor-filters-faster-and-sma...