
Bloom filters are good for search that does not scale

https://notpeerreviewed.com/blog/bloom-filters/
74•birdculture•5h ago

Comments

sanskarix•3h ago
the beautiful thing about bloom filters is they let you say "definitely not here" without checking everything. that asymmetry is weirdly powerful for specific problems.

I've seen them save startups real money in caching layers - checking "did we already process this event" before hitting the database. false positives are fine because you just check the database anyway, but true negatives save you thousands of queries.

the trick is recognizing when false positives don't hurt you. most engineers learn about them in theory but never find that practical use case where they're actually the right tool. same with skip lists and a bunch of other algorithms - brilliant for 2% of problems, overkill for the rest.
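A minimal sketch of that "did we already process this event" check, assuming a small in-memory Bloom filter in front of the database. The `BloomFilter` and `EventDeduplicator` classes, the filter size, and the SHA-256 double hashing are illustrative assumptions, and a Python set stands in for the real database:

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter; size and hash scheme are illustrative assumptions."""

    def __init__(self, num_bits: int = 8192, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str) -> list:
        # Derive the bit positions from two halves of one SHA-256 digest
        # (double hashing). A production filter would likely use a faster hash.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


class EventDeduplicator:
    """Hypothetical wrapper; a set stands in for the authoritative database."""

    def __init__(self) -> None:
        self.seen_filter = BloomFilter()
        self.database = set()
        self.db_lookups = 0

    def already_processed(self, event_id: str) -> bool:
        if not self.seen_filter.might_contain(event_id):
            return False  # definite miss: no database query needed
        self.db_lookups += 1  # possible hit: confirm against the database
        return event_id in self.database

    def mark_processed(self, event_id: str) -> None:
        self.database.add(event_id)
        self.seen_filter.add(event_id)


dedup = EventDeduplicator()
first_seen = dedup.already_processed("evt-1")   # filter is empty: no lookup
dedup.mark_processed("evt-1")
second_seen = dedup.already_processed("evt-1")  # filter hit: one lookup
```

Only the second call reaches the database. A false positive would cost one extra lookup but would never produce a wrong answer.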

MattPalmer1086•3h ago
May be true for offline full text search, but not true for online string search.

I invented a very fast string search algorithm based on bloom filters. Our paper [1] was accepted to the Symposium on Experimental Algorithms 2024 [2]. Code can be found here [3].

[1] https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.S...

[2] https://sea2024.univie.ac.at/accepted-papers/

[3] https://github.com/nishihatapalmer/HashChain

majke•3h ago
Bing uses bloom filters for the most-recent index:

https://dl.acm.org/doi/pdf/10.1145/3077136.3080789

https://bitfunnel.org/strangeloop/

susam•2h ago
When I worked at RSA over a decade ago, we developed Bloom filter-based indexing to speed up querying on a proprietary database that was specialised for storing petabytes of network events and packet data. I implemented the core Bloom filter-based indexer based on MurmurHash2 functions and I was quite proud of the work I did back then. The resulting improvement in query performance looked impressive to our customers. I remember the querying speed went up from roughly 49,000 records per second to roughly 1,490,000 records per second, so roughly a 30-fold increase.

However, the performance gain is not surprising at all since Bloom filters allow the querying engine to skip large blocks of data with certainty when the blocks do not contain the target data. False negatives are impossible. False positives occur but the rate of false positives can be made very small with well-chosen parameters and trade-offs.

With 4 hash functions (k = 4), 10007 bits per bloom filter (m = 10007) and a new bloom filter for every 1000 records (n = 1000), we achieved a theoretical false-positive rate of only 1.18% ((1 - e^(-k * n / m)) ^ k = 0.0118). In practice, over a period of 5 years, we found that the actual false positive rate varied between 1.13% and 1.29%.
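The arithmetic behind that figure can be checked with a short sketch. The function name is hypothetical; the formula is the standard approximation for a Bloom filter with k hash functions, n inserted items, and m bits:

```python
import math


def bloom_false_positive_rate(k: int, n: int, m: int) -> float:
    """Approximate false-positive rate: (1 - e^(-k * n / m)) ^ k."""
    return (1.0 - math.exp(-k * n / m)) ** k


# Parameters quoted in the comment above.
rate = bloom_false_positive_rate(k=4, n=1000, m=10007)
print(f"theoretical false-positive rate: {rate:.4f}")  # 0.0118

# Storage per filter: 10007 bits is about 1251 bytes, i.e. roughly 1.25 kB.
filter_bytes = 10007 / 8
print(f"filter size: {filter_bytes:.0f} bytes")
```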

The only downside of a false positive is that it makes the query engine read a data block unnecessarily to verify whether the target data is present. This affects performance but not correctness; much like how CPU branch misprediction affects performance but not correctness.
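A rough sketch of the block-skipping idea, under stated assumptions: one filter per block of 1000 records with m = 10007 and k = 4 as quoted above, SHA-256 double hashing standing in for the MurmurHash2 functions, and Python lists standing in for on-disk blocks. All function names here are hypothetical, not the original implementation:

```python
import hashlib

M_BITS = 10007     # bits per filter, as quoted above
K_HASHES = 4       # hash functions per filter
BLOCK_SIZE = 1000  # records per block


def bit_positions(value: str) -> list:
    # Assumption: SHA-256 based double hashing replaces MurmurHash2 here.
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return [(h1 + i * h2) % M_BITS for i in range(K_HASHES)]


def build_filter(records: list) -> int:
    # A Python int serves as the bit array for one block.
    bits = 0
    for record in records:
        for pos in bit_positions(record):
            bits |= 1 << pos
    return bits


def query(blocks: list, filters: list, target: str) -> tuple:
    """Return (indexes of blocks containing target, number of blocks read)."""
    matches, blocks_read = [], 0
    for index, (block, bits) in enumerate(zip(blocks, filters)):
        if any(not (bits >> pos) & 1 for pos in bit_positions(target)):
            continue  # filter says "definitely absent": skip the block
        blocks_read += 1  # may be a false positive, so verify by reading
        if target in block:
            matches.append(index)
    return matches, blocks_read


blocks = [[f"record-{b}-{i}" for i in range(BLOCK_SIZE)] for b in range(20)]
filters = [build_filter(block) for block in blocks]
matches, blocks_read = query(blocks, filters, "record-7-123")
print(matches, blocks_read)
```

The block that holds the target is always read, because false negatives are impossible. Any extra block read is a false positive, which costs time but does not change the result.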

A 30-fold increase in querying speed with just 1.25 kB of overhead per data block of 1000 records (each block roughly 1 MB to 2 MB in size) was, in my view, an excellent trade-off. It made a lot of difference to the customer experience, turning what used to be a 2 minute wait for query results into a wait of just about 5 seconds, or in larger queries, reducing a 30 minute wait to about 1 minute.

susam•1h ago
I just noticed the m = 10007 value in my comment above and I thought I should clarify it. The number of bits per bloom filter does not need to be a prime number if the hash functions have uniform distribution. Murmur2 hash functions do have uniform distribution, so m was not chosen to be prime in order to reduce collisions in the Bloom filter's bit positions. The reason for using a prime value was more mundane.

This was a fairly large project, with roughly 3 million lines of C and C++ code which had numerous constants with special values defined throughout the code. So instead of using a less interesting number like 10000, I chose 10007 so that if we ever came across the value 10007 (decimal) or 0x2717 (hexadecimal) while inspecting a core dump in a debugger, we would immediately recognise it as the constant defining the number of bits per Bloom filter.

anonymars•1h ago
Ha, collision avoidance all the way down

gkfasdfasdf•1h ago
Interesting trick about the constant value, and thank you for the detailed write-up!

6510•21m ago
I'm not an expert at all; I learned they exist after making something similar to search a few thousand blog posts.

Rather than one hash per file, I made a file for each 2-letter combination, like aa.raw, ab.raw, etc., where each bit in the file represents a record (bit 0 is file 0, etc.). You could of course do 3 or 4 letters too.

A query is split into 2-letter combinations (1st + 2nd, 2nd + 3rd, etc.), the corresponding files are loaded, and a bitwise AND is done on them.

a search for "bloom" would only load bl.raw, lo.raw, oo.raw, om.raw
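A minimal in-memory sketch of this scheme, assuming a dict of Python ints in place of the .raw files. The function names are hypothetical, and the final substring check in `search` is an addition that filters out false positives; it is not part of the description above:

```python
from collections import defaultdict


def bigrams(text: str) -> set:
    """All adjacent two-character combinations in the lowercased text."""
    text = text.lower()
    return {text[i:i + 2] for i in range(len(text) - 1)}


def build_index(records: list) -> dict:
    # One bitset per bigram (the in-memory analogue of aa.raw, ab.raw, ...);
    # bit i is set when record i contains that bigram.
    index = defaultdict(int)
    for record_id, text in enumerate(records):
        for gram in bigrams(text):
            index[gram] |= 1 << record_id
    return index


def candidate_bits(index: dict, query: str) -> int:
    """Bitwise AND of the bitsets for every bigram in the query."""
    grams = bigrams(query)
    if not grams:
        return 0
    bits = -1  # all bits set; the first AND narrows it down
    for gram in grams:
        bits &= index.get(gram, 0)
    return bits


def search(index: dict, records: list, query: str) -> list:
    bits = candidate_bits(index, query)
    # Candidates can be false positives, so confirm with a substring check.
    return [
        i for i, text in enumerate(records)
        if (bits >> i) & 1 and query.lower() in text.lower()
    ]


records = [
    "Bloom filters are fun",
    "A blog about moons and looms",
    "Skip lists",
]
index = build_index(records)
print(bin(candidate_bits(index, "bloom")))  # 0b11
print(search(index, records, "bloom"))      # [0]
```

Record 1 contains bl, lo, oo and om without containing "bloom", so it survives the AND as a false positive and is dropped only by the verification step.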

The index is really hard to update but adding new records is easy. New records are first added as false positives until we have enough bits to push a byte to the end of each raw file.

I then got lost pondering what creative letter combinations would yield the best results. Things like xx.raw and an.raw are pretty useless. Words could be treated as if they were unique characters.

Characters (or other bytes) could be combined like s=z, k=c, x=y or i=e

Calculating which combination is best for the static data set was too hard for me. One could look at the ratio to see if it is worth having a letter combination.

But it works, and loading a handful of files or doing an AND is amazingly fast.

susam•7m ago
Nice! Thanks for sharing. What you've described here sounds like an n-gram inverted index with n = 2, represented as a bitset. If I could give it a name, I'd call it a bigram bitset inverted index or BBII (or B2I2) for short. Glad to know you designed and implemented all of this from first principles, and that it serves you well!