frontpage.
newsnewestaskshowjobs

Made with ♥ by @iamnishanth

Open Source @Github

Open in hackernews

Reservoir Sampling

https://samwho.dev/reservoir-sampling/
113•chrisdemarco•2h ago

Comments

wood_spirit•2h ago
I remember this turning up in a google interview back in the day. The interview was really expecting me not to know the algorithm and to flounder about trying to solve the problem from first principles. Was fun to just shortcut things by knowing the answer that time.
owyn•2h ago
Yeah, this was a google interview question for me too. I didn't know the algorithm and floundered around trying to solve the problem. I came up with the 1/n and k/n selection strategy but still didn't get the job lol. I think the guy who interviewed me was just killing time until lunch.

I like the visualizations in this article, really good explanation.

dekhn•1h ago
I didn't know about the algorithm until after I got hired there. It's actually really useful in a number of contexts, but my favorite was using it to find optimal split points for sharding lexicographically sorted string keys for mapping. Often you will have a sorted table, but the underlying distribution of keys isn't known, so uniform sharding will often cause imbalances where some mappers end up doing far more work than others. I don't know if there is a convenient open source class to do this.
wood_spirit•55m ago
Interesting idea, hadn’t that about that way to apply it.

I knew it from before my interview from a turbo pascal program I had seen that sampled dat tape backups of patient records from a hospital system. These samples were used for studies. That was a textbook example of it’s utility.

dekhn•32m ago
I guess the question in my mind is: would you expect a smart person who did not previously know this problem (or really much random sampling at all) to come up with the algorithm on the fly in an interview? And if the person had seen it before and memorized the answer, does that provide any signal of their ability to code?
samwho•2h ago
Hello! o/

I’m the author of this post. Happy to answer any questions, and love to get feedback.

The code for all of my posts can be found at https://github.com/samwho/visualisations and is MIT licensed, so you’re welcome to use it :)

malwrar•1h ago
Love your website’s design, I find all of interactivity, the dog character as an “audience”, and even the font/color/layout wonderful. Loved the article too!
samwho•1h ago
Thank you so much!

The dogs on the playing cards were commissioned just for this post. They’re all made by the wonderful https://www.andycarolan.com/.

The colour palette is the Wong palette that I learned about from https://davidmathlogic.com/colorblind/.

Oh, and you can pet the dogs. :)

lol768•19m ago
It would've been easy to just use green for the held card and red for the discard pile.

Thank you for using a colour-blind friendly palette; as someone with deuteranopia :)

rdtsc•1h ago
Well done, I really like the animations and the explanation. Especially the case where it's a graph and we can drag ahead or click "shuffle 100 times"

One thing that threw me for a bit is when it switched from the intro of picking 3 cards at random from a deck of 10 or 436,234 to picking just one card. It's seems as if it almost needs a section heading before "Now let me throw you a curveball: what if I were to show you 1 card at a time, and you had to pick 1 at random?" indicating that now we're switching to a simplifying assumption that we're holding only 1 card not 3, but we also don't know the size of the deck.

pandaxtc•1h ago
This is awesome, I loved the interactivity!
istjohn•1h ago
Does your blog have an RSS feed?
fiddlerwoaroof•49m ago
Does this method compose with itself? E.g. if I implement reservoir sampling in my service and then the log collector service implements reservoir sampling, is the result the same as if only the log collector implemented it?
stygiansonic•2h ago
Great article and nice explanation. I believe this describes “Algorithm R” in this paper from Vitter, who was probably the first to describe it: https://www.cs.umd.edu/~samir/498/vitter.pdf
t55•2h ago
great article!
phillipcarter•2h ago
This is a great post that also illustrates the tradeoffs inherent in telemetry collection (traces, logs, metrics) for analysis. It's a capital-H Hard space to operate in that a lot of developers either don't know about, or take for granted.
sadiq•2h ago
This is a really nicely written and illustrated post.

An advanced extension to this is that there are algorithms which calculate the number of records to skip rather than doing a trial per record. This has a good write-up of them: https://richardstartin.github.io/posts/reservoir-sampling

magicalhippo•1h ago
We lived in a rural area when I was a kid. My dad told me once that his buddy had to measure the ptarmigan[1] population in the mountains each year as part of his job.

He did this by hiking a fixed route, and at fixed intervals scare the birds so they would fly and count.

The total count was submitted to some office which used it to estimate the population.

One year he had to travel abroad when the counting had to be done, so he recruited a friend and explained in detail how to do it.

However when the day of the counting arrived his friend forgot, and it was a huge hassle anyway so he just submitted a number he figured was about right, and that was that.

Then one day the following year, the local newspaper had a frontpage headline stating "record increase in ptarmigan population".

The reason it was big news was that the population estimate was used to set the hunting quotas, something his friend had not considered...

[1]: https://en.wikipedia.org/wiki/Rock_ptarmigan

foxbee•1h ago
Wonderful illustrations and writing. Real interesting read.
hinkley•56m ago
This reminds me that I need to spend more time thinking about the algorithm the allies used to count German tanks by serial number. The people in the field estimated about 5x as many tanks as were actually produced but the serial number trick was over 90% accurate.
dekhn•34m ago
https://en.wikipedia.org/wiki/German_tank_problem
pixelbeat•46m ago
FWIW GNU coreutils' shuf uses reservoir sampling for larger inputs to give bounded memory operation

Bill Gates Accuses Elon Musk of 'Killing Children' by Cutting Foreign Aid

https://www.nytimes.com/2025/05/08/us/bill-gates-elon-musk-killing-children.html
1•breadwinner•1m ago•0 comments

Supporting Independent Businesses Should Be as Easy as Finding Starbucks

https://www.electro-app.com/home
1•piotrsirko•1m ago•0 comments

Engineers create a robot that can jump 10 feet high–without legs

https://techxplore.com/news/2025-04-robot-feet-high-legs.html
1•PaulHoule•3m ago•0 comments

$100K/day cloud bill isn't a Bug – it's by Design

https://old.reddit.com/r/aethernet/comments/1khyt39/100kday_cloud_bill_isnt_a_bug_its_by_design
1•todsacerdoti•3m ago•0 comments

Hard-Earned Lessons from 2 Years of Improving AI Applications

https://blog.ragas.io/hard-earned-lessons-from-2-years-of-improving-ai-applications
1•amrrs•4m ago•0 comments

Open Source SLM Trained for MCP

https://osmosis.ai/blog/applying-rl-mcp
3•KaseyZhang•5m ago•1 comments

Apple Says Google Searches Down on Safari and Google Says Searches Are Up

https://www.seroundtable.com/apple-vs-google-search-changes-39380.html
1•belter•6m ago•0 comments

Photo Library Export Tool for Mac

https://apps.apple.com/en/app/fotomediathek-export/id6741324048
1•HackerMichl•6m ago•0 comments

Structured Outputs by Example

https://structuredoutputsbyexamples.com/
1•jxnl•7m ago•0 comments

I built a meeting scheduler in a month, and it got 500 signups in 24 hours

https://www.warmcal.com
1•ac1990•8m ago•1 comments

Why Google Search Deal Is Critical for Firefox's Future

https://www.omgubuntu.co.uk/2025/05/mozilla-says-google-search-deal-vital-to-firefoxs-survival
1•StanAngeloff•9m ago•0 comments

Don't Look at Stock Markets. Look at the Ports

https://www.theatlantic.com/economy/archive/2025/05/trump-tariff-shipping-ports/682673/
1•paulpauper•11m ago•0 comments

Here’s How To Handle A Recession If The Job Market Were To Plummet

https://www.forbes.com/sites/eliamdur/2025/05/03/heres-how-to-handle-a-recession-if-the-job-market-were-to-plummet/
1•paulpauper•12m ago•0 comments

How to start a school with your friends

https://prigoose.substack.com/p/how-to-start-a-university
1•geverett•13m ago•1 comments

Details Avoid Bias

https://www.overcomingbias.com/p/details-avoid-bias
1•paulpauper•13m ago•0 comments

Fighting Unwanted Notifications with Machine Learning in Chrome

https://blog.chromium.org/2025/05/fighting-unwanted-notifications-with.html
1•feross•13m ago•0 comments

Level-5 CEO says games being made 80-90% by AI "aesthetic sense" a must for devs

https://automaton-media.com/en/news/level-5-ceo-says-games-are-now-being-made-80-90-by-ai-making-aesthetic-sense-a-must-for-developers/
1•msephton•15m ago•1 comments

QUIC restarts, slow problems: udpgrm to the rescue

https://blog.cloudflare.com/quic-restarts-slow-problems-udpgrm-to-the-rescue/
1•emot•24m ago•0 comments

DOGEs K Schutt's computer infected by malware, credentials found in stealer logs

https://micahflee.com/doge-bro-kyle-schutts-computer-infected-by-malware-credentials-found-in-stealer-logs/
5•cycomanic•25m ago•0 comments

Why Everyone's Talking About Crypto Payments in 2025

https://caizcoin.medium.com/paying-with-stablecoins-why-everyones-talking-about-crypto-payments-in-2025-162330290bf7
1•byte-bolter•27m ago•2 comments

Uber's Journey to Ray on Kubernetes

https://www.infoq.com/news/2025/05/uber-journey-ray-kubernetes/
1•mooreds•28m ago•0 comments

JetBrains AI Pro now included in regular subscription

https://www.jetbrains.com/ai-ides/
2•amit9gupta•28m ago•0 comments

The Apache Attic

https://attic.apache.org//
1•smartmic•29m ago•0 comments

Highlights from the Comments on AI GeoGuessr

https://www.astralcodexten.com/p/highlights-from-the-comments-on-ai
2•feross•31m ago•0 comments

The Screamer – a yell-on yell-off light

https://rulethepla.net/the-screamer/
1•eieio•31m ago•0 comments

The Rise and Fall of the Visual Telegraph (2017)

https://parisianfields.com/2017/11/05/the-rise-and-fall-of-the-visual-telegraph/
3•geox•34m ago•0 comments

Visual Stufio Code version 1.100

https://code.visualstudio.com/updates/v1_100
3•pookieinc•35m ago•1 comments

US Internet traffic rose after announcement of first American pope

https://twitter.com/CloudflareRadar/status/1920552388038938840
2•emot•38m ago•0 comments

Science Is Shaped by Wikipedia: Evidence from a Randomized Control Trial [pdf]

https://ide.mit.edu/sites/default/files/publications/SSRN-id3039505.pdf
1•joebig•39m ago•0 comments

The Truth About AI-Assisted Interviews

https://synervoz.com/blog/the-truth-about-ai-assisted-interviews/
1•trolleycrash•39m ago•0 comments