
An overengineered solution to `sort | uniq -c` with 25x throughput (hist)

https://github.com/noamteyssier/hist-rs
47•noamteyssier•4d ago

Comments

noamteyssier•4d ago
Was sitting around in meetings today and remembered an old shell script I used to count the number of unique lines in a file. Gave it a shot in Rust, and with a little bit of (over-engineering)™ I managed to get 25x the throughput of the naive coreutils approach and to improve on some existing tools as well.

Some notes on the improvements:

1. using csv (serde) for writing leads to some big gains

2. arena allocation of incoming keys + storing references in the hashmap instead of owned values heavily reduces the number of allocations and improves cache efficiency (I'm guessing, I did not measure); a sketch of the idea follows below.
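
A minimal sketch of the idea (not the actual hist code; bumpalo is assumed here as the arena allocator, and a real implementation would reuse a read buffer):

  use bumpalo::Bump; // assumed dependency: bumpalo = "3"
  use std::collections::HashMap;
  use std::io::{self, BufRead};

  fn main() -> io::Result<()> {
      let arena = Bump::new();
      // Keys are &str borrowed from the arena: each unique line is copied
      // into the arena once, and the map stores only a pointer per key.
      let mut counts: HashMap<&str, u64> = HashMap::new();
      let stdin = io::stdin();
      for line in stdin.lock().lines() {
          let line = line?;
          match counts.get_mut(line.as_str()) {
              Some(n) => *n += 1,
              None => {
                  counts.insert(arena.alloc_str(&line), 1);
              }
          }
      }
      let mut rows: Vec<(&str, u64)> = counts.into_iter().collect();
      rows.sort_unstable_by(|a, b| b.1.cmp(&a.1)); // sort unique keys once, at the end
      for (line, n) in rows {
          println!("{n}\t{line}");
      }
      Ok(())
  }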

There is some regex functionality and some table filtering built in as well.

happy hacking

theemptiness•1h ago
Small semantics nit: it is not overengineered, it is engineered. You wanted more throughput; the collection of coreutils tools was designed for flexibility, not throughput.

It is not difficult to construct scenarios where throughput matters, but IMHO that does not determine engineering vs. overengineering. What matters is whether there are requirements that need to be met. Debating the requirements is possible, but that doesn't take away from whether a solution obtained with reasonable effort meets the spec. Overengineering is about unreasonable effort, which could lead to overshooting the requirements; it is not about unreasonable requirements.

mabster•1h ago
We had similar thoughts about "premature optimisation" in the games industry. That is, it's better to have prematurely optimised things than to find that "everything is slow". But I guess in that context there are many, many "inner-most loops" to optimise.
chii•1h ago
> That is it's better to have prematurely optimised things than finding "everything is slow".

Or you find that you've optimized a game that is unfun to play and thus doesn't sell, even though it runs fast...

dbdr•1h ago
> using csv (serde) for writing leads to some big gains

Could you explain that, if you have the time? Is that for writing the output lines? Is actual CSV functionality used? That crate says "Fast CSV parsing with support for serde", so I'm especially confused about how that helps with writing.
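
For reference, the csv crate ships a serde-backed Writer alongside the parser. A hypothetical sketch of what writing the output through it might look like (the function and the buffering rationale are assumptions, not hist's actual code):

  use std::error::Error;
  use std::io;

  // Hypothetical sketch (assumed dependency: csv = "1"): the Writer buffers
  // rows internally and serializes each record via serde.
  fn write_counts(rows: &[(u64, &str)]) -> Result<(), Box<dyn Error>> {
      let mut wtr = csv::WriterBuilder::new()
          .delimiter(b'\t')
          .from_writer(io::stdout());
      for &(count, line) in rows {
          wtr.serialize((count, line))?; // one (count, line) record per output row
      }
      wtr.flush()?;
      Ok(())
  }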

flowerthoughts•1h ago
The win here might be using a HashMap to avoid having to sort all entries, then sorting only the unique keys at the end. What's the ratio of duplicates in the benchmark input?

There is no text encoding processing, so this only works for single-byte encodings. That probably speeds it up a little bit.

Depending on the size of the benchmark input, sort(1) may have done disk-based sorting. What's the size of the benchmark input?

wodenokoto•1h ago
To me, the really big win would be _not_ having to sort at all. An option to keep the first or last duplicate and remove all others while keeping line order is usually what I need.
mabster•1h ago
I've written this kind of function so many times it's not funny. I usually want something that is fed from an iterator, removes duplicates, and yields values as soon as possible.
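
A minimal sketch of such an adaptor (hypothetical; keeps the first occurrence and yields it immediately):

  use std::collections::HashSet;

  // Wrap any iterator of lines, drop duplicates, yield values as they arrive.
  fn dedup_first<I>(iter: I) -> impl Iterator<Item = String>
  where
      I: Iterator<Item = String>,
  {
      let mut seen = HashSet::new();
      // HashSet::insert returns false for values it already contains.
      iter.filter(move |line| seen.insert(line.clone()))
  }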
thaumasiotes•1h ago
That's easy to do if you're keeping the first duplicate. It becomes complex if you're keeping the last duplicate, because every time you find a duplicate you have to go back through your "output" and delete the earlier occurrence.

You could do an annotating pass to learn which occurrence of each line is the last one, and then a follow-up pass that prints (or otherwise echoes) only the lines that are the last of their kind. Technically still faster than sorting.

You could also keep the information on the last occurrence of each line in the hash map (that's where it's going to be anyway), and once you're done with the first pass, sort the map by earliest last occurrence. That will get you the lines in the right order, but you had to do a sort. If the original input was mostly duplicates, this is probably a better approach.
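
A rough sketch of that map-then-sort variant (hypothetical code, not from any of the tools discussed):

  use std::collections::HashMap;

  // Keep the last occurrence of each line, then restore input order with
  // one sort over the unique lines only.
  fn dedup_keep_last(lines: Vec<String>) -> Vec<String> {
      let mut last: HashMap<String, usize> = HashMap::new();
      for (i, line) in lines.into_iter().enumerate() {
          last.insert(line, i); // later occurrences overwrite earlier ones
      }
      let mut rows: Vec<(usize, String)> = last.into_iter().map(|(l, i)| (i, l)).collect();
      rows.sort_unstable_by_key(|&(i, _)| i); // sort by position of last occurrence
      rows.into_iter().map(|(_, l)| l).collect()
  }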

You could also track last occurrence of each line in a separate self-sorting structure. Now you have slightly more overhead while processing the input, and sorting the output is free.

vlovich123•1h ago
Why does this test against sort | uniq | sort? It’s kind of weird to sort twice, no?
BuildTheRobots•1h ago
It's something I've done myself in the past. The first sort is because the input needs to be sorted for uniq -c to count properly; the second sort is because uniq doesn't always give the output in the right order.
evertedsphere•1h ago
more precisely, uniq produces output in the same order as the input to it, just collapsing runs / run-length encoding it
Aaron2222•1h ago

  sort | uniq -c | sort -n
The second sort is sorting by frequency (the count output by `uniq -c`).
gucci-on-fleek•1h ago
The first "sort" sorts the input lines lexicographically (which is required for "uniq" to work); the second "sort" sorts the output of "uniq" numerically (so that lines are ordered from most-frequent to least-frequent):

  $ echo c a b c | tr ' ' '\n'
  c
  a
  b
  c
  
  $ echo c a b c | tr ' ' '\n' | sort
  a
  b
  c
  c
  
  $ echo c a b c | tr ' ' '\n' | sort | uniq -c
        1 a
        1 b
        2 c
  
  $ echo c a b c | tr ' ' '\n' | sort | uniq -c | sort -rn
        2 c
        1 b
        1 a
zX41ZdbW•1h ago
This and similar tasks can be solved efficiently with clickhouse-local [1]. Example:

    ch --input-format LineAsString --query "SELECT line, count() AS c GROUP BY line ORDER BY c DESC" < data.txt
I've tested it and it is faster than both sort and this Rust code:

    time LC_ALL=C sort data.txt | uniq -c | sort -rn > /dev/null
    32 sec.

    time hist data.txt > /dev/null
    14 sec.

    time ch --input-format LineAsString --query "SELECT line, count() AS c GROUP BY line ORDER BY c DESC" < data.txt > /dev/null
    2.7 sec.
It is like a Swiss Army knife for data processing: it can solve various tasks, such as joining data from multiple files and data sources, processing various binary and text formats, converting between them, and accessing external databases.

[1] https://clickhouse.com/docs/operations/utilities/clickhouse-...

nasretdinov•1h ago
To be more fair you could also add SETTINGS max_threads=1 though?
supermatt•17m ago
How is that “more fair”?
nasretdinov•1h ago
Note that by default the sort command has a pretty low memory limit and spills to disk. You can improve the throughput quite a bit by increasing the allowed memory usage: --buffer-size=SIZE
noctune•1h ago
I built something similar a few years ago for `sort | uniq -d` using sketches. The downside is that you need two passes, but it's still faster overall than sorting: https://github.com/mpdn/sketch-duplicates
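
The two-pass shape looks roughly like this (a simplified sketch; exact HashSets stand in for the compact probabilistic sketches the linked tool uses):

  use std::collections::HashSet;
  use std::fs::File;
  use std::io::{self, BufRead, BufReader};

  // Two-pass `sort | uniq -d` equivalent.
  fn print_duplicates(path: &str) -> io::Result<()> {
      let mut seen = HashSet::new();
      let mut dup = HashSet::new();
      // Pass 1: learn which lines occur more than once.
      for line in BufReader::new(File::open(path)?).lines() {
          let line = line?;
          if !seen.insert(line.clone()) {
              dup.insert(line);
          }
      }
      // Pass 2: print each duplicated line once, in first-occurrence order.
      for line in BufReader::new(File::open(path)?).lines() {
          let line = line?;
          if dup.remove(&line) {
              println!("{line}");
          }
      }
      Ok(())
  }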
Someone•51m ago
> I am measuring the performance of equivalent cat <file> | sort | uniq -c | sort -n functionality.

It likely won’t matter much here, but invoking cat is unnecessary.

   sort <file> | uniq -c | sort -n
will do the job just fine. GNU’s sort also has a few flags controlling buffer size and parallelism. Those may matter more (see https://www.gnu.org/software/coreutils/manual/html_node/sort...)
donatj•45m ago
I created "unic" a number of years ago because I needed to get the unique lines from a giant file without losing the order in which they initially appeared. It achieves this using a Cuckoo filter, so it's pretty dang quick about it; faster than sorting a large file in memory, for sure.

https://github.com/donatj/unic

scaredginger•26m ago
Looks like the impl uses a HashMap. I'd be curious about how a trie or some other specialized string data structure would compare here.
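
For comparison's sake, a toy byte-trie counter might look like the sketch below (hypothetical; a serious attempt would use a flatter, cache-friendly node layout rather than per-node hash maps):

  use std::collections::HashMap;

  // Toy byte-level trie for counting lines: each byte of a line walks one
  // level down, and the terminal node holds the count.
  #[derive(Default)]
  struct Trie {
      count: u64,
      children: HashMap<u8, Trie>,
  }

  impl Trie {
      fn insert(&mut self, key: &[u8]) {
          let mut node = self;
          for &b in key {
              node = node.children.entry(b).or_default();
          }
          node.count += 1;
      }
  }

The per-node HashMap keeps the sketch short, but it is also exactly the overhead a serious trie would try to avoid.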
ukuina•22m ago
Neat!

Are there any tools that tolerate slight mismatches across lines while combining them (e.g., a timestamp, or only one text word changing)?

I attempted this with a vector DB, but computing embeddings for millions of lines is prohibitive, especially on CPU.

majke•1m ago
I thought my mmuniq held the crown!

https://blog.cloudflare.com/when-bloom-filters-dont-bloom/

https://github.com/majek/mmuniq

frontpage.

How I turned Zig into my favorite language to write network programs in

https://lalinsky.com/2025/10/26/zio-async-io-for-zig.html
189•0x1997•8h ago•54 comments

Recall for Linux

https://github.com/rolflobker/recall-for-linux
7•anticensor•1h ago•3 comments

An overengineered solution to `sort | uniq -c` with 25x throughput (hist)

https://github.com/noamteyssier/hist-rs
48•noamteyssier•4d ago•24 comments

Why I'm teaching kids to hack computers

https://www.hacktivate.app/why-teach-kids-to-hack
28•twostraws•4d ago•8 comments

Show HN: MyraOS – My 32-bit operating system in C and ASM (Hack Club project)

https://github.com/dvir-biton/MyraOS
176•dvirbt•11h ago•36 comments

Show HN: Write Go code in JavaScript files

https://www.npmjs.com/package/vite-plugin-use-golang
24•yar-kravtsov•2h ago•9 comments

Sandhill cranes have adopted a Canada gosling

https://www.smithsonianmag.com/science-nature/these-sandhill-cranes-have-adopted-a-canadian-gosli...
99•NaOH•4d ago•17 comments

You already have a Git server

https://maurycyz.com/misc/easy_git/
503•chmaynard•21h ago•354 comments

Are-we-fast-yet implementations in Oberon, C++, C, Pascal, Micron and Luon

https://github.com/rochus-keller/Are-we-fast-yet
59•luismedel•9h ago•12 comments

Structure and Interpretation of Classical Mechanics

https://tgvaughan.github.io/sicm/toc.html
10•the-mitr•4h ago•1 comment

Ken Thompson recalls Unix's rowdy, lock-picking origins

https://thenewstack.io/ken-thompson-recalls-unixs-rowdy-lock-picking-origins/
148•dxs•15h ago•17 comments

A definition of AGI

https://arxiv.org/abs/2510.18212
211•pegasus•14h ago•333 comments

We Saved $500k per Year by Rolling Our Own "S3"

https://engineering.nanit.com/how-we-saved-500-000-per-year-by-rolling-our-own-s3-6caec1ee1143
179•mpweiher•11h ago•147 comments

Sphere Computer – The Innovative 1970s Computer Company Everyone Forgot

https://sphere.computer/
55•ChrisArchitect•3d ago•4 comments

A bug that taught me more about PyTorch than years of using it

https://elanapearl.github.io/blog/2025/the-bug-that-taught-me-pytorch/
379•bblcla•3d ago•74 comments

NORAD’s Cheyenne Mountain Combat Center, c.1966

https://flashbak.com/norad-cheyenne-mountain-combat-center-478804/
113•zdw•6d ago•57 comments

Tamper-Sensing Meshes Using Low-Cost, Embedded Time-Domain Reflectometry

https://jaseg.de/blog/paper-sampling-mesh-monitor/
11•luu•3d ago•1 comment

Feed the bots

https://maurycyz.com/misc/the_cost_of_trash/
200•chmaynard•20h ago•149 comments

A Looking Glass Half Empty, Part 2: A Series of Unfortunate Events

https://www.filfre.net/2025/10/a-looking-glass-half-empty-part-2-a-series-of-unfortunate-events/
16•ibobev•6d ago•0 comments

Researchers demonstrate centimetre-level positioning using smartwatches

https://www.otago.ac.nz/news/newsroom/researchers-demonstrate-centimetre-level-positioning-using-...
51•geox•1w ago•15 comments

Asbestosis

https://diamondgeezer.blogspot.com/2025/10/asbestosis.html
252•zeristor•1d ago•187 comments

Poison, Poison Everywhere

https://loeber.substack.com/p/29-poison-poison-everywhere
195•dividendpayee•9h ago•100 comments

Show HN: Helium Browser for Android with extensions support, based on Vanadium

https://github.com/jqssun/android-helium-browser
49•jqssun•9h ago•19 comments

Eavesdropping on Internal Networks via Unencrypted Satellites

https://satcom.sysnet.ucsd.edu/
197•Bogdanp•6d ago•30 comments

Books by People – Defending Organic Literature in an AI World

https://booksbypeople.org/
81•ChrisArchitect•15h ago•78 comments

Wren: A classy little scripting language

https://wren.io/
156•Lyngbakr•4d ago•43 comments

Microsoft 365 Copilot – Arbitrary Data Exfiltration via Mermaid Diagrams

https://www.adamlogue.com/microsoft-365-copilot-arbitrary-data-exfiltration-via-mermaid-diagrams-...
185•gnabgib•9h ago•33 comments

Searching for Charles Fourier in the ruins of a socialist utopia outside LA

https://kubicki.org/letters/the-dogs-of-llano-del-rio-i/
21•kosmavision•1w ago•1 comment

Making the Electron Microscope

https://www.asimov.press/p/electron-microscope
73•mailyk•15h ago•8 comments

Downloadable movie posters from the 40s, 50s, 60s, and 70s

https://hrc.contentdm.oclc.org/digital/collection/p15878coll84/search
443•bookofjoe•1w ago•84 comments