
Generalized K-Means Clustering

https://github.com/derrickburns/generalized-kmeans-clustering
41•derrickrburns•3mo ago

Comments

derrickrburns•3mo ago
# HackerNews Announcement

## Title

Generalized K-Means Clustering for Apache Spark with Bregman Divergences

## Body (3,982 characters)

I've built a production-ready K-Means library for Apache Spark that supports multiple distance functions beyond Euclidean.

*Why use this instead of Spark MLlib?*

MLlib's KMeans is hard-coded to Euclidean distance, which is mathematically wrong for many data types:

- *Probability distributions* (topic models, histograms): KL divergence is the natural metric. Euclidean treats [0.5, 0.3, 0.2] and [0.49, 0.31, 0.2] as similar even though they represent different distributions.
- *Audio/spectral data*: Itakura-Saito respects multiplicative power spectra. Euclidean incorrectly treats -20dB and -10dB as closer than -10dB and 0dB.
- *Count data* (traffic, sales): Generalized-I divergence for Poisson-distributed data.
- *Outlier robustness*: L1/Manhattan gives median-based clustering vs mean-based (L2).
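The audio bullet can be checked with a few lines of plain Scala. This is a minimal sketch, not code from the library: the object and function names are made up for illustration, and it assumes the dB values are converted to linear power before being compared.

```scala
object DivergenceDemo {
  // Squared Euclidean distance between two scalar power values.
  def squaredEuclidean(x: Double, y: Double): Double = {
    val d = x - y
    d * d
  }

  // Itakura-Saito divergence for positive scalars: x/y - ln(x/y) - 1.
  // It depends only on the ratio x/y, so it is invariant to rescaling.
  def itakuraSaito(x: Double, y: Double): Double = {
    require(x > 0 && y > 0, "Itakura-Saito needs strictly positive inputs")
    val r = x / y
    r - math.log(r) - 1.0
  }

  // Convert decibels to linear power (assumption: 0 dB maps to 1.0).
  def dbToPower(db: Double): Double = math.pow(10.0, db / 10.0)

  def main(args: Array[String]): Unit = {
    val (p20, p10, p0) = (dbToPower(-20), dbToPower(-10), dbToPower(0))
    // Squared Euclidean: the two 10 dB steps get very different distances.
    println(f"SE: ${squaredEuclidean(p20, p10)}%.4f vs ${squaredEuclidean(p10, p0)}%.4f")
    // Itakura-Saito: both 10 dB steps get the same divergence.
    println(f"IS: ${itakuraSaito(p20, p10)}%.4f vs ${itakuraSaito(p10, p0)}%.4f")
  }
}
```

Under these assumptions, squared Euclidean rates the -20dB/-10dB pair as much closer than the -10dB/0dB pair, while Itakura-Saito rates both 10 dB steps identically because it only sees the power ratio.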

Using the wrong divergence yields mathematically valid but semantically meaningless clusters.

*Available divergences:* KL, Itakura-Saito, L1/Manhattan, Generalized-I, Logistic Loss, Squared Euclidean

*What's included:*

- 6 algorithms: GeneralizedKMeans, BisectingKMeans, XMeans (auto k), SoftKMeans (fuzzy), StreamingKMeans, KMedoids
- Drop-in MLlib replacement (same DataFrame API)
- 740 tests, deterministic behavior, cross-version persistence (Spark 3.4↔3.5, Scala 2.12↔2.13)
- Automatic optimization (broadcast vs crossJoin based on k×dim to avoid OOM)
- Python and Scala APIs

*Example:*

```scala
// Clustering topic distributions from LDA
val topics: DataFrame = ??? // probability vectors

// WRONG: MLlib with Euclidean
new org.apache.spark.ml.clustering.KMeans()
  .setK(10)
  .fit(topics)

// CORRECT: KL divergence for probabilities
new GeneralizedKMeans()
  .setK(10)
  .setDivergence("kl")
  .fit(topics)

// For standard data, drop-in replacement:
new GeneralizedKMeans()
  .setDivergence("squaredEuclidean")
  .fit(numericData)
```

*Quick comparison:*

| Use Case | MLlib | This Library |
|----------|-------|--------------|
| General numeric | L2 | L2 (compatible) |
| Probability distributions | Wrong | KL divergence |
| Outlier-robust | | L1 or KMedoids |
| Auto k selection | | XMeans (BIC/AIC) |
| Fuzzy clustering | | SoftKMeans |

*Performance:* ~870 pts/sec (SE), ~3,400 pts/sec (KL) on modest hardware. Scales to billions of points with automatic strategy selection.

*Production-ready:*

- Cross-version model persistence
- Scalability guardrails (chunked assignment)
- Determinism tests (same seed → identical results)
- Performance regression detection
- Executable documentation

GitHub: https://github.com/derrickburns/generalized-kmeans-clusterin...

This started as an experiment to understand Bregman divergences. Surprisingly, KL divergence is often faster than Euclidean for probability data. Open to feedback!

seanhunter•3mo ago
For people who are unfamiliar, k-means is a partitioning algorithm that aims to group observations into a specific number (k) of clusters in such a way that each observation ends up in the cluster with the “nearest” mean. So say you want 5 groups, it will make five groups so that every observation is in the group where it’s nearest to the mean.

And so that raises the question of what “nearest” means, and here this allows you to replace Euclidean distance with things like Kullback-Leibler divergence (that’s the KL below), which makes more sense than Euclidean distance if you’re trying to measure how close two probability distributions are to each other.
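To make the "swap the notion of nearest" idea concrete, here is a minimal sketch of Lloyd-style k-means with a pluggable divergence, in plain Scala without Spark. It is not the library's implementation: every name is hypothetical, initial centers are passed in by hand, and KL is computed as KL(point || center) on strictly positive vectors. With the point as the first argument of a Bregman divergence, the cost-minimizing center of a cluster is its arithmetic mean, so the update step is the same for both divergences.

```scala
object BregmanKMeansSketch {
  type Point = Vector[Double]
  type Divergence = (Point, Point) => Double

  def squaredEuclidean(p: Point, q: Point): Double =
    p.zip(q).map { case (a, b) => (a - b) * (a - b) }.sum

  // KL(p || q); assumes strictly positive entries that sum to 1.
  def kl(p: Point, q: Point): Double =
    p.zip(q).map { case (a, b) => a * math.log(a / b) }.sum

  // Assignment step: index of the center with the smallest divergence.
  def assign(points: Seq[Point], centers: Seq[Point], d: Divergence): Seq[Int] =
    points.map(p => centers.indices.minBy(i => d(p, centers(i))))

  // Component-wise arithmetic mean of a non-empty set of points.
  def mean(ps: Seq[Point]): Point =
    ps.transpose.map(col => col.sum / ps.size).toVector

  // Update step: each center moves to the mean of its members;
  // a center with no members is left where it was.
  def update(points: Seq[Point], labels: Seq[Int], centers: Seq[Point]): Seq[Point] =
    centers.indices.map { i =>
      val members = points.zip(labels).collect { case (p, l) if l == i => p }
      if (members.isEmpty) centers(i) else mean(members)
    }

  // Fixed number of assign/update rounds starting from the given centers.
  def fit(points: Seq[Point], init: Seq[Point], d: Divergence, iters: Int = 10): Seq[Point] =
    (1 to iters).foldLeft(init) { (centers, _) =>
      update(points, assign(points, centers, d), centers)
    }

  def main(args: Array[String]): Unit = {
    val points = Vector(Vector(0.9, 0.1), Vector(0.8, 0.2), Vector(0.1, 0.9), Vector(0.2, 0.8))
    val init = Vector(Vector(0.6, 0.4), Vector(0.4, 0.6))
    println(fit(points, init, kl))
  }
}
```

Only the `d` argument changes between the Euclidean and KL variants; passing `squaredEuclidean` instead of `kl` gives ordinary k-means.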

nurettin•3mo ago
> And so that raises the question of what “nearest” means

To me, the definition of "nearest" is just a technicality.

The real question is: what is K?

mentalgear•3mo ago
Have you tried HDBSCAN (a DBSCAN variant) or Hierarchical Clustering (HAC)?
nurettin•3mo ago
Me? I probably tried every classification algorithm and their H variants. I still think "What is K?" is a profound question.
keeeba•3mo ago
I agree it is a profound question. My thesis is fairly boring.

For any given clustering task of interest, there is no single value of K.

Clustering & unsupervised machine learning is as much about creating meaning and structure as it is about discovering or revealing it.

Take the case of biological taxonomy, what K will best segment the animal kingdom?

There is no true value of K. If your answer is for a child, maybe it’s 7 corresponding to what we’re taught in school - mammals, birds, reptiles, amphibians, fish, and invertebrates.

If your answer is for a zoologist, obviously this won’t do.

Every clustering task of interest is like this. And I say of interest because clustering things like digits in the classic MNIST dataset is better posed as a classification problem - the categories are defined analytically.

seanhunter•3mo ago
K is whatever you want it to be. If you want 5 clusters, k=5. If you don’t know the right number of clusters, try a few different values of k and see which partitions your sample in a way that’s good for your problem.
Spivak•3mo ago
The total number of clusters. Determining this algorithmically is a fun open problem https://en.wikipedia.org/wiki/Determining_the_number_of_clus....

For the data I work with at $dayjob I've found the Silhouette algorithm to perform best but I assume it will be extremely field specific. Clustering your data and taking a representative sample of each cluster is such a powerful trick to make big data small but finding an appropriate K is an art more than a science.
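For readers who have not met it, the silhouette score can be sketched in a few lines. This is a plain Scala illustration under stated assumptions, not a reference implementation: names are hypothetical, the distance is Euclidean, singleton clusters score 0 by convention, and the cost is quadratic in the number of points, so it only suits small samples.

```scala
object SilhouetteSketch {
  type Point = Vector[Double]

  def euclidean(p: Point, q: Point): Double =
    math.sqrt(p.zip(q).map { case (a, b) => (a - b) * (a - b) }.sum)

  // Mean silhouette over all points; labels(i) is the cluster of points(i).
  // For each point: a = mean distance to its own cluster,
  // b = smallest mean distance to any other cluster, s = (b - a) / max(a, b).
  def meanSilhouette(points: IndexedSeq[Point], labels: IndexedSeq[Int]): Double = {
    require(points.size == labels.size, "one label per point")
    require(labels.distinct.size >= 2, "need at least two clusters")
    val byCluster = points.indices.groupBy(i => labels(i))
    val scores = points.indices.map { i =>
      val own = byCluster(labels(i)).filter(_ != i)
      if (own.isEmpty) 0.0 // convention: singleton clusters score 0
      else {
        val a = own.map(j => euclidean(points(i), points(j))).sum / own.size
        val b = byCluster.collect {
          case (c, members) if c != labels(i) =>
            members.map(j => euclidean(points(i), points(j))).sum / members.size
        }.min
        val m = math.max(a, b)
        if (m == 0.0) 0.0 else (b - a) / m
      }
    }
    scores.sum / scores.size
  }

  def main(args: Array[String]): Unit = {
    val points = Vector(Vector(0.0, 0.0), Vector(0.0, 1.0), Vector(10.0, 0.0), Vector(10.0, 1.0))
    println(meanSilhouette(points, Vector(0, 0, 1, 1))) // well-separated labelling
    println(meanSilhouette(points, Vector(0, 1, 0, 1))) // labelling that mixes the groups
  }
}
```

One way to use it for choosing k is to cluster with several candidate values, score each labelling, and prefer the k with the highest mean silhouette, keeping in mind that the result still depends on the distance and the domain.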

dcl•3mo ago
At a previous $dayjob at a very large financial institution, it's however many clusters are present in the strategy that was agreed to by the exec team and their highly paid consultants.

You find that many clusters and shoehorn the consultant provided categories on to the k clusters you obtain.

3abiton•3mo ago
To be fair, finding K is highly domain dependent, and I would argue it should not be for the analyst (solely) to decide, but decided with feedback from domain experts.
apwheele•3mo ago
Can folks comment on what applications they use k-means for? It was a basic technique I learned in school, but honestly I am not really familiar with a single use case that is very clearly motivated besides "pretty pictures".

So I do a bit of work in geospatial analysis, and hotspots are better represented by DBSCAN (it does not need to assign every point to a cluster). I just do not even use clustering very often in my gig (supervised ML and anomaly detection are much more prevalent in the rest of my work).

atiedebee•3mo ago
It's used for vector quantization, which can be used for color quantization.
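As a rough illustration of that use: once k-means has produced k centroid colors (the palette), each pixel is replaced by the index of its nearest centroid. The sketch below assumes the palette is already known and uses squared distance in RGB space; the names are hypothetical.

```scala
object ColorQuantizationSketch {
  type Rgb = (Int, Int, Int)

  // Squared Euclidean distance in RGB space.
  def squaredDistance(a: Rgb, b: Rgb): Int = {
    val (dr, dg, db) = (a._1 - b._1, a._2 - b._2, a._3 - b._3)
    dr * dr + dg * dg + db * db
  }

  // Replace each pixel with the index of its nearest palette color.
  // In a k-means setting the palette would be the k learned centroids.
  def quantize(pixels: Seq[Rgb], palette: IndexedSeq[Rgb]): Seq[Int] =
    pixels.map(p => palette.indices.minBy(i => squaredDistance(p, palette(i))))

  def main(args: Array[String]): Unit = {
    val palette = Vector((0, 0, 0), (255, 255, 255), (255, 0, 0))
    val pixels = Seq((10, 10, 10), (250, 240, 245), (200, 30, 20))
    println(quantize(pixels, palette))
  }
}
```

Storing one small palette index per pixel instead of three color channels is where the compression comes from.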