frontpage.
newsnewestaskshowjobs

Made with ♥ by @iamnishanth

Open Source @Github

fp.

Open in hackernews

Generalized K-Means Clustering

https://github.com/derrickburns/generalized-kmeans-clustering
17•derrickrburns•5d ago

Comments

derrickrburns•5d ago
# HackerNews Announcement

## Title

Generalized K-Means Clustering for Apache Spark with Bregman Divergences

## Body (3,982 characters)

I've built a production-ready K-Means library for Apache Spark that supports multiple distance functions beyond Euclidean.

*Why use this instead of Spark MLlib?*

MLlib's KMeans is hard-coded to Euclidean distance, which is mathematically wrong for many data types:

- *Probability distributions* (topic models, histograms): KL divergence is the natural metric. Euclidean treats [0.5, 0.3, 0.2] and [0.49, 0.31, 0.2] as similar even though they represent different distributions. - *Audio/spectral data*: Itakura-Saito respects multiplicative power spectra. Euclidean incorrectly treats -20dB and -10dB as closer than -10dB and 0dB. - *Count data* (traffic, sales): Generalized-I divergence for Poisson-distributed data. - *Outlier robustness*: L1/Manhattan gives median-based clustering vs mean-based (L2).

Using the wrong divergence yields mathematically valid but semantically meaningless clusters.

*Available divergences:* KL, Itakura-Saito, L1/Manhattan, Generalized-I, Logistic Loss, Squared Euclidean

*What's included:* - 6 algorithms: GeneralizedKMeans, BisectingKMeans, XMeans (auto k), SoftKMeans (fuzzy), StreamingKMeans, KMedoids - Drop-in MLlib replacement (same DataFrame API) - 740 tests, deterministic behavior, cross-version persistence (Spark 3.4↔3.5, Scala 2.12↔2.13) - Automatic optimization (broadcast vs crossJoin based on k×dim to avoid OOM) - Python and Scala APIs

*Example:*

```scala // Clustering topic distributions from LDA val topics: DataFrame = // probability vectors

// WRONG: MLlib with Euclidean new org.apache.spark.ml.clustering.KMeans() .setK(10).fit(topics)

// CORRECT: KL divergence for probabilities new GeneralizedKMeans() .setK(10) .setDivergence("kl") .fit(topics)

// For standard data, drop-in replacement: new GeneralizedKMeans() .setDivergence("squaredEuclidean") .fit(numericData) ```

*Quick comparison:*

| Use Case | MLlib | This Library | |----------|-------|--------------| | General numeric | L2 | L2 (compatible) | | Probability distributions | Wrong | KL divergence | | Outlier-robust | | L1 or KMedoids | | Auto k selection | | XMeans (BIC/AIC) | | Fuzzy clustering | | SoftKMeans |

*Performance:* ~870 pts/sec (SE), ~3,400 pts/sec (KL) on modest hardware. Scales to billions of points with automatic strategy selection.

*Production-ready:* - Cross-version model persistence - Scalability guardrails (chunked assignment) - Determinism tests (same seed → identical results) - Performance regression detection - Executable documentation

GitHub: https://github.com/derrickburns/generalized-kmeans-clusterin...

This started as an experiment to understand Bregman divergences. Surprisingly, KL divergence is often faster than Euclidean for probability data. Open to feedback!

seanhunter•58m ago
For people who are unfamiliar, k-means is a partitioning algorithm that aims to group observations into a specific number (k) of clusters in such a way that each observation ends up in the cluster with the “nearest” mean. So say you want 5 groups, it will make five groups so that every observation is in the group where it’s nearest to the mean.

And so that raises the question of what “nearest” means, and here this allows you to replace Euclidian distance with things like Kullback-Leibler divergence (that’s the KL below) which make more sense than Euclidian distance if you’re trying to measure how close two probability distributions are to each other.

A worker fell into a nuclear reactor pool

https://www.nrc.gov/reading-rm/doc-collections/event-status/event/2025/20251022en?brid=vscAjql9kZ...
376•nvahalik•7h ago•232 comments

Pico-Banana-400k

https://github.com/apple/pico-banana-400k
208•dvrp•6h ago•22 comments

The Linux Boot Process: From Power Button to Kernel

https://www.0xkato.xyz/linux-boot/
223•0xkato•9h ago•49 comments

What If Tariffs?

https://www.swatch.com/en-en/what-if-tariffs-so34z106/SO34Z106.html
47•Erikun•36m ago•14 comments

California invests in battery energy storage, leaving rolling blackouts behind

https://www.latimes.com/environment/story/2025-10-17/california-made-it-through-another-summer-wi...
262•JumpCrisscross•13h ago•199 comments

The Journey Before main()

https://amit.prasad.me/blog/before-main
210•amitprasad•13h ago•74 comments

GenAI Image Editing Showdown

https://genai-showdown.specr.net/
75•rzk•6h ago•17 comments

PCB Edge USB C Connector Library

https://github.com/AnasMalas/pcb-edge-usb-c
70•walterbell•6h ago•33 comments

D2: Diagram Scripting Language

https://d2lang.com/tour/intro/
123•benzguo•10h ago•21 comments

Bitmovin (YC S15) Is Hiring Engineering ICs and Managers in Europe

https://bitmovin.com/careers
1•slederer•1h ago

Show HN: Chonky – a neural text semantic chunking goes multilingual

https://huggingface.co/mirth/chonky_mmbert_small_multilingual_1
29•hessdalenlight•20h ago•2 comments

Project Amplify: Powered footwear for running and walking

https://about.nike.com/en/newsroom/releases/nike-project-amplify-official-images
75•justinmayer•12h ago•80 comments

Show HN: Diagram as code tool with draggable customizations

https://github.com/RohanAdwankar/oxdraw
175•RohanAdwankar•12h ago•38 comments

NextSilicon reveals new processor chip in challenge to Intel, AMD

https://www.reuters.com/business/nextsilicon-reveals-new-processor-chip-challenge-intel-amd-2025-...
61•simojo•3d ago•12 comments

How programs get run: ELF binaries (2015)

https://lwn.net/Articles/631631/
98•st_goliath•11h ago•4 comments

Cubical Quad Antennas and Margaret's Letter

http://ei3lh.eu/?p=88
3•austinallegro•3d ago•0 comments

Why I code as a CTO

https://www.assembled.com/blog/why-i-code-as-a-cto
147•johnjwang•1d ago•100 comments

An Update on TinyKVM

https://fwsgonzo.medium.com/an-update-on-tinykvm-7a38518e57e9
110•ingve•12h ago•25 comments

Doctor Who archive expert shares positive update on missing episode

https://www.radiotimes.com/tv/sci-fi/doctor-who-missing-episodes-update-teases-announcement-newsu...
80•gnabgib•6d ago•39 comments

Agent Lightning: Train agents with RL (no code changes needed)

https://github.com/microsoft/agent-lightning
74•bakigul•12h ago•10 comments

AI, Wikipedia, and uncorrected machine translations of vulnerable languages

https://www.technologyreview.com/2025/09/25/1124005/ai-wikipedia-vulnerable-languages-doom-spiral/
92•kawera•13h ago•43 comments

WebDAV isn't dead yet

https://blog.feld.me/posts/2025/09/webdav-isnt-dead-yet/
164•toomuchtodo•1d ago•76 comments

Show HN: Shadcn/UI theme editor – Design and share Shadcn themes

https://shadcnthemer.com
105•miketromba•13h ago•33 comments

Rock Tumbler Instructions

https://rocktumbler.com/tips/rock-tumbler-instructions/
174•debo_•16h ago•82 comments

Generalized K-Means Clustering

https://github.com/derrickburns/generalized-kmeans-clustering
17•derrickrburns•5d ago•2 comments

Tsdown – The Elegant Bundler for Libraries

https://tsdown.dev/
16•jcbhmr•5h ago•5 comments

Passwords and Power Drills

https://google.github.io/building-secure-and-reliable-systems/raw/ch01.html#on_passwords_and_powe...
89•harporoeder•5d ago•20 comments

We do not have sufficient links to the UK for Online Safety Act to be applicable

https://libera.chat/news/advised
242•todsacerdoti•15h ago•73 comments

Making a micro Linux distro (2023)

https://popovicu.com/posts/making-a-micro-linux-distro/
170•turrini•19h ago•28 comments

An Efficient Implementation of SELF (1989) [pdf]

https://courses.cs.washington.edu/courses/cse501/15sp/papers/chambers.pdf
46•todsacerdoti•11h ago•22 comments