Generalized K-Means Clustering for Apache Spark with Bregman Divergences

https://github.com/derrickburns/generalized-kmeans-clustering

2•derrickrburns•2h ago

Comments

derrickrburns•2h ago

# HackerNews Announcement

## Title

Generalized K-Means Clustering for Apache Spark with Bregman Divergences

## Body (3,982 characters)

I've built a production-ready K-Means library for Apache Spark that supports multiple distance functions beyond Euclidean.

*Why use this instead of Spark MLlib?*

MLlib's KMeans is hard-coded to Euclidean distance, which is mathematically wrong for many data types:

- *Probability distributions* (topic models, histograms): KL divergence is the natural metric. Euclidean treats [0.5, 0.3, 0.2] and [0.49, 0.31, 0.2] as similar even though they represent different distributions. - *Audio/spectral data*: Itakura-Saito respects multiplicative power spectra. Euclidean incorrectly treats -20dB and -10dB as closer than -10dB and 0dB. - *Count data* (traffic, sales): Generalized-I divergence for Poisson-distributed data. - *Outlier robustness*: L1/Manhattan gives median-based clustering vs mean-based (L2).

Using the wrong divergence yields mathematically valid but semantically meaningless clusters.

*Available divergences:* KL, Itakura-Saito, L1/Manhattan, Generalized-I, Logistic Loss, Squared Euclidean

*What's included:* - 6 algorithms: GeneralizedKMeans, BisectingKMeans, XMeans (auto k), SoftKMeans (fuzzy), StreamingKMeans, KMedoids - Drop-in MLlib replacement (same DataFrame API) - 740 tests, deterministic behavior, cross-version persistence (Spark 3.4↔3.5, Scala 2.12↔2.13) - Automatic optimization (broadcast vs crossJoin based on k×dim to avoid OOM) - Python and Scala APIs

*Example:*

```scala // Clustering topic distributions from LDA val topics: DataFrame = // probability vectors

// WRONG: MLlib with Euclidean new org.apache.spark.ml.clustering.KMeans() .setK(10).fit(topics)

// CORRECT: KL divergence for probabilities new GeneralizedKMeans() .setK(10) .setDivergence("kl") .fit(topics)

// For standard data, drop-in replacement: new GeneralizedKMeans() .setDivergence("squaredEuclidean") .fit(numericData) ```

*Quick comparison:*

| Use Case | MLlib | This Library | |----------|-------|--------------| | General numeric | L2 | L2 (compatible) | | Probability distributions | Wrong | KL divergence | | Outlier-robust | | L1 or KMedoids | | Auto k selection | | XMeans (BIC/AIC) | | Fuzzy clustering | | SoftKMeans |

*Performance:* ~870 pts/sec (SE), ~3,400 pts/sec (KL) on modest hardware. Scales to billions of points with automatic strategy selection.

*Production-ready:* - Cross-version model persistence - Scalability guardrails (chunked assignment) - Determinism tests (same seed → identical results) - Performance regression detection - Executable documentation

GitHub: https://github.com/derrickburns/generalized-kmeans-clusterin...

This started as an experiment to understand Bregman divergences. Surprisingly, KL divergence is often faster than Euclidean for probability data. Open to feedback!

Show HN: RSS Feed Generator

NewsWaffle – Gemini:// News Gateway

Look at how unhinged GPU box art was in the 2000s

Show HN: Hokusai Pocket (WIP) – Portable GUIs with MRuby

Researchers demonstrate centimetre-level positioning using smartwatches

Show HN: ColorMinds – AI Coloring Pages Generator for All Ages

TikTok videos continue to push infostealers in ClickFix attacks

AI hype is excessive, but its productivity gains are real

Ask HN: What do you think about Hetzner pros and cons?

Google phases out Privacy Sandbox

The Warning Signs Lurking Below the Surface of a Record Market

William Gibson, Lisa Simpson and More on Their Favorite Pinch of Pynchon

Carefully Educated to Be Idiots

The Church of Interruption (2011)

Is the Stock Market Going to Crash in 2026?

Show HN: French Retirement Calculator: Baby Boomers vs. Working People

This Week in Gnome #221: Virus Season

General Fusion Sets World Record in Magnetized Target Fusion

City Unions OK Cost-Savings Health Plan Switch Despite Foggy Details

Power-over-Skin: Full-Body Wearables Powered by Intra-Body RF Energy (2024)

If optimizing for commitment doesn't work for you, optimize for balance instead

Chen-Ning (Frank) Yang 1922-2025

Show HN: Why system modeling should look like code, not PowerPoint

OpenAI's 'Embarrassing' Math

Investors should be wary of this year's frenzy for crypto treasuries

Jensen Huang says Nvidia went from 95% market share in China to 0%

Forth: The programming language that writes itself

Networking for Spies: Translating a Cyrillic Text with Claude Code

RSS is still pretty great (2024)

Ask HN: Best way to make a documentation website for an open-source project?

Show HN: RSS Feed Generator

NewsWaffle – Gemini:// News Gateway

Look at how unhinged GPU box art was in the 2000s

Show HN: Hokusai Pocket (WIP) – Portable GUIs with MRuby

Researchers demonstrate centimetre-level positioning using smartwatches

Show HN: ColorMinds – AI Coloring Pages Generator for All Ages

TikTok videos continue to push infostealers in ClickFix attacks

AI hype is excessive, but its productivity gains are real

Ask HN: What do you think about Hetzner pros and cons?

Google phases out Privacy Sandbox

The Warning Signs Lurking Below the Surface of a Record Market

William Gibson, Lisa Simpson and More on Their Favorite Pinch of Pynchon

Carefully Educated to Be Idiots

The Church of Interruption (2011)

Is the Stock Market Going to Crash in 2026?

Show HN: French Retirement Calculator: Baby Boomers vs. Working People

This Week in Gnome #221: Virus Season

General Fusion Sets World Record in Magnetized Target Fusion

City Unions OK Cost-Savings Health Plan Switch Despite Foggy Details

Power-over-Skin: Full-Body Wearables Powered by Intra-Body RF Energy (2024)

If optimizing for commitment doesn't work for you, optimize for balance instead

Chen-Ning (Frank) Yang 1922-2025

Show HN: Why system modeling should look like code, not PowerPoint

OpenAI's 'Embarrassing' Math

Investors should be wary of this year's frenzy for crypto treasuries

Jensen Huang says Nvidia went from 95% market share in China to 0%

Forth: The programming language that writes itself

Networking for Spies: Translating a Cyrillic Text with Claude Code

RSS is still pretty great (2024)

Ask HN: Best way to make a documentation website for an open-source project?

Generalized K-Means Clustering for Apache Spark with Bregman Divergences

Comments