
Generalized K-Means Clustering

https://github.com/derrickburns/generalized-kmeans-clustering
2•derrickrburns•2h ago

Comments

derrickrburns•2h ago
# HackerNews Announcement

## Title

Generalized K-Means Clustering for Apache Spark with Bregman Divergences

## Body (3,982 characters)

I've built a production-ready K-Means library for Apache Spark that supports multiple distance functions beyond Euclidean.

*Why use this instead of Spark MLlib?*

MLlib's KMeans is hard-coded to Euclidean distance, which is mathematically wrong for many data types:

- *Probability distributions* (topic models, histograms): KL divergence is the natural dissimilarity measure. Euclidean treats [0.5, 0.3, 0.2] and [0.49, 0.31, 0.2] as similar even though they represent different distributions.
- *Audio/spectral data*: Itakura-Saito respects the multiplicative structure of power spectra. Euclidean distance on linear power treats -20dB and -10dB as closer than -10dB and 0dB, even though both pairs differ by the same 10dB ratio.
- *Count data* (traffic, sales): Generalized-I divergence suits Poisson-distributed data.
- *Outlier robustness*: L1/Manhattan gives median-based clustering instead of mean-based (L2).

Using the wrong divergence yields mathematically valid but semantically meaningless clusters.
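To make the list above concrete, here is a minimal pure-Python sketch of the textbook definitions of these divergences. It is not the library's code: the function names are hypothetical, and inputs are assumed to be strictly positive. It checks the dB claim numerically and shows why L1 leads to medians.

```python
import math
import statistics

def squared_euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kl_divergence(p, q):
    # KL(p || q); assumes strictly positive entries that sum to 1.
    return sum(a * math.log(a / b) for a, b in zip(p, q))

def itakura_saito(p, q):
    # Depends only on the ratios a / b, so it is scale invariant.
    return sum(a / b - math.log(a / b) - 1.0 for a, b in zip(p, q))

def generalized_i(p, q):
    # KL plus a correction term for unnormalized (count) data.
    return sum(a * math.log(a / b) - a + b for a, b in zip(p, q))

def db_to_power(db):
    return 10.0 ** (db / 10.0)

low, mid, high = (db_to_power(x) for x in (-20.0, -10.0, 0.0))

# Squared Euclidean on linear power sees the quiet pair as much closer.
se_low_mid = squared_euclidean([low], [mid])
se_mid_high = squared_euclidean([mid], [high])

# Itakura-Saito sees two equal 10 dB steps.
is_low_mid = itakura_saito([low], [mid])
is_mid_high = itakura_saito([mid], [high])

# KL is asymmetric, so it is a divergence rather than a true metric.
p = [0.5, 0.3, 0.2]
q = [0.8, 0.1, 0.1]
kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)

# The median minimizes summed L1 distance and ignores the outlier;
# the mean minimizes summed squared distance and is dragged toward it.
values = [1.0, 2.0, 3.0, 4.0, 100.0]
center_l1 = statistics.median(values)
center_l2 = statistics.mean(values)
```

Here `center_l1` stays at 3.0 while `center_l2` moves to 22.0, which is the outlier-robustness point in one line.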

*Available divergences:* KL, Itakura-Saito, L1/Manhattan, Generalized-I, Logistic Loss, Squared Euclidean

*What's included:*

- 6 algorithms: GeneralizedKMeans, BisectingKMeans, XMeans (auto k), SoftKMeans (fuzzy), StreamingKMeans, KMedoids
- Drop-in MLlib replacement (same DataFrame API)
- 740 tests, deterministic behavior, cross-version persistence (Spark 3.4↔3.5, Scala 2.12↔2.13)
- Automatic optimization (broadcast vs crossJoin based on k×dim to avoid OOM)
- Python and Scala APIs
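For readers new to k-means with a Bregman divergence, the core loop can be sketched in a few lines of plain Python. This is an illustrative single-machine version, not the library's Spark implementation; `bregman_kmeans` and its parameters are hypothetical names. It relies on a known property of Bregman divergences: with the data point as the first argument, the arithmetic mean of a cluster minimizes the total divergence, so the centroid update is the same as in ordinary k-means. That property does not cover L1, whose center is the median.

```python
import math

def kl(p, q):
    # KL(p || q); assumes q is strictly positive.
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0.0)

def mean(points):
    n = len(points)
    return [sum(column) / n for column in zip(*points)]

def bregman_kmeans(points, initial_centers, divergence, iterations=10):
    centers = [list(c) for c in initial_centers]
    k = len(centers)
    assignment = []
    for _ in range(iterations):
        # Assignment step: nearest center under the chosen divergence.
        assignment = [
            min(range(k), key=lambda j: divergence(x, centers[j]))
            for x in points
        ]
        # Update step: the arithmetic mean is the Bregman centroid.
        for j in range(k):
            members = [x for x, a in zip(points, assignment) if a == j]
            if members:
                centers[j] = mean(members)
    return centers, assignment

# Toy data: two groups of three-topic probability vectors.
points = [
    [0.80, 0.10, 0.10], [0.70, 0.20, 0.10], [0.75, 0.15, 0.10],
    [0.10, 0.10, 0.80], [0.10, 0.20, 0.70], [0.15, 0.10, 0.75],
]
centers, assignment = bregman_kmeans(points, [points[0], points[3]], kl)
```

Initial centers are passed in explicitly to keep the sketch deterministic. Because each center is a mean of probability vectors, it is itself a valid probability vector.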

*Example:*

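```scala
import org.apache.spark.sql.DataFrame
// GeneralizedKMeans is provided by this library; import it from the library's package.

// Clustering topic distributions from LDA
val topics: DataFrame = ???  // placeholder: rows of probability vectors

// WRONG: MLlib with Euclidean
new org.apache.spark.ml.clustering.KMeans()
  .setK(10)
  .fit(topics)

// CORRECT: KL divergence for probabilities
new GeneralizedKMeans()
  .setK(10)
  .setDivergence("kl")
  .fit(topics)

// For standard data, drop-in replacement:
val numericData: DataFrame = ???  // placeholder: rows of general numeric features
new GeneralizedKMeans()
  .setDivergence("squaredEuclidean")
  .fit(numericData)
```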

*Quick comparison:*

| Use Case | MLlib | This Library |
|----------|-------|--------------|
| General numeric | L2 | L2 (compatible) |
| Probability distributions | Wrong | KL divergence |
| Outlier-robust | | L1 or KMedoids |
| Auto k selection | | XMeans (BIC/AIC) |
| Fuzzy clustering | | SoftKMeans |

*Performance:* ~870 pts/sec (Squared Euclidean), ~3,400 pts/sec (KL) on modest hardware. Scales to billions of points with automatic strategy selection.

*Production-ready:*

- Cross-version model persistence
- Scalability guardrails (chunked assignment)
- Determinism tests (same seed → identical results)
- Performance regression detection
- Executable documentation
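The "same seed → identical results" guarantee in the list above is the kind of property a test can pin down directly. The sketch below shows the idea in plain Python with a hypothetical `pick_initial_centers` helper. It is not taken from the library's test suite and only covers seeded initialization, which is usually the main source of randomness in k-means.

```python
import random

def pick_initial_centers(points, k, seed):
    # A private generator keeps the result independent of global random state.
    rng = random.Random(seed)
    return rng.sample(points, k)

points = [[float(i), float(i * i)] for i in range(20)]

first = pick_initial_centers(points, 3, seed=42)
second = pick_initial_centers(points, 3, seed=42)

# A determinism test asserts that two runs with the same seed agree exactly.
runs_match = first == second
```

In a distributed setting the same check also needs a stable ordering of the input rows, since partitioning can otherwise change which points a seeded sampler sees.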

GitHub: https://github.com/derrickburns/generalized-kmeans-clusterin...

This started as an experiment to understand Bregman divergences. Surprisingly, KL divergence is often faster than Euclidean for probability data. Open to feedback!

Ten Lessons I Wish I Had Been Taught (1997) [pdf]

https://www.ams.org/notices/199701/comm-rota.pdf
1•fi-le•20s ago•0 comments

Data Centers Are Getting Big

https://www.distilled.earth/p/these-data-centers-are-getting-really
1•walterbell•3m ago•0 comments

Vaclav Smil on why there will be no energy transition

https://energyskeptic.com/2025/vaclav-smil-on-why-there-will-be-no-energy-transition/
1•measurablefunc•3m ago•0 comments

Apple, Samsung Report Underwhelming Sales of Their New Thin Smartphones

https://www.macrumors.com/2025/10/17/iphone-air-production-to-be-cut-amid-lower-sales/
2•m463•4m ago•0 comments

Show HN: Predictive Thermal Management for a Production Phone Server

1•DaSettingsPNGN•4m ago•0 comments

OnionWatch: Monitor Sites Inside the Tor Network and Beyond

https://onionwatch.app/
1•falkensmaze66•5m ago•1 comment

Svalbard

https://photoblog.nk412.com/Svalbard2024/n-ssC8fP/Svalminiphotoblog
1•grilledchickenw•5m ago•0 comments

Advances in AI will boost productivity, living standards over time

https://www.dallasfed.org/research/economics/2025/0624
1•fcpguru•6m ago•0 comments

Ask HN: Why isn't Amazon.com impacted by AWS outages?

2•trevoragilbert•6m ago•1 comment

Tech Brief: AI Sycophancy and OpenAI

https://www.law.georgetown.edu/tech-institute/insights/tech-brief-ai-sycophancy-openai-2/
1•jruohonen•6m ago•0 comments

iOS 26.1 lets users control Liquid Glass transparency

https://www.macrumors.com/2025/10/20/ios-26-1-liquid-glass-toggle/
2•dabinat•6m ago•0 comments

Ultra-low-power ring-based wireless tinymouse

https://dl.acm.org/doi/10.1145/3746059.3747615
1•PaulHoule•7m ago•0 comments

J.P. Morgan's OpenAI loan is strange

https://marketunpack.com/j-p-morgans-openai-loan-is-strange/
8•vrnvu•7m ago•0 comments

Glitches in Sora 2 World

https://old.reddit.com/r/SoraAi/comments/1oba4bw/glitches_in_sora_2_world/
1•taesiri•7m ago•0 comments

When a stadium adds AI to everything, it's worse experience for everyone

https://a.wholelottanothing.org/bmo-stadium-in-la-added-ai-to-everything-and-what-they-got-was-a-...
3•wawayanda•8m ago•0 comments

JM Smucker sues Trader Joe's over 'copycat' Uncrustables sandwiches

https://www.fooddive.com/news/jm-smucker-sues-trader-joes-over-copycat-uncrustables-sandwiches/80...
1•sizzle•8m ago•0 comments

Ask HN: Is pivoting and focus shifting betrayal of existing users?

2•taubek•9m ago•2 comments

'This is definitely my last TwitchCon': Streamer Emiru assaulted at event

https://www.pcgamer.com/gaming-industry/this-is-definitely-my-last-twitchcon-high-profile-streame...
1•TMWNN•9m ago•1 comment

Banks Misplaced Some Mortgages

https://www.bloomberg.com/opinion/newsletters/2025-10-20/banks-misplaced-some-mortgages
2•ioblomov•10m ago•1 comment

Krea Realtime 14B: an open-source real-time video model

https://www.krea.ai/blog/krea-realtime-14b
1•dvrp•12m ago•0 comments

Ask HN: Is AI not creative because its creators don't care about creativity?

2•amichail•13m ago•0 comments

Job Interviews Are Broken

https://www.theatlantic.com/technology/2025/10/ai-cheating-job-interviews-fraud/684568/
1•cebert•15m ago•2 comments

Oxlint JavaScript Plugin Support

https://voidzero.dev/posts/announcing-oxlint-js-plugins
1•kevinak•16m ago•0 comments

DocMind, Streamlit Application Leveraging LlamaIndex, LangGraph, and LLM

https://github.com/BjornMelin/docmind-ai-llm
1•nashashmi•17m ago•0 comments

Can Sonnet 4.5 hack a network?

https://www.incalmo.ai/blog/2025/10/01/sonnet_eval/
2•bsingerzero•18m ago•0 comments

Dependency on monopoly cloud providers a security vulnerability?

https://www.euronews.com/next/2025/10/20/huge-internet-outage-hits-mobile-apps-and-websites-such-...
3•DaveZale•19m ago•0 comments

AI Trading in Real Markets

https://nof1.ai/
1•hiddencost•19m ago•2 comments

The AI Cloud

https://rauchg.com/2025/the-ai-cloud#pages-to-agents
2•tamasnet•20m ago•0 comments

The Price of E. Coli

https://press.asimov.com/articles/price-of-ecoli
1•mailyk•24m ago•0 comments

Bedside Manners: Can empathy be taught in medicine?

https://harpers.org/archive/2025/10/bedside-manners-rachel-pearson-empathy-medical-education/
1•Vigier•26m ago•0 comments