frontpage.

Show HN: RSS Feed Generator

https://github.com/ctkcoding/rss-feed-generator
1•ctkhn•18s ago•0 comments

NewsWaffle – Gemini:// News Gateway

https://github.com/acidus99/NewsWaffle
1•rickcarlino•1m ago•0 comments

Look at how unhinged GPU box art was in the 2000s

https://www.xda-developers.com/absolutely-unhinged-gpu-box-art-from-the-early-2000s/
4•m-hodges•5m ago•0 comments

Show HN: Hokusai Pocket (WIP) – Portable GUIs with MRuby

https://codeberg.org/skinnyjames/hokusai-pocket
2•zero-st4rs•6m ago•0 comments

Researchers demonstrate centimetre-level positioning using smartwatches

https://www.otago.ac.nz/news/newsroom/researchers-demonstrate-centimetre-level-positioning-using-...
2•geox•10m ago•0 comments

Show HN: ColorMinds – AI Coloring Pages Generator for All Ages

https://www.colormindsai.com
2•learningstone•12m ago•0 comments

TikTok videos continue to push infostealers in ClickFix attacks

https://www.bleepingcomputer.com/news/security/tiktok-videos-continue-to-push-infostealers-in-cli...
2•josephcsible•14m ago•0 comments

AI hype is excessive, but its productivity gains are real

https://www.pcloadletter.dev/blog/ai-hype-and-productivity/
3•ronbenton•20m ago•0 comments

Ask HN: What do you think about Hetzner pros and cons?

3•jerawaj740•21m ago•0 comments

Google phases out Privacy Sandbox

https://www.forbes.com/sites/zakdoffman/2025/10/19/phased-out-google-confirms-bad-news-for-all-3-...
2•pier25•24m ago•1 comment

The Warning Signs Lurking Below the Surface of a Record Market

https://www.wsj.com/finance/stocks/the-warning-signs-lurking-below-the-surface-of-a-record-market...
2•zerosizedweasle•25m ago•0 comments

William Gibson, Lisa Simpson and More on Their Favorite Pinch of Pynchon

https://www.nytimes.com/2025/10/18/books/review/thomas-pynchon-the-simpsons-william-gibson.html
2•pseudolus•27m ago•1 comment

Carefully Educated to Be Idiots

https://www.hilarylayne.com/p/very-carefully-educated-to-be-idiots
4•DavidPiper•30m ago•1 comment

The Church of Interruption (2011)

https://sambleckley.com/writing/church-of-interruption.html
4•sestep•33m ago•3 comments

Is the Stock Market Going to Crash in 2026?

https://www.fool.com/investing/2025/10/19/is-the-stock-market-going-to-crash-in-2026-2-histo/
3•salkahfi•34m ago•1 comment

Show HN: French Retirement Calculator: Baby Boomers vs. Working People

https://julienreszka.github.io/retraites/
2•julienreszka•36m ago•0 comments

This Week in Gnome #221: Virus Season

https://thisweek.gnome.org/posts/2025/10/twig-221/
2•samtheDamned•37m ago•0 comments

General Fusion Sets World Record in Magnetized Target Fusion

https://glassalmanac.com/canada-breaks-world-record-with-600-million-neutrons-per-second-bringing...
3•pseudolus•38m ago•0 comments

City Unions OK Cost-Savings Health Plan Switch Despite Foggy Details

https://www.thecity.nyc/2025/09/30/municipal-unions-emblemhealth-unitedhealthcare/
3•PaulHoule•39m ago•0 comments

Power-over-Skin: Full-Body Wearables Powered by Intra-Body RF Energy (2024)

https://dl.acm.org/doi/10.1145/3654777.3676394
2•zdw•40m ago•0 comments

If optimizing for commitment doesn't work for you, optimize for balance instead

https://herbertlui.net/if-optimizing-for-commitment-doesnt-work-for-you-optimize-for-balance-inst...
3•herbertl•41m ago•0 comments

Chen-Ning (Frank) Yang 1922-2025

https://www.math.columbia.edu/~woit/wordpress/?p=15320
2•chmaynard•43m ago•1 comment

Show HN: Why system modeling should look like code, not PowerPoint

3•twopowerX•44m ago•0 comments

OpenAI's 'Embarrassing' Math

https://techcrunch.com/2025/10/19/openais-embarrassing-math/
4•salkahfi•48m ago•1 comment

Investors should be wary of this year's frenzy for crypto treasuries

https://www.businessinsider.com/digital-asset-treasury-bubble-crypto-bitcoin-bnb-solana-ethereum-...
3•zerosizedweasle•55m ago•0 comments

Jensen Huang says Nvidia went from 95% market share in China to 0%

https://fortune.com/2025/10/19/jensen-huang-nvidia-china-market-share-ai-chips-trump-trade-war/
4•zerosizedweasle•56m ago•0 comments

Forth: The programming language that writes itself

https://ratfactor.com/forth/the_programming_language_that_writes_itself.html
5•suioir•57m ago•0 comments

Networking for Spies: Translating a Cyrillic Text with Claude Code

https://austinpatrick.substack.com/p/networking-for-spies-translating
2•AustinLikesAI•1h ago•0 comments

RSS is still pretty great (2024)

https://www.pcloadletter.dev/blog/rss/
4•ronbenton•1h ago•1 comment

Ask HN: Best way to make a documentation website for an open-source project?

3•mudge•1h ago•1 comment

Generalized K-Means Clustering for Apache Spark with Bregman Divergences

https://github.com/derrickburns/generalized-kmeans-clustering
2•derrickrburns•2h ago

Comments

derrickrburns•2h ago

I've built a production-ready K-Means library for Apache Spark that supports multiple distance functions beyond Euclidean.

*Why use this instead of Spark MLlib?*

MLlib's KMeans is hard-coded to Euclidean distance, which is mathematically wrong for many data types:

- *Probability distributions* (topic models, histograms): KL divergence is the natural metric. Euclidean treats [0.5, 0.3, 0.2] and [0.49, 0.31, 0.2] as similar even though they represent different distributions.
- *Audio/spectral data*: Itakura-Saito respects multiplicative power spectra. Euclidean incorrectly treats -20dB and -10dB as closer than -10dB and 0dB.
- *Count data* (traffic, sales): Generalized-I divergence for Poisson-distributed data.
- *Outlier robustness*: L1/Manhattan gives median-based clustering vs mean-based (L2).

Using the wrong divergence yields mathematically valid but semantically meaningless clusters.
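To make this concrete, here is a minimal standalone sketch in plain Scala (not this library's API; the object and helper names are mine) comparing squared Euclidean distance and KL divergence on small probability vectors:

```scala
// Minimal sketch, plain Scala (not this library's API): comparing squared
// Euclidean distance with KL divergence on probability vectors.
object DivergenceDemo extends App {
  def sqEuclidean(p: Array[Double], q: Array[Double]): Double =
    p.zip(q).map { case (a, b) => (a - b) * (a - b) }.sum

  // KL(p || q) = sum_i p_i * log(p_i / q_i); assumes strictly positive entries
  def kl(p: Array[Double], q: Array[Double]): Double =
    p.zip(q).map { case (a, b) => a * math.log(a / b) }.sum

  val p = Array(0.5, 0.3, 0.2)
  val q = Array(0.49, 0.31, 0.2) // nearly identical distribution
  val r = Array(0.2, 0.3, 0.5)   // same support, mass reversed

  println(f"sqEuclidean(p,q)=${sqEuclidean(p, q)}%.4f  kl(p,q)=${kl(p, q)}%.4f")
  println(f"sqEuclidean(p,r)=${sqEuclidean(p, r)}%.4f  kl(p,r)=${kl(p, r)}%.4f")
}
```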

*Available divergences:* KL, Itakura-Saito, L1/Manhattan, Generalized-I, Logistic Loss, Squared Euclidean
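For background (my own summary, not text from the repo): apart from L1/Manhattan, these all arise as Bregman divergences, where each strictly convex generator φ defines its own notion of distance:

```latex
% Bregman divergence generated by a strictly convex \phi
D_\phi(x, y) = \phi(x) - \phi(y) - \langle \nabla \phi(y),\, x - y \rangle

% \phi(x) = \lVert x \rVert^2      =>  squared Euclidean distance
% \phi(x) = \sum_i x_i \log x_i    =>  KL / generalized I-divergence
% \phi(x) = -\sum_i \log x_i       =>  Itakura-Saito divergence
```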

*What's included:*

- 6 algorithms: GeneralizedKMeans, BisectingKMeans, XMeans (auto k), SoftKMeans (fuzzy), StreamingKMeans, KMedoids
- Drop-in MLlib replacement (same DataFrame API)
- 740 tests, deterministic behavior, cross-version persistence (Spark 3.4↔3.5, Scala 2.12↔2.13)
- Automatic optimization (broadcast vs crossJoin based on k×dim to avoid OOM; see the sketch below)
- Python and Scala APIs
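The "automatic optimization" bullet refers to choosing an assignment strategy by problem size. The snippet below is only a hypothetical illustration of that kind of size-based heuristic, not the library's internals, and the threshold is made up:

```scala
// Hypothetical illustration only (not this library's code): choose how to
// assign points to centers based on k and the vector dimension. Broadcasting
// the k x dim center matrix is cheap when it fits on executors; otherwise a
// join-based plan avoids OOM. The element threshold here is arbitrary.
def chooseAssignmentStrategy(k: Int, dim: Int,
                             maxBroadcastElems: Long = 16L * 1024 * 1024): String =
  if (k.toLong * dim <= maxBroadcastElems) "broadcast" else "crossJoin"
```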

*Example:*

```scala
// Clustering topic distributions from LDA
val topics: DataFrame = ??? // probability vectors

// WRONG: MLlib with Euclidean
new org.apache.spark.ml.clustering.KMeans()
  .setK(10)
  .fit(topics)

// CORRECT: KL divergence for probabilities
new GeneralizedKMeans()
  .setK(10)
  .setDivergence("kl")
  .fit(topics)

// For standard data, drop-in replacement:
new GeneralizedKMeans()
  .setDivergence("squaredEuclidean")
  .fit(numericData)
```

*Quick comparison:*

| Use Case | MLlib | This Library |
|----------|-------|--------------|
| General numeric | L2 | L2 (compatible) |
| Probability distributions | Wrong | KL divergence |
| Outlier-robust | | L1 or KMedoids |
| Auto k selection | | XMeans (BIC/AIC) |
| Fuzzy clustering | | SoftKMeans |

*Performance:* ~870 pts/sec (squared Euclidean), ~3,400 pts/sec (KL) on modest hardware. Scales to billions of points with automatic strategy selection.

*Production-ready:*

- Cross-version model persistence
- Scalability guardrails (chunked assignment)
- Determinism tests (same seed → identical results; see the sketch below)
- Performance regression detection
- Executable documentation
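As a usage note on the determinism bullet, a check along these lines should hold. The setSeed and clusterCenters names are assumptions carried over from the MLlib-style API shown earlier; I have not verified them against this library:

```scala
// Sketch of a determinism check (setSeed/clusterCenters assumed from the
// MLlib-style "drop-in" API; unverified). `topics` is the DataFrame from the
// example above.
val m1 = new GeneralizedKMeans().setK(10).setDivergence("kl").setSeed(42).fit(topics)
val m2 = new GeneralizedKMeans().setK(10).setDivergence("kl").setSeed(42).fit(topics)
assert(m1.clusterCenters.sameElements(m2.clusterCenters)) // same seed -> identical centers
```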

GitHub: https://github.com/derrickburns/generalized-kmeans-clustering

This started as an experiment to understand Bregman divergences. Surprisingly, KL divergence is often faster than Euclidean for probability data. Open to feedback!