It uses Snowflake’s Arctic model for embeddings and HNSW for fast similarity search. Each “story cluster” shows who published first, how fast it propagated, and how the narrative evolved as more outlets picked it up.
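The clustering step described here could be sketched very roughly like this: a plain cosine-similarity threshold over precomputed embeddings, standing in for the actual Arctic + HNSW stack (the threshold value, the greedy assignment, and the function names are all illustrative assumptions, not the author's pipeline):

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def cluster_articles(embeddings, threshold=0.85):
    """Greedy single-pass clustering: assign each article to the first
    cluster whose representative embedding is within `threshold` cosine
    similarity, else start a new cluster. An HNSW index would replace
    the linear scan over representatives at scale."""
    clusters = []  # each cluster: list of article indices
    reps = []      # representative embedding per cluster
    for i, emb in enumerate(embeddings):
        for c, rep in enumerate(reps):
            if cosine(emb, rep) >= threshold:
                clusters[c].append(i)
                break
        else:
            clusters.append([i])
            reps.append(emb)
    return clusters
```

The linear scan is O(clusters) per article; that is exactly the lookup HNSW turns into an approximate O(log n) nearest-neighbor query.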
Would love feedback on the architecture, scaling approach, and any ways to make the clusters more accurate or useful.
Live demo: https://yandori.io/news-flow/
masterphai•2mo ago
A trick that helped in a similar system I built was doing a second-pass “temporal coherence” check: if two articles are close in embedding space but far apart in publish time or share no common entities, keep them in adjacent clusters rather than forcing a merge. It reduced false positives significantly.
Also curious how you handle deduping syndicated content - AP/Reuters can dominate the embedding space unless you weight publisher identity or canonical URLs.
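One plausible way to handle that dedupe (a sketch under assumptions: the `canonical_url`/`text` fields and the 500-character lead-paragraph fallback are invented for illustration):

```python
import hashlib
from urllib.parse import urlsplit

def dedupe_key(article):
    """Prefer the canonical URL (host + path, ignoring scheme and query);
    fall back to a hash of the normalized lead text so AP/Reuters copies
    syndicated without a canonical tag still collapse to one entry."""
    canon = article.get("canonical_url")
    if canon:
        parts = urlsplit(canon)
        return parts.netloc + parts.path
    lead = " ".join(article["text"].split())[:500].lower()
    return hashlib.sha1(lead.encode()).hexdigest()

def dedupe(articles):
    # Keep the first article seen per key.
    seen = {}
    for art in articles:
        seen.setdefault(dedupe_key(art), art)
    return list(seen.values())
```

Collapsing wire copies before embedding (rather than after clustering) is what keeps them from dominating the embedding space.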
Overall, really nice work. The propagation timeline is especially useful.
nextaccountic•2mo ago
Maybe the author uses LLMs in some comments and not others. That is, it's not a bot, just someone manually using LLM tools sometimes.
wcallahan•2mo ago
‘masterphai’ is evidence of how effective a good LLM and a better prompt can now be at evading detection of AI authorship… but there’s no way this author’s comments are written by a sane human.
From the comment history, it appears to have tricked quite a few humans to date. Interesting!