It uses Snowflake’s Arctic model for embeddings and HNSW for fast similarity search. Each “story cluster” shows who published first, how fast it propagated, and how the narrative evolved as more outlets picked it up.
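The clustering step described here could be sketched very roughly like this: a plain cosine-similarity threshold over precomputed embeddings, standing in for the actual Arctic + HNSW stack (the threshold value, the greedy assignment, and the function names are all illustrative assumptions, not the author's pipeline):

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def cluster_articles(embeddings, threshold=0.85):
    """Greedy single-pass clustering: assign each article to the first
    cluster whose representative embedding is within `threshold` cosine
    similarity, else start a new cluster. An HNSW index would replace
    the linear scan over representatives at scale."""
    clusters = []  # each cluster: list of article indices
    reps = []      # representative embedding per cluster
    for i, emb in enumerate(embeddings):
        for c, rep in enumerate(reps):
            if cosine(emb, rep) >= threshold:
                clusters[c].append(i)
                break
        else:
            clusters.append([i])
            reps.append(emb)
    return clusters
```

The linear scan is O(clusters) per article; that is exactly the lookup HNSW turns into an approximate O(log n) nearest-neighbor query.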
Would love feedback on the architecture, scaling approach, and any ways to make the clusters more accurate or useful.
Live demo: https://yandori.io/news-flow/
masterphai•2mo ago
A trick that helped in a similar system I built was doing a second-pass “temporal coherence” check: if two articles are close in embedding space but far apart in publish time or share no common entities, keep them in adjacent clusters rather than forcing a merge. It reduced false positives significantly.
Also curious how you handle deduping syndicated content - AP/Reuters can dominate the embedding space unless you weight publisher identity or canonical URLs.
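One plausible way to handle that dedupe (a sketch under assumptions: the `canonical_url`/`text` fields and the 500-character lead-paragraph fallback are invented for illustration):

```python
import hashlib
from urllib.parse import urlsplit

def dedupe_key(article):
    """Prefer the canonical URL (host + path, ignoring scheme and query);
    fall back to a hash of the normalized lead text so AP/Reuters copies
    syndicated without a canonical tag still collapse to one entry."""
    canon = article.get("canonical_url")
    if canon:
        parts = urlsplit(canon)
        return parts.netloc + parts.path
    lead = " ".join(article["text"].split())[:500].lower()
    return hashlib.sha1(lead.encode()).hexdigest()

def dedupe(articles):
    # Keep the first article seen per key.
    seen = {}
    for art in articles:
        seen.setdefault(dedupe_key(art), art)
    return list(seen.values())
```

Collapsing wire copies before embedding (rather than after clustering) is what keeps them from dominating the embedding space.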
Overall, really nice work. The propagation timeline is especially useful.
nextaccountic•2mo ago
Maybe the author uses LLMs in some comments and not others. That is, it's not a bot, just someone manually using LLM tools sometimes.
wcallahan•2mo ago
‘masterphai’ is evidence of how effective a good LLM and a better prompt can now be at evading detection of AI authorship… but there’s no way this author’s comments are written by a sane human.
From the comment history, it appears to have tricked quite a few humans to date. Interesting!