I kept rebuilding the same arXiv scraper at the start of every ML project. After the third time I wrote a dedup pipeline, I automated the whole thing.
dangerlego5•1h ago
The interesting part is that the pipeline is shared; if two people subscribe to the same topic, they share one crawl and one deduplicated record pool. Happy to talk through the pgvector dedup approach if anyone's curious.
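For anyone trying to picture the sharing, here is a minimal in-memory sketch of one way it could work. It is not the actual implementation: `TopicPipeline`, `Registry`, and the topic normalization are hypothetical names and choices made up for illustration, and a plain ID lookup stands in for the pgvector-based dedup, which the post does not describe.

```python
from dataclasses import dataclass, field


@dataclass
class TopicPipeline:
    """One crawl plus one deduplicated record pool for a topic (hypothetical)."""
    topic: str
    subscribers: set[str] = field(default_factory=set)
    records: dict[str, str] = field(default_factory=dict)  # record id -> title

    def ingest(self, record_id: str, title: str) -> bool:
        """Add a crawled record; return False if the pool already has it."""
        # Assumption: exact-ID matching stands in for similarity-based dedup.
        if record_id in self.records:
            return False
        self.records[record_id] = title
        return True


class Registry:
    """Maps each normalized topic to its single shared pipeline (hypothetical)."""

    def __init__(self) -> None:
        self._pipelines: dict[str, TopicPipeline] = {}

    def subscribe(self, user: str, topic: str) -> TopicPipeline:
        # Assumption: topics are equal if they match after trimming and lowercasing.
        key = topic.strip().lower()
        pipeline = self._pipelines.setdefault(key, TopicPipeline(key))
        pipeline.subscribers.add(user)
        return pipeline


registry = Registry()
first = registry.subscribe("alice", "Diffusion Models")
second = registry.subscribe("bob", " diffusion models ")
# Both subscribers hold the same pipeline object, so a record ingested
# once is visible to both and is rejected on the second attempt.
first.ingest("2401.00001", "Example paper")
second.ingest("2401.00001", "Example paper")
```

The point of keying on the topic rather than the user is that crawl cost and dedup work scale with the number of distinct topics, not the number of subscribers.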