We have been working on how to track Y Combinator companies (and companies from other incubators, for that matter), and I wanted to share a few learnings:
Identifying them in the first place is easy: even a simple scraper can save them to Notion, Google Sheets, etc. Firecrawl or Exa are pretty good for that.
The hard part is keeping the list current and "actionable":
you need a way to cluster startups; most VCs (and their theses) are sector-driven, so a sector taxonomy is ideal. But classifying startups into sectors is extremely difficult; you cannot rely on official documentation, as it either does not exist or is too vague.
you cannot rely on the startup name as the unique identifier; it can change, and it might not be unique
startups pivot, so you need to keep your lists up-to-date
you need to keep your database up-to-date; you can't rely on news, as information is scattered all over the place, and you can't rely on generic alerts. You need signals like "two new use cases on website"
What we have seen work:
use two types of taxonomies: a broad one (NACE codes are a great way to start) and an AI-based one (use a simple classifier based on the company website)
use the domain as the unique identifier instead of the name
monitor live signals; that's usually the hardest part. The simplest (but also most expensive) way is to capture certain pages (about us, careers, etc.) regularly and use AI to get a "diff"
usually Google Sheets is enough to start, but move quickly to a more stable database like Notion or Airtable (CRMs work too, but tend to be too overloaded)
use N8N to glue all that together with a few simple prompts
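To make the "domain as unique identifier" point concrete, here is a minimal Python sketch. The helper names (domain_key, upsert) and the in-memory dict are made up for illustration and are not what we run in production. It also assumes that stripping "www." is enough; it does not collapse other subdomains or handle companies that move to a new domain.

```python
from urllib.parse import urlsplit


def domain_key(url: str) -> str:
    """Reduce a website URL to a lowercase host that can serve as a record key."""
    url = url.strip()
    # urlsplit only fills in the host when a scheme is present,
    # so add one for bare inputs like "example.com/jobs".
    if "://" not in url:
        url = "https://" + url
    host = urlsplit(url).hostname or ""  # hostname is already lowercased
    # Assumption: www.example.com and example.com are the same company.
    if host.startswith("www."):
        host = host[4:]
    return host.rstrip(".")


def upsert(db: dict, name: str, url: str) -> None:
    """Store a startup under its domain and remember every name seen for it."""
    record = db.setdefault(domain_key(url), {"names": []})
    if name not in record["names"]:
        record["names"].append(name)


if __name__ == "__main__":
    db = {}
    upsert(db, "Acme", "https://acme.example")
    upsert(db, "Acme AI", "http://www.acme.example/")  # renamed, same domain
    print(db)
```

A rename then shows up as a second name on the same record instead of a duplicate row, which is the main reason to key on the domain.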
If there's interest, I can also share a technical breakdown and N8N files to start.
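As a small taste of the monitoring step: one way to make the page "diff" cheaper is to compare snapshots locally first and only hand the changed lines to the AI. The sketch below is an assumption about how that could look, not our actual pipeline; the function names are hypothetical, and it expects you to have already extracted plain text from the captured page.

```python
import difflib
import hashlib
import re


def normalize(text: str) -> str:
    """Collapse whitespace and drop empty lines so cosmetic edits are ignored."""
    lines = (re.sub(r"\s+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


def fingerprint(text: str) -> str:
    """Hash of the normalized text; equal hashes mean nothing relevant changed."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def changed_lines(old: str, new: str) -> dict:
    """Return the lines added to and removed from a page between two snapshots."""
    old_lines = normalize(old).splitlines()
    new_lines = normalize(new).splitlines()
    added, removed = [], []
    for line in difflib.ndiff(old_lines, new_lines):
        if line.startswith("+ "):
            added.append(line[2:])
        elif line.startswith("- "):
            removed.append(line[2:])
        # Lines starting with "  " (unchanged) or "? " (hints) are skipped.
    return {"added": added, "removed": removed}


if __name__ == "__main__":
    last_week = "Use cases\nInvoicing\n\nCareers"
    today = "Use cases\nInvoicing\nPayroll\nExpense   tracking\nCareers"
    if fingerprint(last_week) != fingerprint(today):
        # Only this small delta would go into the prompt.
        print(changed_lines(last_week, today))
```

Storing just the fingerprint per page and date keeps the sheet small, and the AI call only happens when the fingerprint changes.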
leo_researchly•1h ago
PS: you can read our full breakdown (in German, however) on our blog: https://www.researchly.at/post/y-combinator-companies-finden...