So I rebuilt it. Changeflow extracts semantic changes and summarizes them in plain English:
- "FDA posted new adaptive trial guidance (Jan 15)" - "Competitor raised enterprise pricing 12%" - "9th Circuit issued opinion on arbitration agreements"
Instead of "47 pixels changed in the header region."
THE HARD TECHNICAL PROBLEMS
Scraping any URL (not just specific sites)
Unlike scrapers built for one specific site like Amazon or LinkedIn, we have to take any URL a user gives us and make it work. Our approach:
Delayed-attach pattern: launch Chrome, let the page load naturally, poll the DevTools /json endpoint until title and URL stabilize, and only THEN attach Puppeteer. Bot-detection scripts run against a clean browser.
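A minimal sketch of that flow, assuming Chrome was started separately with --remote-debugging-port=9222 (poll interval, stability threshold, and helper names are illustrative, not our production values):

```ts
import puppeteer from 'puppeteer-core';

const DEBUG_PORT = 9222;

// One DevTools target as reported by Chrome's /json endpoint.
interface DevToolsTarget {
  title: string;
  url: string;
  type: string;
}

async function listTargets(): Promise<DevToolsTarget[]> {
  const res = await fetch(`http://127.0.0.1:${DEBUG_PORT}/json`);
  return (await res.json()) as DevToolsTarget[];
}

// Poll /json until the page's title + URL stop changing for a few polls in a row.
async function waitForStablePage(pollMs = 1000, stablePollsNeeded = 3): Promise<void> {
  let last = '';
  let stable = 0;
  while (stable < stablePollsNeeded) {
    const targets = await listTargets();
    const page = targets.find(t => t.type === 'page' && t.url.startsWith('http'));
    const signature = page ? `${page.title}|${page.url}` : '';
    stable = signature !== '' && signature === last ? stable + 1 : 0;
    last = signature;
    await new Promise(resolve => setTimeout(resolve, pollMs));
  }
}

// Attach only after the page has settled, so any bot-detection scripts
// already ran against a plain, un-instrumented browser.
async function attachAfterLoad() {
  await waitForStablePage();
  const browser = await puppeteer.connect({ browserURL: `http://127.0.0.1:${DEBUG_PORT}` });
  const [page] = await browser.pages();
  return { browser, page };
}
```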
Three-tier fallback: Linux + datacenter proxy (handles ~90% of sites) -> Linux + mobile proxy (9%) -> macOS on real hardware (1%). We cache the successful route per URL, so the expensive path rarely fires.
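Roughly, the routing is just a cached tier ladder (a sketch; the tier names, the in-memory cache, and fetchVia are stand-ins, not our actual code):

```ts
// Tiers in order of increasing cost; names are stand-ins for real pool configs.
type Tier = 'linux-datacenter' | 'linux-mobile' | 'macos-hardware';
const TIERS: Tier[] = ['linux-datacenter', 'linux-mobile', 'macos-hardware'];

// Last tier that worked for a given URL (persisted in production).
const routeCache = new Map<string, Tier>();

async function fetchWithFallback(
  url: string,
  fetchVia: (url: string, tier: Tier) => Promise<string>, // stand-in for the real fetcher
): Promise<string> {
  const cached = routeCache.get(url);
  // Start at the cached tier so the expensive macOS path only fires when cheaper tiers fail.
  const ladder = cached ? TIERS.slice(TIERS.indexOf(cached)) : TIERS;

  for (const tier of ladder) {
    try {
      const html = await fetchVia(url, tier);
      routeCache.set(url, tier);
      return html;
    } catch {
      // Fall through to the next, more expensive tier.
    }
  }
  throw new Error(`All tiers failed for ${url}`);
}
```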
We run real Chrome, not Chrome for Testing (its fingerprint is detectable). On real Mac hardware we disable GPU spoofing entirely: genuine beats fake.
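For the macOS tier that just means launching the user-installed Chrome binary ahead of the delayed attach above (the paths, flags, and profile dir here are illustrative assumptions):

```ts
import { spawn } from 'node:child_process';

// Path to the user-installed ("real") Chrome, not the Chrome for Testing
// build that Puppeteer downloads by default.
const REAL_CHROME =
  process.platform === 'darwin'
    ? '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome'
    : '/usr/bin/google-chrome';

function launchRealChrome(url: string) {
  // No GPU/WebGL spoofing flags on real Mac hardware: the genuine fingerprint
  // is the whole point. The profile dir is a placeholder.
  const child = spawn(
    REAL_CHROME,
    ['--remote-debugging-port=9222', '--user-data-dir=/tmp/changeflow-profile', url],
    { stdio: 'ignore', detached: true },
  );
  child.unref(); // let Chrome outlive this process if needed
  return child;
}
```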
LLM costs at scale
Running AI on every fetch gets expensive. We cut costs 90%:
Strip nav/sidebars/footers before the AI call (~60% token reduction). Model tiering: Llama 3.1 8B via Groq for extraction, Gemini Flash Lite for summaries, Claude only when quality matters.
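The stripping step looks roughly like this (cheerio is our choice of illustration here, and the selector list is an assumption, not the production one):

```ts
import * as cheerio from 'cheerio';

// Page chrome that adds tokens without adding meaning (illustrative list).
const BOILERPLATE_SELECTORS = [
  'nav', 'header', 'footer', 'aside',
  'script', 'style', 'noscript',
  '[role="navigation"]', '[class*="sidebar"]', '[class*="cookie"]',
];

function extractMainText(html: string): string {
  const $ = cheerio.load(html);
  BOILERPLATE_SELECTORS.forEach(sel => $(sel).remove());
  // Collapse whitespace so the model isn't billed for formatting noise.
  return $('body').text().replace(/\s+/g, ' ').trim();
}
```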
Gemini cache trick: system prompts of 1024+ tokens get a 90% discount on repeat calls, so verbose prompts are actually cheaper.
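In practice this is just prompt layout: keep the long, stable instructions in the system message and put only the changing content in the user message, so the cached prefix gets reused across calls. A sketch against OpenRouter's OpenAI-compatible endpoint (the model id and prompt text are illustrative):

```ts
// Long, stable instructions stay identical across calls so the prefix can be cached.
const SYSTEM_PROMPT = `You summarize semantic changes between two versions of a web page.
...several hundred more tokens of stable instructions and examples, kept above the
provider's minimum cacheable prefix length...`;

async function summarizeDiff(diff: string): Promise<string> {
  const res = await fetch('https://openrouter.ai/api/v1/chat/completions', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      model: 'google/gemini-2.0-flash-lite-001', // illustrative model id
      messages: [
        { role: 'system', content: SYSTEM_PROMPT }, // stable -> cacheable
        { role: 'user', content: diff },            // varies per fetch
      ],
    }),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}
```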
Diffing beyond git diff
Git diff isn't enough. We add MD5 hashes to list items for move detection, use Levenshtein distance to distinguish edits from replacements, and clean temporal noise ("2 days ago") that creates false positives.
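A simplified sketch of those three ideas together (the 0.3 edit threshold, the temporal-noise regex, and the classification names are assumptions, not our exact values):

```ts
import { createHash } from 'node:crypto';

const md5 = (s: string) => createHash('md5').update(s).digest('hex');

// Strip relative timestamps that would otherwise register as a "change" on every fetch.
const stripTemporalNoise = (s: string) =>
  s.replace(/\b\d+\s+(minutes?|hours?|days?)\s+ago\b/gi, '').trim();

// Plain Levenshtein distance (O(n*m)); fine for short list items.
function levenshtein(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)),
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,
        dp[i][j - 1] + 1,
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1),
      );
    }
  }
  return dp[a.length][b.length];
}

// Classify what happened to a list item between two snapshots.
function classify(oldItem: string, newItem: string): 'moved-or-unchanged' | 'edited' | 'replaced' {
  const [o, n] = [stripTemporalNoise(oldItem), stripTemporalNoise(newItem)];
  if (md5(o) === md5(n)) return 'moved-or-unchanged'; // same content, possibly reordered
  const distance = levenshtein(o, n);
  // Small distance relative to length -> an edit; large -> a genuinely new item.
  return distance / Math.max(o.length, n.length) < 0.3 ? 'edited' : 'replaced';
}
```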
STACK
Rails + Postgres, Faktory workers, Node.js browser pool, Claude/Gemini/Llama via OpenRouter, proxies from GridPanel and SquidProxies.
Happy to answer questions about the scraping, AI, or 10 years of lessons in this space.