Cost for a 50-item research pipeline: $0.15-0.40 vs $8-15 all-cloud. Same output quality where it matters.
Stack: RTX 5080 laptop, Ollama in Docker with GPU passthrough, PostgreSQL, Redis, Claude API for the final 20%.
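For the Docker side, this is roughly what the Ollama service looks like. A sketch only, assuming Docker Compose with the NVIDIA Container Toolkit installed on the host; service names and the volume are my own, not from the actual setup:

```yaml
# Minimal compose sketch: Ollama with NVIDIA GPU passthrough.
# Assumes the NVIDIA Container Toolkit is installed on the host.
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama        # persist pulled models across restarts
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
volumes:
  ollama:
```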
The pattern: scan locally → score locally → deduplicate locally → synthesize via cloud. Four stages, three are free.
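The routing above can be sketched in a few lines. All function bodies here are hypothetical stand-ins (the real stages would call the local model / embeddings / the Claude API); the point is the shape: three local stages shrink the item set before the single paid call.

```python
# Sketch of the four-stage pipeline. Stage bodies are illustrative
# stubs, not the real implementation.

def scan(items):
    # Stage 1 (local): cheap relevance filter -- in the real pipeline,
    # a small-model yes/no prompt per item; stubbed as a field check.
    return [i for i in items if i.get("relevant", True)]

def score(items):
    # Stage 2 (local): rank items, e.g. by a 0-10 score from the local
    # model; stubbed as sorting on a precomputed field.
    return sorted(items, key=lambda i: i.get("score", 0), reverse=True)

def dedupe(items):
    # Stage 3 (local): drop near-duplicates, e.g. by embedding cosine
    # similarity; stubbed as exact-title dedupe.
    seen, out = set(), []
    for i in items:
        key = i.get("title", "").lower()
        if key not in seen:
            seen.add(key)
            out.append(i)
    return out

def synthesize(items):
    # Stage 4 (cloud): the only paid call -- the survivors go to the
    # Claude API in one batch for the final write-up.
    return {"summary": f"{len(items)} items sent to cloud synthesis"}

def pipeline(items):
    return synthesize(dedupe(score(scan(items))))
```

Because synthesis sees only the deduplicated survivors, the expensive tokens scale with the output you actually want, not with the 50 raw items.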
Gotchas I hit: Qwen3 leaking thinking tokens into output through /api/generate (use /api/chat instead, which keeps them separate), Docker publishing the port on IPv4 only while Windows resolves localhost to IPv6 (connect to 127.0.0.1 explicitly), and GPU memory ceilings on consumer cards.
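Two of those fixes in one place: calling Ollama via /api/chat (which returns the answer in message.content, with any reasoning in a separate field instead of inline) and pinning 127.0.0.1 to sidestep the localhost-to-IPv6 mismatch. The payload shape follows Ollama's chat API; the "think" flag is version-dependent, so treat it as an assumption and check your server.

```python
import json
import urllib.request

# Use 127.0.0.1, not "localhost": Docker publishes the port on IPv4,
# while Windows may resolve localhost to ::1 and the connection stalls.
OLLAMA_URL = "http://127.0.0.1:11434/api/chat"

def build_chat_request(prompt, model="qwen3:8b"):
    # /api/chat keeps Qwen3's reasoning out of the content field;
    # newer Ollama versions also accept "think": False to suppress
    # thinking entirely (version-dependent).
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "think": False,
    }

def chat(prompt):
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # The answer lives in message.content; reasoning, if enabled,
    # arrives separately as message.thinking.
    return body["message"]["content"]
```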
Happy to share architecture details in comments.
solomatov•1h ago
Raywob•1h ago
For synthesis and judgment — no, it's not close. That's exactly why I route those stages to Claude. When you need the model to generate novel connections or strategic recommendations, the quality gap between 8B and frontier is real.
The key insight is that most pipeline stages don't need synthesis. They need pattern matching. And that's where the 95% cost savings live.
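Concretely, "pattern matching, not synthesis" means constraining the local model's job until an 8B can't really get it wrong. A sketch of what I mean (prompt wording and the parse helper are illustrative, not from the actual pipeline): the model only has to emit one number, and the parser tolerates sloppy output.

```python
import re

# Illustrative scoring prompt: the local model is asked for a single
# digit, so frontier-level judgment isn't required.
SCORING_PROMPT = (
    "Rate the relevance of the following item to the research topic "
    "on a scale of 0-10. Reply with ONLY the number.\n\nItem: {item}"
)

def parse_score(reply, default=0):
    # Small models occasionally wrap the number in prose ("Score: 7").
    # Grab the first integer and clamp it to the 0-10 range; fall back
    # to a default when no number appears at all.
    m = re.search(r"\d+", reply)
    if not m:
        return default
    return max(0, min(10, int(m.group())))
```

A synthesis stage has no equivalent trick: there is no clamp for "generate novel connections", which is why that stage goes to the cloud.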