Open-source LLM cascading, up to 92% cost savings on benchmarks

https://github.com/lemony-ai/cascadeflow
12•saschabuehrle•1mo ago

Comments

saschabuehrle•1mo ago
Hey HN! I'm Sascha, a technical founder who started coding at 9 and spent the last 2 years obsessing over Small Language Models, specifically, how to squeeze every drop of performance from fast, cheap, domain-specific models before touching slow, expensive flagships.

What it does: cascadeflow is an optimization layer that sits between your app/agent and LLM providers, intelligently cascading queries between cheap and expensive models—so you stop paying Opus 4.5 prices for "What's 2+2?"

Why this matters: Most companies I've talked to are running all their AI traffic through flagship models. They're burning 40-70% of their budget on queries that a $0.15/M token model handles just fine, including reasoning tasks and tool calls. But building intelligent routing is genuinely hard. You need quality validation, confidence scoring, format checking, graceful escalation, and ideally domain understanding. Most teams don't have bandwidth to build this infrastructure.

Backstory: After working on projects with JetBrains and IBM on developer tools, I kept seeing the same pattern: teams scaling AI features or agents hit a cost wall. I initially started prototyping cascading just for my own projects. When I saw consistent 60-80% cost reductions without quality loss, I realized this needed to be a proper cost optimization framework.

How it works: Speculative execution with quality validation. We try the cheap or domain-specific model first (auto-detects 15 domains), validate response quality across multiple dimensions (length, confidence via logprobs, format, semantic alignment), and only escalate to expensive models when validation fails. Framework overhead: <2ms.
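The draft-validate-escalate flow described above can be sketched roughly as follows. This is a minimal illustration, not cascadeflow's actual API: `cascade`, `CascadeResult`, and the toy model and validator functions are all hypothetical names, and the validator is a placeholder for the multi-dimensional checks the post mentions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CascadeResult:
    text: str
    model: str       # which tier produced the final answer
    escalated: bool  # True if the cheap draft failed validation


def cascade(
    query: str,
    cheap_model: Callable[[str], str],
    expensive_model: Callable[[str], str],
    validate: Callable[[str, str], bool],
) -> CascadeResult:
    """Try the cheap model first; call the expensive one only if validation fails."""
    draft = cheap_model(query)
    if validate(query, draft):
        return CascadeResult(draft, "cheap", False)
    return CascadeResult(expensive_model(query), "expensive", True)


# Toy stand-ins so the sketch runs without any provider (all hypothetical).
def toy_cheap(query: str) -> str:
    return "4" if query == "What's 2+2?" else ""


def toy_expensive(query: str) -> str:
    return f"detailed answer to: {query}"


def toy_validate(query: str, draft: str) -> bool:
    # Placeholder: only checks that the draft is non-empty. A real validator
    # would also look at logprobs, format, and semantic alignment.
    return len(draft.strip()) > 0


easy = cascade("What's 2+2?", toy_cheap, toy_expensive, toy_validate)
hard = cascade("Prove Fermat's little theorem", toy_cheap, toy_expensive, toy_validate)
```

Note that an escalated query pays for both calls, so the savings depend on how often the cheap draft passes validation.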

First integrations: n8n and LangChain. Both connect any two AI Chat Model nodes (cheap drafter + powerful verifier) with domain-aware routing across code, medical, legal, finance, and 11 more domains. Mix Ollama locally with GPT-5 for verification. In n8n, you can watch cascade decisions live in the Logs tab.

Benchmarks: 69% savings on MT-Bench, 93% on GSM8K, 52% on MMLU—retaining 96% of GPT-5 quality. All reproducible in `/tests/benchmarks`.

What makes it different:

- Understands 15 domains out of the box (auto-detection, domain-specific quality validation, domain-aware routing)
- User-tier and budget-based cascading with configurable model pipelines
- Learns and optimizes from your usage patterns
- Auto-benchmarks against your available models
- Works with YOUR models across 7+ providers (no infrastructure lock-in)
- Python + TypeScript with identical APIs
- Optional ML-based semantic validation (~80MB model, CPU-only)
- Production-ready: streaming, batch processing, tool calling, multi-step reasoning, cost tracking with optional OpenTelemetry export

n8n package: `npm install @cascadeflow/n8n-nodes-cascadeflow`

Would love technical feedback, especially from anyone running AI at scale who's solved routing differently, or n8n power users who can stress-test the integration. What's broken? What's missing?

SamAlarco•1mo ago
this is very cool.
honeydew•1mo ago
The benchmark numbers look strong but MT-Bench/GSM8K are pretty narrow. Have you tested on more open-ended tasks?
saschabuehrle•1mo ago
For open-ended tasks we use embedding similarity + confidence scoring, not just format matching. If the draft response is semantically thin, it escalates. The system also learns from your actual traffic patterns: after a few hundred queries, it knows which query shapes work on which models for your specific use case.
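One way to picture the "embedding similarity + confidence" check is the sketch below. It assumes the similarity is measured between a query embedding and a draft embedding, which the comment does not specify; the function names and both thresholds are hypothetical, and the vectors would come from whatever embedding model is in use.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def should_escalate(
    query_vec: Sequence[float],
    draft_vec: Sequence[float],
    confidence: float,
    min_similarity: float = 0.5,  # hypothetical threshold
    min_confidence: float = 0.7,  # hypothetical threshold
) -> bool:
    """Escalate if the draft is semantically thin OR the model was unsure."""
    thin = cosine_similarity(query_vec, draft_vec) < min_similarity
    unsure = confidence < min_confidence
    return thin or unsure
```

Either signal alone is enough to escalate here, which matches the "when in doubt, escalate" stance taken elsewhere in the thread.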
aregnzsdejan•1mo ago
Very real problem, and the focus on validation (not just routing) is the right direction.

How do you handle cases where validation is uncertain or the domain detector is wrong: do you default conservatively, and what false-negative rates are you seeing?

saschabuehrle•1mo ago
Yes, we default conservatively: when in doubt, escalate. A few specifics:

Uncertain validation: We combine multiple signals (confidence scores, semantic similarity, format checks...). If any signal is borderline, we escalate. Better to overpay occasionally than return a bad response.

Wrong domain detection: The domain classifier isn't a gate; it only selects which validator to apply. If that validator then fails, the query escalates regardless. So a misclassified query still gets caught at the validation layer.

False-negative rates (good responses wrongly escalated): ~7-10% at the beginning, depending on domain. We're okay with this: it means slightly higher cost but never compromised quality. The self-learning engine tightens this over time as it sees your actual traffic patterns.
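The behavior described in this reply (the domain picks a validator, and any borderline signal escalates) might look roughly like this. It is a sketch under stated assumptions: the `VALIDATORS` table, the threshold values, and `BORDERLINE_MARGIN` are all made up for illustration and are not taken from cascadeflow.

```python
from typing import Dict

Signals = Dict[str, float]  # signal name -> score in [0, 1]

# Hypothetical per-domain minimum scores. The detected domain only decides
# which set of checks applies; it never accepts a draft on its own.
VALIDATORS: Dict[str, Signals] = {
    "code": {"confidence": 0.8, "format": 0.9},
    "general": {"confidence": 0.7, "similarity": 0.6},
}

# Scores within this margin above a threshold count as borderline (assumed value).
BORDERLINE_MARGIN = 0.05


def accept_draft(domain: str, signals: Signals) -> bool:
    """Return True only if every required signal clears its threshold with margin."""
    thresholds = VALIDATORS.get(domain, VALIDATORS["general"])
    for name, minimum in thresholds.items():
        score = signals.get(name, 0.0)  # a missing signal counts as failing
        if score < minimum + BORDERLINE_MARGIN:
            return False  # failing or borderline: escalate
    return True
```

In this sketch a misclassified query is still checked by some validator, and a draft that lacks the signals that validator expects is escalated rather than accepted.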

samnji•1mo ago
Nice. Routing is the hard part. Do you have numbers on false accepts vs false escalations? (i.e., how often you keep a bad cheap answer vs unnecessarily jump to the expensive model). Benchmarks are good, but those two rates are what will make or break it in prod.
saschabuehrle•1mo ago
Good question. These are the two metrics we obsess over:

False accepts (bad response passed as good): <1% on benchmarks, ~2-3% in production pilots. This is the one that matters, so we tune aggressively to keep it low. Every validator errs on the side of escalation.

False escalations (good response unnecessarily escalated): ~7-10% depending on domain. Costs you tokens, but doesn't hurt quality. The self-learning engine reduces this over time as it learns your traffic patterns.

The tradeoff is intentional: we'd rather waste some spend than serve bad answers. In practice, even with the conservative tuning, customers still see 30-60% cost reduction because the baseline.
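For anyone who wants to track these two rates on their own traffic, here is a minimal sketch. It assumes each logged query has an offline label saying whether the cheap draft was actually acceptable, and it reports both rates as a share of all queries; the thread does not say which denominator is used, so that choice is an assumption. `Outcome` and `routing_rates` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Outcome:
    accepted: bool    # the validator kept the cheap draft
    draft_good: bool  # offline label: the cheap draft was actually acceptable


def routing_rates(outcomes: Iterable[Outcome]) -> Dict[str, float]:
    """False-accept and false-escalation rates as a share of all queries."""
    items = list(outcomes)
    total = len(items)
    if total == 0:
        return {"false_accept": 0.0, "false_escalation": 0.0}
    false_accepts = sum(1 for o in items if o.accepted and not o.draft_good)
    false_escalations = sum(1 for o in items if not o.accepted and o.draft_good)
    return {
        "false_accept": false_accepts / total,
        "false_escalation": false_escalations / total,
    }


# Made-up sample of 10 labeled queries, purely for illustration.
sample = (
    [Outcome(accepted=True, draft_good=True)] * 6
    + [Outcome(accepted=True, draft_good=False)] * 1
    + [Outcome(accepted=False, draft_good=True)] * 1
    + [Outcome(accepted=False, draft_good=False)] * 2
)
rates = routing_rates(sample)
```

A false accept costs quality and a false escalation only costs tokens, so the two are worth tracking separately rather than as a single accuracy number.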

Satyam2000•1mo ago
This is amazing and absolutely in the right direction. How do you decide which queries are routed to less expensive models?

92% is super impressive, and as with any impressive number, you have to try to understand what is behind it. Do the cost savings come mostly from routing easy queries, or from heavier workloads?

Also, you mention 7-10% false-negative cases. Is this where your validator disagrees with the expensive flagship model? Are there cases where the flagship model is giving worse answers?
