Open-source LLM cascading, up to 92% cost savings on benchmarks

https://github.com/lemony-ai/cascadeflow
5•saschabuehrle•1h ago

Comments

saschabuehrle•1h ago
Hey HN! I'm Sascha, a technical founder who started coding at 9 and spent the last 2 years obsessing over Small Language Models: specifically, how to squeeze every drop of performance from fast, cheap, domain-specific models before touching slow, expensive flagships.

What it does: cascadeflow is an optimization layer that sits between your app/agent and LLM providers and cascades queries between cheap and expensive models, so you stop paying Opus 4.5 prices for "What's 2+2?"

Why this matters: Most companies I've talked to are running all their AI traffic through flagship models. They're burning 40-70% of their budget on queries that a $0.15/M token model handles just fine, including reasoning tasks and tool calls. But building intelligent routing is genuinely hard. You need quality validation, confidence scoring, format checking, graceful escalation, and ideally domain understanding. Most teams don't have bandwidth to build this infrastructure.

Backstory: After working on developer-tool projects with JetBrains and IBM, I kept seeing the same pattern: teams scaling AI features or agents hit a cost wall. I started prototyping cascading just for my own projects. When I saw consistent 60-80% cost reductions without quality loss, I realized this needed to be a proper cost optimization framework.

How it works: Speculative execution with quality validation. We try the cheap or domain-specific model first (auto-detects 15 domains), validate response quality across multiple dimensions (length, confidence via logprobs, format, semantic alignment), and only escalate to expensive models when validation fails. Framework overhead: <2ms.
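To make the flow above concrete, here is a minimal sketch of a draft-then-validate cascade. It is not cascadeflow's actual API: `cascade`, `passes_validation`, `mean_token_confidence`, the stub models, and the 0.7 threshold are all hypothetical names and values chosen for illustration, and only two of the validation dimensions (length and logprob-based confidence) are shown.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical interface: a model is any callable that returns
# (response_text, per_token_logprobs). Real providers differ.
Model = Callable[[str], Tuple[str, List[float]]]


@dataclass
class CascadeResult:
    text: str
    escalated: bool      # True if the expensive model produced the answer
    confidence: float    # confidence of the model whose answer was returned


def mean_token_confidence(logprobs: List[float]) -> float:
    """Geometric-mean token probability, i.e. exp(mean(logprobs))."""
    if not logprobs:
        return 0.0
    return math.exp(sum(logprobs) / len(logprobs))


def passes_validation(text: str, logprobs: List[float],
                      min_chars: int = 1, min_confidence: float = 0.7) -> bool:
    """Assumed checks: non-empty response and sufficient confidence."""
    long_enough = len(text.strip()) >= min_chars
    return long_enough and mean_token_confidence(logprobs) >= min_confidence


def cascade(query: str, drafter: Model, verifier: Model,
            min_confidence: float = 0.7) -> CascadeResult:
    """Try the cheap drafter first; escalate only if validation fails."""
    draft, draft_logprobs = drafter(query)
    if passes_validation(draft, draft_logprobs, min_confidence=min_confidence):
        return CascadeResult(draft, False, mean_token_confidence(draft_logprobs))
    answer, answer_logprobs = verifier(query)
    return CascadeResult(answer, True, mean_token_confidence(answer_logprobs))


# Stub models standing in for a cheap and a flagship provider.
def cheap_model(query: str) -> Tuple[str, List[float]]:
    if query == "What's 2+2?":
        return "4", [-0.01]              # high-confidence draft
    return "Not sure.", [-1.5, -2.0]     # low-confidence draft


def flagship_model(query: str) -> Tuple[str, List[float]]:
    return "A careful answer.", [-0.05, -0.05, -0.05]
```

With these stubs, `cascade("What's 2+2?", cheap_model, flagship_model)` keeps the cheap draft, while a query the drafter is unsure about falls through to the flagship. The expensive call only happens on the failure path, which is where the savings come from.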

First integrations: n8n and LangChain. Both connect any two AI Chat Model nodes (cheap drafter + powerful verifier) with domain-aware routing across code, medical, legal, finance, and 11 more domains. Mix Ollama locally with GPT-5 for verification. In n8n, you can watch cascade decisions live in the Logs tab.

Benchmarks: 69% savings on MT-Bench, 93% on GSM8K, 52% on MMLU, while retaining 96% of GPT-5 quality. All reproducible in `/tests/benchmarks`.

What makes it different:

- Understands 15 domains out of the box (auto-detection, domain-specific quality validation, domain-aware routing)
- User-tier and budget-based cascading with configurable model pipelines
- Learns and optimizes from your usage patterns
- Auto-benchmarks against your available models
- Works with YOUR models across 7+ providers (no infrastructure lock-in)
- Python + TypeScript with identical APIs
- Optional ML-based semantic validation (~80MB model, CPU-only)
- Production-ready: streaming, batch processing, tool calling, multi-step reasoning, cost tracking with optional OpenTelemetry export

n8n package: `npm install @cascadeflow/n8n-nodes-cascadeflow`

Would love technical feedback, especially from anyone running AI at scale who's solved routing differently, or n8n power users who can stress-test the integration. What's broken? What's missing?

honeydew•1h ago
The benchmark numbers look strong but MT-Bench/GSM8K are pretty narrow. Have you tested on more open-ended tasks?
saschabuehrle•57m ago
For open-ended tasks we use embedding similarity + confidence scoring, not just format matching. If the draft response is semantically thin, it escalates. The system also learns from your actual traffic patterns: after a few hundred queries, it knows which query shapes work on which models for your specific use case.
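A rough sketch of what a similarity-plus-confidence check could look like, under loud assumptions: `toy_embed` is a bag-of-words stand-in for a real embedding model, and `should_escalate` and both thresholds are hypothetical, not cascadeflow's implementation.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def should_escalate(query: str, draft: str, confidence: float,
                    min_similarity: float = 0.2,
                    min_confidence: float = 0.7) -> bool:
    """Escalate when the draft is off-topic/thin or the model was unsure."""
    similarity = cosine(toy_embed(query), toy_embed(draft))
    return similarity < min_similarity or confidence < min_confidence
```

Here a draft that shares no vocabulary with the query (for example "it depends") scores zero similarity and escalates even when the model reports high confidence. A real embedding model would catch paraphrases that this word-count stand-in misses.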
