Show HN: Forge – Guardrails take an 8B model from 53% to 99% on agentic tasks

https://github.com/antoinezambelli/forge

26•zambelli•7h ago

Hi HN, I'm Antoine Zambelli, AI Director at Texas Instruments.

I built Forge, an open-source reliability layer for self-hosted LLM tool-calling.

What it does:

- Adds domain-and-tool-agnostic guardrails (retry nudges, step enforcement, error recovery, VRAM-aware context management) to local models running on consumer hardware

- Takes an 8B model from ~53% to ~99% on multi-step agentic workflows without changing the model - just the system around it

- Ships with an eval harness and interactive dashboard so you can reproduce every number

I wanted to run a handful of always-on agentic systems for my portfolio, didn't want to pay cloud frontier costs, and immediately hit the compounding math problem on local models. 90% per-step accuracy sounds great, but with a 5-step workflow that's a 40% failure rate. No existing framework seemed to address this mechanical reliability issue - they all seemed tailor-made for cloud frontier.

Demo video: https://youtu.be/MzRgJoJAXGc (side-by-side: same model, same task, with and without Forge guardrails)

The paper (accepted to ACM CAIS '26, presenting May 26-29 in San Jose) covers the peer-reviewed findings across 97 model/backend configurations, 18 scenarios, 50 runs each. Key numbers:

- Ministral 8B with Forge: 99.3%. Claude Sonnet with Forge: 100%. The gap between a free local 8B model on a $600 GPU and a frontier API is less than 1 point.

- The same 8B local model with Forge (99.3%) outperforms Claude Sonnet without guardrails (87.2%) - an 8B model with framework support beats the best result you can get through frontier API alone.

- Error recovery scores 0% for every model tested - local and frontier - without the retry mechanism. Not a capability gap, an architectural absence.

I'm currently using this for my home assistant running on Ministral 14B-Reasoning, and for my locally hosted agentic coding harness (8B managed to contribute to the codebase!).

The guardrail stack has five layers, each independently toggleable. The two that carry the most weight (per ablation study with McNemar's test): retry nudges (24-49 point drops when disabled) and error recovery (~10 point drops, significant for every model tested). Step enforcement is situational - only fires for models with weaker sequencing discipline. Rescue parsing and context compaction showed no significance in the eval but are retained for production workloads where they activate once in a while.

One thing I really didn't expect: the serving backend matters. Same Mistral-Nemo 12B weights produce 7% accuracy on llama-server with native function calling and 83% on Llamafile in prompt mode. A 75-point swing from infrastructure alone. I don't think anyone's published this because standard benchmarks don't control for serving backend.

Another surprise: there's no distinction in current LLM tool-calling between "the tool ran successfully and returned data" and "the tool ran successfully but found nothing." Both return a value, the orchestrator marks the step complete, and bad data cascades downstream. It's the equivalent of HTTP having 200 but no 404. Forge adds this as a new exception class (ToolResolutionError) - the model sees the error and can retry instead of silently passing garbage forward.

Biggest technical challenge was context compaction for memory-constrained hardware. Both Ollama and Llamafile silently fall back to CPU when the model exceeds VRAM - no warning, no error, just 10-100x slower inference. Forge queries nvidia-smi at startup and derives a token budget to prevent this.

How to try it:

- Clone the repo, run the eval harness on a model I haven't tested. If you get interesting results I'll add them to the dashboard.

- Try the proxy server mode - point any OpenAI-compatible client at Forge and it handles guardrails transparently. It's the newest model and I'd love more eyes on it.

- Dogfooding led me to optimize model parameters in v0.6.0. The harder eval suite (26 scenarios) is designed to raise the ceiling so no one sits at 100%. Several that did on the original suite can't sweep it - including Opus 4.6. Curious if anyone finds scenarios that expose gaps I haven't thought of. Paper numbers based on pre v0.6.0 code.

Background: prior ML publication in unsupervised learning (83 citations). This paper accepted to ACM CAIS '26 - presenting May 26-29.

Repo: https://github.com/antoinezambelli/forge

Paper: https://www.caisconf.org/program/2026/demos/forge-agentic-re... https://github.com/antoinezambelli/forge/blob/main/docs/forg...

Dashboard: https://github.com/antoinezambelli/forge/docs/results/dashbo...

Comments

zambelli•1h ago

Happy to answer questions about the eval methodology, the backend findings, or anything in the repo. I'll be around.

fabian_shipamax•38m ago

dashboard link is dead

zambelli•33m ago

Does this work? https://github.com/antoinezambelli/forge/tree/main/docs/resu...

schaefer•13m ago

yes, that link works for me.

tommica•33m ago

What are "guardrails" in this context? Is it correctly understood that this would sit between my pi agent and llama-server, and it would do what exactly?

zambelli•30m ago

It would help ensure that the model executes its tool call correctly. So if you give Pi a task like booking travel... Pi decides to book a flight, hotel, car. It gets the flight in one go, but then sends "here is the payload : [Jason blob]" to hotel booking API and the whole thing throws an error and the workflow dies, with partial completion. Forge would catch the error and nudge the model by injecting a message into the conversation history, with a helpful error message "You replied with text, you must call a tool", the model reads it, and submits a tool call.

Big frontier models need this less than small models.

k__•31m ago

So, this basically ensures that models call the right tools with the correct format?

zambelli•29m ago

In a nutshell, yes. It tries to anyways, but at the end of the day, some models get stuck and you hit a max iterations error that forge will raise, with some context, and the consumer can choose what it wants to do at that point.

k__•27m ago

Ah, so it a "smart" retry mechanism?

zambelli•24m ago

I'd like to think so! ;). It has some brains, but the key insight was to send the model domain-agnostic nudges. I don't need to know what you're trying to do, the LLM already knows, I just need to nudge it back on the structural side: text response vs tool call, arg mismatch, etc. and let its knowledge of the context fill in the blanks (otherwise I'd need a massive library of every possible failure mode).

The other insight was doing it at tool call level and not workflow level, which addresses the compounding math problem more directly.

jf•23m ago

Tangentially related: Since you are at Texas Instruments, I wonder if you could find out what the status is of the intellectual property for the TI Explorer lisp machines. I know who owns the IP for Genera, but wasn’t able to find out about TI’s lisp OS

zambelli•18m ago

Very tangential! I'll try but it might take me a while.

xiaod•17m ago

I'd be curious about the eval methodology. In production coding tasks, the gap between benchmark scores and actual workflow integration can be significant. What does the error recovery loop look like?

zambelli•5m ago

Absolutely, benchmarks are a different breed. Forge's eval is deliberately scoped as a stress test of the recovery loop, not a measure of end-to-end agentic quality.

Scenarios range from basic 2-step workflows, to more complex ones with dead ends, breadcrumbs, misleading names.

Concrete example: Task: get, analyze and report on Q3 sales data.

Model emits: analyze_sales(quarter="Q3"). This skipped the fetch step. Forge's response validator catches it before the tool function runs. Instead of letting the bad call hit the real impl (which would error or hallucinate), forge replies on the canonical tool-result channel.

We send this to the model: tool_result: [PrereqError] analyze_sales requires fetch_sales_data to be called first. Available next steps: fetch_sales_data

Model emits a corrected fetch_sales_data(...) on the next turn.

Three enforcement paths use this same channel: prerequisite violations, premature terminal calls, unknown-tool retries.

We also have rescue parsing for known templates (Jason OpenAI style, XML like granite, etc) where we try to parse tool calls that might be malformed.

And lastly bare text response nudges. Small models love to chat, we need them to call tools!

dpweb•6m ago

Hello. Interesting project! Haven't gone through it yet, but want to consider using this in my CS master's capstone. While you have benchmarks I may create my own specific scenarios and comparisons vis-a-vis hosted inference to highlight specific economic benefit. Any suggestions?

Show HN: Gaussian Splat of a Strawberry

Show HN: Forge – Guardrails take an 8B model from 53% to 99% on agentic tasks

Show HN: Haystack – Review the PRs that need human attention

Show HN: Superlog (YC P26) – Observability that installs itself and fixes bugs

Show HN: I made a 3D pose maker for artists

Show HN: Yt-x v0.8.0 – Browse, play, and download YouTube from the terminal

Show HN: Logbox – let Claude monitor your dev logs

Show HN: Id-agent – Token efficient UUID alternative for AI agents

Show HN: Search 67K .AI domains by AI-extracted tags and descriptions

Show HN: Pg_deltax, Apache-licensed alternative to TimescaleDB

Show HN: Number Gacha, a gacha game distilled to its essence

Show HN: Hsrs – Type-Safe Haskell Bindings Generator for Rust

Show HN: Files.md – Open-source alternative to Obsidian

Show HN: How Expensive Is Your (Steam) Wishlist?

Show HN: audio.observer – AI news jingles you didn’t ask for

Show HN: LibreOffice-rs – I built a pure-Rust LibreOffice using autoresearch

Show HN: InsForge – Open-source Heroku for coding agents

Show HN: Autodidact – Self-evolving local-first AI agent

Show HN: Gpubook – An order book for GPU compute

Show HN: We missed Winamp, so we built an audio player for macOS

Show HN: Noxu DB, a Rust Port of Berkeley DB Java Edition

Show HN: Clark-Browser – Stealth Chromium

Show HN: Semble – Code search for agents that uses 98% fewer tokens than grep

Show HN: Barstool, a Prettier macOS Menubar

Show HN: Mezz, a curl-able WiFi sandbox for IoT pentesting

Show HN: Resilient, A composable async resilience toolkit for rust

Show HN: Auto-identity-remove – Automated data broker opt-out runner for macOS

Show HN: Spud – cross-platform remote control, optimised for gaming

Show HN: Rocksky – Music scrobbling and discovery on the AT Protocol

Show HN: Watch a neural net learn to play Snake

Show HN: Forge – Guardrails take an 8B model from 53% to 99% on agentic tasks

Comments

Show HN: Gaussian Splat of a Strawberry

Show HN: Forge – Guardrails take an 8B model from 53% to 99% on agentic tasks

Show HN: Haystack – Review the PRs that need human attention

Show HN: Superlog (YC P26) – Observability that installs itself and fixes bugs

Show HN: I made a 3D pose maker for artists

Show HN: Yt-x v0.8.0 – Browse, play, and download YouTube from the terminal

Show HN: Logbox – let Claude monitor your dev logs

Show HN: Id-agent – Token efficient UUID alternative for AI agents

Show HN: Search 67K .AI domains by AI-extracted tags and descriptions

Show HN: Pg_deltax, Apache-licensed alternative to TimescaleDB

Show HN: Number Gacha, a gacha game distilled to its essence

Show HN: Hsrs – Type-Safe Haskell Bindings Generator for Rust

Show HN: Files.md – Open-source alternative to Obsidian

Show HN: How Expensive Is Your (Steam) Wishlist?

Show HN: audio.observer – AI news jingles you didn’t ask for

Show HN: LibreOffice-rs – I built a pure-Rust LibreOffice using autoresearch

Show HN: InsForge – Open-source Heroku for coding agents

Show HN: Autodidact – Self-evolving local-first AI agent

Show HN: Gpubook – An order book for GPU compute

Show HN: We missed Winamp, so we built an audio player for macOS

Show HN: Noxu DB, a Rust Port of Berkeley DB Java Edition

Show HN: Clark-Browser – Stealth Chromium

Show HN: Semble – Code search for agents that uses 98% fewer tokens than grep

Show HN: Barstool, a Prettier macOS Menubar

Show HN: Mezz, a curl-able WiFi sandbox for IoT pentesting

Show HN: Resilient, A composable async resilience toolkit for rust

Show HN: Auto-identity-remove – Automated data broker opt-out runner for macOS

Show HN: Spud – cross-platform remote control, optimised for gaming

Show HN: Rocksky – Music scrobbling and discovery on the AT Protocol

Show HN: Watch a neural net learn to play Snake