After watching too many agents confidently lie in production, I built Director-AI.
It sits between your LLM and the user, scoring every generated token with:
• 0.6× DeBERTa-v3 NLI (contradiction detection)
• 0.4× RAG against your own ChromaDB knowledge base
If coherence < threshold → Rust kernel halts the stream before the token is sent.
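The gating rule above can be sketched in a few lines of Python. The 0.6/0.4 weights come from the post; the function names and the 0.5 cutoff are illustrative stand-ins, not the real Director-AI API:

```python
# Hypothetical sketch of the coherence gate; names and threshold are illustrative.
NLI_WEIGHT = 0.6   # DeBERTa-v3 NLI (contradiction detection) weight
RAG_WEIGHT = 0.4   # ChromaDB knowledge-base support weight
THRESHOLD = 0.5    # assumed cutoff; the post leaves the threshold configurable

def coherence(nli_score: float, rag_score: float) -> float:
    """Weighted blend of NLI entailment and KB support, both in [0, 1]."""
    return NLI_WEIGHT * nli_score + RAG_WEIGHT * rag_score

def should_halt(nli_score: float, rag_score: float) -> bool:
    """True when the kernel must stop the stream before the token is sent."""
    return coherence(nli_score, rag_score) < THRESHOLD
```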
Key technical bits:
• Works with any OpenAI-compatible endpoint (Ollama, vLLM, llama.cpp, Groq, OpenAI, Claude…)
• StreamingKernel + windowed scoring
• GroundTruthStore.add() for easy fact ingestion
• Dual licensing: AGPL open source + commercial (closed-source/SaaS OK)
Honest AggreFact numbers inside (66.2% balanced acc with streaming enabled). Not claiming SOTA on static NLI — the value is in the live gating + custom KB system.
Repo + full examples: https://github.com/anulum/director-ai
Would love feedback on the scoring weights, halt logic, or kernel design. What hallucination problems are you solving today?
soletta•1h ago
anulum•1h ago
*Short answer*: frontier LLMs are excellent at static self-critique, but terrible for *real-time token-by-token streaming guardrails* because of latency, cost, and lack of persistent custom memory.
*Why DeBERTa + RAG wins here*:
- *Latency*: DeBERTa-v3-base + Rust kernel scores every ~4 tokens in ~220 ms (AggreFact eval). A frontier LLM call (GPT-4o/Claude 3.5) is 400–2000 ms per check. You can't do that mid-stream without killing UX.
- *Cost*: Frontier self-checking at scale = real money. This runs fully local/offline after the one-time model download.
- *Custom knowledge*: The 0.4× RAG weight pulls from your GroundTruthStore (ChromaDB). Frontier models don't have a live, updatable external fact base unless you keep stuffing context (expensive + context-window limited).
- *Determinism & auditability*: Small fine-tunable NLI model + fixed vector DB = reproducible decisions. LLMs-as-judges are stochastic and hard to debug in prod.
We’re completely transparent: the NLI scorer alone is *not SOTA* (66.2% balanced accuracy on LLM-AggreFact, 29k samples; see the full table vs MiniCheck/Bespoke/HHEM in the repo). The value is the live system: NLI + user KB + an actual streaming halt that, to our knowledge, no one else ships today.
Full end-to-end comparisons vs. LLM-as-judge in streaming setups are next on the roadmap (happy to run them on any dataset you care about).
Have you tried frontier self-critique in real streaming agents? What broke for you?
Repo benchmarks: https://github.com/anulum/director-ai#benchmarks