The scale problem first: Pipedream has ~10,000 actions.
Full catalog = 750K tokens. GPT-4o context = 128K.
The LLM literally cannot load the tools.
We inverted the architecture.
LLM runs once, offline, at build time — generates the
many ways a human might phrase each intent. 22,614
exemplars compiled into an 8.5MB HDC vector space.
At runtime: pure math, no LLM, 7ms.
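For the curious, a minimal sketch of the idea in Python — random bipolar hypervectors bundled into per-action prototypes at build time, then a pure dot-product lookup at runtime. The action names, exemplar phrasings, and dimensionality here are simplified stand-ins, not the production pipeline (that's in the paper):

```python
import numpy as np

DIM = 10_000                      # illustrative dimensionality
rng = np.random.default_rng(42)
_token_vecs = {}                  # token -> random bipolar hypervector

def token_vec(token):
    if token not in _token_vecs:
        _token_vecs[token] = rng.choice([-1, 1], size=DIM)
    return _token_vecs[token]

def encode(text):
    """Bundle (elementwise sum, then sign) token hypervectors into one vector."""
    return np.sign(sum(token_vec(t) for t in text.lower().split()))

# Build time: compile LLM-generated exemplar phrasings into per-action prototypes.
exemplars = {
    "slack.send_message": ["post a message to slack", "send a slack message"],
    "gmail.send_email":   ["email this report to bob", "send an email via gmail"],
}
prototypes = {action: np.sign(sum(encode(p) for p in phrases))
              for action, phrases in exemplars.items()}

# Runtime: no LLM, just a similarity score against every prototype.
def classify(query):
    q = encode(query)
    return max(prototypes, key=lambda a: float(q @ prototypes[a]))
```

The runtime step scales to thousands of prototypes because it collapses into a single matrix-vector product — which is where the single-digit-millisecond latency comes from.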
Results across 85,125 test queries:
First-pass action accuracy: 89.6%
Action accuracy (with ASK): 100%
App accuracy: 100%
Silent errors: 0%
Latency p95: <13ms
Tokens per query: 0
Model size: 8.5MB, no GPU
Improves with use: Yes
The 10.4% that trigger ASK aren't failures. The system
asks rather than guesses, the correct action is always
in the candidate set, and every resolved ASK strengthens
the model via Hebbian reinforcement. No retraining.
No labeling pipeline. The production model is the
learning model.
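A hedged sketch of how confidence gating plus a Hebbian update could fit together — toy Gaussian prototypes and a made-up margin threshold, not the actual gate or update rule from the paper:

```python
import numpy as np

DIM = 10_000
rng = np.random.default_rng(0)

# Hypothetical prototype store: action name -> real-valued prototype vector.
prototypes = {
    "slack.send_message": rng.normal(size=DIM),
    "gmail.send_email":   rng.normal(size=DIM),
}

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def classify_or_ask(query_vec, margin=0.05):
    """Confidence gate: answer only when top-1 clearly beats top-2,
    otherwise return an ASK with the candidate set."""
    ranked = sorted(prototypes,
                    key=lambda a: cosine(query_vec, prototypes[a]),
                    reverse=True)
    gap = (cosine(query_vec, prototypes[ranked[0]])
           - cosine(query_vec, prototypes[ranked[1]]))
    if gap >= margin:
        return ranked[0], None
    return None, ranked[:5]          # ASK: let the user disambiguate

def hebbian_update(query_vec, resolved_action, lr=0.1):
    """After the user resolves an ASK, nudge the chosen prototype toward
    the query vector — an in-place update, no retraining pass."""
    prototypes[resolved_action] = prototypes[resolved_action] + lr * query_vec
```

An ambiguous query (roughly equidistant from two prototypes) triggers ASK; once resolved, the reinforced prototype wins outright the next time the same phrasing shows up.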
GPT-4o hits 98.5% accuracy — when given a pre-filtered
shortlist of 200 actions and human-readable action keys.
It can't do app selection across 3,146 apps. HDC does
the whole thing in 7ms and gets better with every use.
We benchmarked honestly — full methodology in the paper
including where GPT-4o wins.
Patent pending: US 63/969,729
Covers: build-time LLM→HDC pipeline, confidence gating,
Hebbian self-improvement without retraining.
White paper + benchmark + Docker quickstart: https://github.com/glyphh-ai/model-pipedream