Bonsai: A Local Agentic AI Harness Built Around Small Models
Since last year, I've been teaching a course at UT Southwestern Medical Center on how to build Agentic AI systems and harnesses for specialized domains.
One thing I've noticed is that as OpenAI, Google, and Anthropic keep raising API prices, running frontier models in the cloud keeps getting more expensive. At the same time, many people use ChatGPT the way they used Google years ago: asking questions and looking up information. Most of these use cases simply don't justify paying for GPT-5.5, Opus 4.8, or other expensive flagship models.
That led me to explore a different idea: combining efficient local models with a purpose-built harness that provides tools, memory, and domain-specific skills.
Part of the reason I named this project Bonsai is that I had some interactions with Stanford's Prism Lab. The architecture follows an Agent + Skills + Memory design. Memory is implemented locally using embeddings and SQLite, allowing semantic retrieval through cosine similarity search. This helps compensate for the limited context windows of smaller local models.
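To make the memory idea concrete, here is a minimal sketch of embeddings stored in SQLite and retrieved by cosine similarity. It is not Bonsai's actual code: the table layout, the MemoryStore class, and especially the embed() function (a toy hashed bag-of-words standing in for a real local embedding model) are assumptions made for illustration.

```python
import hashlib
import math
import sqlite3
import struct

DIM = 256  # toy dimension; a real embedding model would dictate this


def embed(text: str) -> list[float]:
    """Placeholder embedding (assumption): hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        bucket = int(hashlib.sha256(word.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def pack(vec: list[float]) -> bytes:
    """Serialize a vector as float32 values for a BLOB column."""
    return struct.pack(f"{len(vec)}f", *vec)


def unpack(blob: bytes) -> list[float]:
    return list(struct.unpack(f"{len(blob) // 4}f", blob))


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class MemoryStore:
    """Hypothetical memory layer: one SQLite table of (text, embedding)."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS memory ("
            "id INTEGER PRIMARY KEY, text TEXT NOT NULL, embedding BLOB NOT NULL)"
        )

    def add(self, text: str) -> None:
        self.db.execute(
            "INSERT INTO memory (text, embedding) VALUES (?, ?)",
            (text, pack(embed(text))),
        )
        self.db.commit()

    def search(self, query: str, k: int = 3) -> list[tuple[float, str]]:
        """Brute-force scan: score every row, return the top k by cosine."""
        q = embed(query)
        rows = self.db.execute("SELECT text, embedding FROM memory").fetchall()
        scored = [(cosine(q, unpack(blob)), text) for text, blob in rows]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


if __name__ == "__main__":
    store = MemoryStore()
    store.add("the user prefers answers with cited sources")
    store.add("the clinic portal login page is at /portal")
    for score, text in store.search("where is the clinic portal", k=1):
        print(f"{score:.2f}  {text}")
```

Only the top-k retrieved snippets would be placed into the prompt, which is how this kind of store can stretch a small context window. A linear scan like this is fine for a few thousand rows; a larger store would likely want an index or a vector extension.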
I believe this approach can make small models much more capable than their parameter count would suggest.
Although Anthropic has never publicly disclosed the exact size of Claude Sonnet, my analysis suggests it is likely a Mixture-of-Experts (MoE) model with tens of billions of active parameters and hundreds of billions of total parameters.
The active parameter count determines how much computation each generated token costs at inference time, while the total parameter count sets how much knowledge the model can store. My hypothesis is that a dense thinking model with only tens of billions of parameters can still deliver strong performance if paired with effective harness engineering, specialized tools, memory, and retrieval systems.
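The active-versus-total distinction can be put into rough numbers. The sketch below uses the common rule of thumb that a forward pass costs about 2 FLOPs per active parameter per token, and that weight memory is total parameters times bytes per parameter. The model shapes are purely illustrative assumptions, not measurements or claims about any real model.

```python
from dataclasses import dataclass


@dataclass
class ModelShape:
    name: str
    active_params: float  # parameters used per token
    total_params: float   # parameters that must be held in memory

    def flops_per_token(self) -> float:
        # Rule of thumb: ~2 FLOPs per active parameter per generated token.
        return 2.0 * self.active_params

    def weight_gib(self, bytes_per_param: float) -> float:
        # Memory needed just to hold the weights (no KV cache, no activations).
        return self.total_params * bytes_per_param / 2**30


# Hypothetical shapes chosen only to illustrate the arithmetic.
shapes = [
    ModelShape("dense-4B (illustrative)", 4e9, 4e9),
    ModelShape("dense-30B (illustrative)", 30e9, 30e9),
    ModelShape("moe-30B-active/300B-total (illustrative)", 30e9, 300e9),
]

if __name__ == "__main__":
    for m in shapes:
        print(
            f"{m.name}: {m.flops_per_token() / 1e9:.0f} GFLOPs/token, "
            f"{m.weight_gib(0.5):.1f} GiB at 4-bit, "
            f"{m.weight_gib(2.0):.1f} GiB at 16-bit"
        )
```

Under these assumptions, the dense 30B and the MoE cost about the same compute per token, but the MoE needs ten times the memory for its weights, which is the part that makes it hard to run locally.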
If that hypothesis is correct, local models could satisfy the majority of everyday ChatGPT-style use cases without requiring expensive cloud inference.
As a first step, I'm releasing an experimental version of Bonsai.
Bonsai communicates directly with a local Google Chrome instance and provides a collection of browser-oriented tools that allow a local LLM to interact with the web in an agentic fashion. The default model is Google Gemma 4B, although Qwen models can also be used.
(One reason I chose Gemma as the default is that some government agencies and schools in Texas prohibit the use of Chinese open-source models.)
In the screenshot, the left side shows the chat interface and the right side shows the agent operating the browser in real time.
The harness includes many browser-specific tools, including JavaScript injection capabilities that allow the agent to locate page elements, inspect DOM structures, click buttons, fill forms, and perform other browser interactions.
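As a rough idea of what a JavaScript-injection tool can look like, the sketch below builds small JavaScript snippets from tool arguments. The function names and the returned object shapes are hypothetical, and how Bonsai actually delivers scripts to Chrome is not shown here. The one point worth copying is that model-supplied strings are encoded with json.dumps so they land in the script as safe string literals.

```python
import json


def find_elements_script(selector: str, limit: int = 20) -> str:
    """JS that lists up to `limit` elements matching a CSS selector."""
    sel = json.dumps(selector)  # safe JS string literal
    return (
        "(() => {"
        f" const nodes = Array.from(document.querySelectorAll({sel})).slice(0, {int(limit)});"
        " return nodes.map((el, i) => ({index: i, tag: el.tagName.toLowerCase(),"
        " text: (el.innerText || '').trim().slice(0, 80)}));"
        " })()"
    )


def click_script(selector: str) -> str:
    """JS that clicks the first element matching a CSS selector."""
    sel = json.dumps(selector)
    return (
        "(() => {"
        f" const el = document.querySelector({sel});"
        " if (!el) return {ok: false, error: 'not found'};"
        " el.click(); return {ok: true};"
        " })()"
    )


def fill_script(selector: str, value: str) -> str:
    """JS that sets an input's value and fires an 'input' event."""
    sel = json.dumps(selector)
    val = json.dumps(value)
    return (
        "(() => {"
        f" const el = document.querySelector({sel});"
        " if (!el) return {ok: false, error: 'not found'};"
        f" el.value = {val};"
        " el.dispatchEvent(new Event('input', {bubbles: true}));"
        " return {ok: true};"
        " })()"
    )


if __name__ == "__main__":
    print(fill_script("input[name='q']", 'small "local" models'))
```

Each script is a self-invoking function that returns a small result object, so the harness can hand the outcome back to the model as the tool result.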
Current features include:
Browser integration
VectorDB-based semantic memory for small-context local models
Custom browser-oriented skills and tools
Local embedding + SQLite memory system
Agentic web navigation
WebRTC-based communication layer (lower-level than MCP)
The current release was compiled for Windows and requires NVIDIA CUDA.
I've also added an Apple Silicon (M-series) Mac version to the same download directory.
The default model is a 4B thinking model because agent workflows benefit significantly from high token throughput. On my test system (Windows 11 + RTX 4090), Bonsai reaches roughly 140 tokens/sec. On an M4 Mac using Metal, I see around 50 tokens/sec.
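To show why throughput matters for agent loops, here is a back-of-the-envelope calculation using the two throughput figures above. The step count and tokens per step are assumptions picked for illustration, and the estimate ignores prompt processing, tool latency, and page load times.

```python
def generation_seconds(steps: int, tokens_per_step: int, tokens_per_sec: float) -> float:
    """Time spent purely on token generation across an agent run."""
    return steps * tokens_per_step / tokens_per_sec


# Assumed workload: 10 agent steps, 400 generated tokens (thinking + tool call) each.
STEPS = 10
TOKENS_PER_STEP = 400

if __name__ == "__main__":
    for label, tps in [("RTX 4090", 140.0), ("M4 Mac", 50.0)]:
        secs = generation_seconds(STEPS, TOKENS_PER_STEP, tps)
        print(f"{label}: {secs:.1f} s")
    # RTX 4090: 28.6 s
    # M4 Mac: 80.0 s
```

Because every step waits on the previous one, generation time adds up linearly with the number of steps, so a faster small model can finish a multi-step browsing task well before a slower large one produces its first few actions.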
I'm curious whether others think specialized harness engineering can make small local models practical for everyday AI workflows, rather than relying exclusively on increasingly large cloud-hosted models.
coolwulf•4m ago
Download https://drive.google.com/drive/folders/1YUQ3tmcBSLEyBKLi5JdJ...
Screenshot https://i.imgur.com/9MacuXk.png