Here’s a tool you might find useful: a local search engine for your private knowledge bases, wikis, logs, documentation, and complex codebases.
Instead of stuffing raw documents into every call, you index your data once and query it with simple prompts like “how does X work?” to get grounded, cited answers from your own data. Your main agent can also delegate low-level RAG questions to a smaller local model for token efficiency, while a stronger frontier model handles higher-level reasoning.
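The delegation split described above can be sketched as a simple router: low-level retrieval questions go to a small local model backed by the index, everything else goes to the frontier model. This is an illustrative sketch, not qi's actual API; the tier names and the prefix heuristic are made up for the example.

```python
# Hypothetical sketch of the delegation pattern: route cheap
# retrieval-style questions to a local model over the index, and
# higher-level reasoning tasks to a stronger frontier model.
# The heuristic and tier names are illustrative, not qi's API.

RETRIEVAL_HINTS = ("how does", "where is", "what does", "find", "show me")

def route_query(question: str) -> str:
    """Return which model tier should handle the question."""
    q = question.lower().strip()
    if q.startswith(RETRIEVAL_HINTS):
        return "local-retrieval"    # small local model + index, cheap tokens
    return "frontier-reasoning"     # stronger model, full context

print(route_query("How does X work?"))      # a retrieval-style question
print(route_query("Refactor this module"))  # a reasoning task
```

In a real setup the heuristic would likely be replaced by the orchestrating model's own tool-use decision, but the shape is the same: the expensive model never sees the raw documents, only the grounded answer.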
That makes it a good fit for setups that pair a local model such as Gemma 4 with a more capable orchestration model. Tokens go down, latency improves, and the whole system becomes more efficient. qi can also run fully offline, so you keep full control over your data, models, and infrastructure.
You can plug in whatever model stack you prefer, whether that is Ollama, LM Studio, llama.cpp, MLX, or cloud APIs, which makes it easy to balance cost, speed, and quality. It also integrates cleanly into agent workflows, including as a Claude Code plugin, so SOTA models can delegate retrieval and lightweight knowledge queries instead of wasting context.
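Swapping backends usually comes down to pointing at a different local endpoint. As one concrete example, Ollama serves a generate API at `http://localhost:11434/api/generate`. The helper below only builds the request payload; actually sending it requires a running Ollama instance, and the model name is just a placeholder.

```python
# Minimal sketch of targeting a local Ollama backend. This only
# constructs the HTTP request; it does not send it. The model name
# is a placeholder for whatever you have pulled locally.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_ollama_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_ollama_request("gemma3", "How does X work?")
# urllib.request.urlopen(req) would return JSON with a "response" field
# once a local Ollama server is running.
```

Pointing the same code at LM Studio or llama.cpp's server is mostly a matter of changing the URL and payload shape, which is what makes the backend choice a cost/speed/quality knob rather than an architectural commitment.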
agenexus•1h ago
The delegation pattern is underrated — using a smaller local model for retrieval while a frontier model handles reasoning is exactly how multi-agent systems should work. Each model doing what it's best at. The next natural step is agents discovering and delegating to each other dynamically rather than having those relationships hardcoded upfront. That's where the real efficiency gains are.