Right now the box boots into a very simple web UI where you choose a model and start using it. The API is OpenAI-compatible for chat completions and embeddings. It runs different models depending on the hardware you pick, either a Jetson Orin Nano or an x86 mini-PC with a GPU. All data is stored locally, it supports basic RAG indexing, and it's only exposed on the LAN by default.
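To show what "OpenAI-compatible" means in practice, here's a minimal sketch of talking to the box from the standard openai Python client. The LAN address, port, and model name are placeholders I made up for illustration, not the box's real defaults.

    # Minimal sketch: point the standard OpenAI client at the box on the LAN.
    # The address, port, and model name are placeholders, not the real defaults.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://192.168.1.50:8080/v1",  # hypothetical LAN address of the box
        api_key="not-needed-locally",            # no real key required on the LAN
    )

    # Chat completion against whichever model was selected in the web UI.
    resp = client.chat.completions.create(
        model="llama3.1:8b",
        messages=[{"role": "user", "content": "Summarize our Q3 incident notes."}],
    )
    print(resp.choices[0].message.content)

    # Embeddings go through the same compatible surface.
    emb = client.embeddings.create(model="nomic-embed-text", input="hello world")
    print(len(emb.data[0].embedding))

The idea is that existing code written against api.openai.com should only need the base_url swapped to start running against the box.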
A few things still aren't there yet. There's no multi-user rate limiting. RAG quality is basic, and I'm still improving chunking and reranking. The Orin runs hot under heavy load, so thermal performance needs work. And it's still a prototype rather than a finished consumer product.
On the technical side, it runs containerized model servers using Ollama plus some custom runners. Models load as GGUF or through TensorRT-LLM depending on the hardware. The API layer follows the OpenAI spec. The RAG pipeline uses local embeddings and a vector database, and the software stack is a mix of TypeScript and Python.
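To make the RAG side concrete, here's a rough sketch of just the retrieval step under some assumptions: an Ollama-style /api/embeddings endpoint reachable on the box, an in-memory brute-force search standing in for the actual vector database, and a placeholder address and embedding model I invented for the example.

    # Rough sketch of retrieval: embed -> cosine similarity -> top-k chunks.
    # Assumes an Ollama-style embeddings endpoint; the real pipeline uses a
    # vector database instead of this brute-force in-memory scan.
    import numpy as np
    import requests

    OLLAMA_URL = "http://192.168.1.50:11434/api/embeddings"  # placeholder address
    EMBED_MODEL = "nomic-embed-text"                         # placeholder model

    def embed(text: str) -> np.ndarray:
        r = requests.post(OLLAMA_URL, json={"model": EMBED_MODEL, "prompt": text})
        r.raise_for_status()
        return np.array(r.json()["embedding"], dtype=np.float32)

    def top_k(query: str, chunks: list[str], k: int = 3) -> list[str]:
        q = embed(query)
        q /= np.linalg.norm(q)
        scored = []
        for chunk in chunks:
            v = embed(chunk)
            v /= np.linalg.norm(v)
            scored.append((float(q @ v), chunk))
        scored.sort(reverse=True)
        return [chunk for _, chunk in scored[:k]]

    # The top chunks then get stuffed into the chat completion as context.

This is the naive version of what the box does; the chunking and reranking improvements mentioned above sit on either side of this step.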
I'm looking for feedback from anyone who has built or deployed local inference before: what thermal and power issues you've run into, whether a drop-in OpenAI-compatible box is actually useful to small teams, what hardware setups I should consider, and any honest critiques of the idea.