I built this because I wanted to add local LLM inference to a Go project without shelling out to Python or linking against llama.cpp. The whole thing is go get github.com/computerex/dlgo and you're running models.
It supports LLaMA, Qwen 2/3/3.5, Gemma 2/3, Phi-2/4, SmolLM2, Mistral, and Whisper speech-to-text. Architectures are expressed as a declarative per-layer spec resolved at load time, so adding a new model family is mostly just describing its layer structure rather than writing a new forward pass.
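The post doesn't show what the spec actually looks like, but the idea can be sketched roughly like this. Every type, field, and tensor-name template below is hypothetical (loosely following common GGUF naming), not dlgo's real API:

```go
package main

import "fmt"

// Hypothetical sketch of a declarative per-layer spec; dlgo's actual
// types aren't shown in the post, so every name here is invented.
// The tensor name templates loosely follow common GGUF conventions.
type LayerSpec struct {
	AttnNorm string // e.g. "blk.%d.attn_norm.weight"
	QProj    string
	KProj    string
	VProj    string
	OProj    string
}

type ModelSpec struct {
	Name   string
	Layers int
	Layer  LayerSpec // template stamped out once per layer at load time
}

// resolve expands a per-layer tensor name template for layer i, giving
// the concrete name to look up in the GGUF tensor table.
func resolve(tmpl string, i int) string {
	return fmt.Sprintf(tmpl, i)
}

func main() {
	spec := ModelSpec{
		Name:   "llama",
		Layers: 16,
		Layer: LayerSpec{
			AttnNorm: "blk.%d.attn_norm.weight",
			QProj:    "blk.%d.attn_q.weight",
			KProj:    "blk.%d.attn_k.weight",
			VProj:    "blk.%d.attn_v.weight",
			OProj:    "blk.%d.attn_output.weight",
		},
	}
	fmt.Println(resolve(spec.Layer.QProj, 0)) // blk.0.attn_q.weight
}
```

With a structure like this, supporting a new model family means filling in a new ModelSpec rather than writing another forward pass.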
Performance on a single CPU thread with Q4_K_M quantization: ~31 tok/s for LLaMA 3.2 1B, ~48 tok/s for Qwen3 0.6B, ~16 tok/s for Qwen3.5 2B (which has a hybrid attention + Gated Delta Network architecture). Not going to beat llama.cpp on raw speed, but it's fast enough to be useful and the ergonomics of a native Go library are hard to beat.
Supports 25+ GGML quantization formats (Q4_0 through Q8_0, all K-quants, I-quants, F16, BF16, F32). The GGUF parser, dequantization, tokenizer, forward pass, and sampling are all implemented from scratch.
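Q4_0 is the simplest of these formats, and its block layout is documented in ggml: 32 weights share one scale, and each weight is an unsigned 4-bit quant offset by 8, so a quant q decodes to (q - 8) * scale. Here's a self-contained sketch of that dequantization step (this is not dlgo's code; it also takes the scale as a float32 for brevity, whereas real GGUF blocks store it as fp16):

```go
package main

import "fmt"

// dequantQ4_0 decodes one GGML Q4_0 block: 32 weights packed two per
// byte, with byte j holding element j in its low nibble and element
// j+16 in its high nibble. Each 4-bit quant q decodes to (q-8)*scale.
// Real GGUF stores the scale as fp16; float32 is used here for brevity.
func dequantQ4_0(scale float32, qs [16]byte) [32]float32 {
	var out [32]float32
	for j := 0; j < 16; j++ {
		lo := int(qs[j]&0x0F) - 8 // element j
		hi := int(qs[j]>>4) - 8   // element j+16
		out[j] = float32(lo) * scale
		out[j+16] = float32(hi) * scale
	}
	return out
}

func main() {
	var qs [16]byte
	qs[0] = 0x1F // low nibble 15 -> +7, high nibble 1 -> -7
	w := dequantQ4_0(0.5, qs)
	fmt.Println(w[0], w[16], w[1]) // 3.5 -3.5 -4
}
```

The K-quants and I-quants use more elaborate block layouts (sub-block scales, non-uniform grids), but the basic shape of the work is the same: parse the packed block, then expand it to floats.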