I benchmarked it against Ollama using the exact same GGUF files on an RTX 4070 Ti SUPER:
GPU (dlgo Vulkan vs Ollama CUDA):
- Qwen3.5 0.8B: 239 tok/s vs 187 tok/s (+28%)
- Gemma 3 270M: 456 tok/s vs 503 tok/s (−9%)
- SmolLM2 360M: 420 tok/s vs 451 tok/s (−7%)
- 10 models tested, within 7–25% of CUDA on standard architectures

CPU (dlgo vs Ollama, same GGUF):
- 6 of 10 models within 9% of Ollama
- 2 models faster (Gemma 270M +3%, SmolLM2 360M +7%)

The Qwen3.5 result surprised me. Qwen3.5 uses a hybrid Gated Delta Net + attention architecture (SSM layers with a recurrent delta rule). I wrote six custom Vulkan compute shaders for it (conv1d, delta rule recurrence, L2 normalization, sigmoid gating), and the fused Vulkan pipeline ended up outperforming llama.cpp's CUDA kernels.
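To make the recurrence concrete, here is a minimal CPU sketch of one delta rule step, the core of what those shaders compute per token. This is my own scalar-gate simplification for illustration (real Gated Delta Net kernels also L2-normalize k and q and run per attention head); it is not dlgo's actual shader code, and `deltaStep` is a hypothetical name:

```go
package main

import "fmt"

// deltaStep applies one step of a simplified gated delta rule:
//   S <- g*S + beta * (v - S·k) k^T ;  o = S·q
// S is a dim x dim recurrent state matrix stored row-major.
func deltaStep(S, k, v, q []float32, g, beta float32, dim int) []float32 {
	// err = v - S·k: the prediction error against current memory.
	err := make([]float32, dim)
	for i := 0; i < dim; i++ {
		var sk float32
		for j := 0; j < dim; j++ {
			sk += S[i*dim+j] * k[j]
		}
		err[i] = v[i] - sk
	}
	// S = g*S + beta * err k^T: decay old memory, write the correction.
	for i := 0; i < dim; i++ {
		for j := 0; j < dim; j++ {
			S[i*dim+j] = g*S[i*dim+j] + beta*err[i]*k[j]
		}
	}
	// o = S·q: read the state out with the query.
	o := make([]float32, dim)
	for i := 0; i < dim; i++ {
		for j := 0; j < dim; j++ {
			o[i] += S[i*dim+j] * q[j]
		}
	}
	return o
}

func main() {
	dim := 2
	S := make([]float32, dim*dim)
	k := []float32{1, 0}
	v := []float32{0.5, 0.25}
	// Reading back along k right after writing v recovers v.
	o := deltaStep(S, k, v, k, 1.0, 1.0, dim)
	fmt.Println(o) // [0.5 0.25]
}
```

Because the state update is a rank-1 write per token, the whole step fuses naturally into one compute dispatch, which is where a hand-written Vulkan pipeline can beat generic kernels.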
Vulkan means this runs on AMD, Intel, and mobile GPUs too, not just NVIDIA. On the models I tested, dlgo was 66–126% faster than Ollama's own Vulkan backend.
Supports LLaMA, Qwen2/3/3.5, Gemma, Phi, SmolLM2, Mistral, plus Whisper speech-to-text. 25+ quantization formats (Q4_0 through Q8_0, all K-quants).
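As a flavor of what those quant formats involve, here is a sketch of dequantizing one Q4_0 block: 32 weights packed two 4-bit values per byte, all sharing one scale. In the actual GGUF format the scale is stored as fp16; I pass it as float32 here to keep the example self-contained, and `dequantQ4_0` is an illustrative name, not dlgo's API:

```go
package main

import "fmt"

// dequantQ4_0 expands one Q4_0 block of 32 weights. Byte j holds
// element j in its low nibble and element j+16 in its high nibble;
// each 4-bit value q maps to (q - 8) * scale.
func dequantQ4_0(scale float32, qs [16]byte) [32]float32 {
	var out [32]float32
	for j := 0; j < 16; j++ {
		lo := int(qs[j]&0x0F) - 8 // elements 0..15
		hi := int(qs[j]>>4) - 8   // elements 16..31
		out[j] = float32(lo) * scale
		out[j+16] = float32(hi) * scale
	}
	return out
}

func main() {
	var qs [16]byte
	qs[0] = 0x1F // low nibble 15 -> +7, high nibble 1 -> -7
	w := dequantQ4_0(0.5, qs)
	fmt.Println(w[0], w[16], w[1]) // 3.5 -3.5 -4
}
```

K-quants layer more structure on top (per-sub-block scales and minimums), but the same nibble-unpacking pattern is the common core.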
Three lines to run:
```go
model, _ := dlgo.LoadLLM("model.gguf")
response, _ := model.Chat("", "What is the capital of France?")
fmt.Println(response)
```