It's quite different from llama.cpp, vLLM, or ZINC, for example.
It's written in Zig and is super clean, with no dependencies apart from Vulkan Compute itself, of course.
While it supports multiple model families and features like an OpenAI-compatible server, the big thing is that you can run it under strict time budgets inside an existing Vulkan host: think video games, AR/VR apps, edge devices, or robots.
In addition, it exposes a rich probe interface for researching language model internals in close to real time.
I'm not trying to take on existing runtimes; those already exist. What I am trying to do is help inference work cooperatively with time-sensitive applications.
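To make the "cooperative with a time budget" idea concrete, here's a minimal sketch of the general pattern: decode in small steps, and hand control back to the host before the frame deadline. This is an illustration of the scheduling idea only; the class and method names are hypothetical and are not this project's actual API.

```python
import time

class BudgetedDecoder:
    """Hypothetical sketch: run decoding in small steps so a host app
    (e.g. a game's frame loop) caps how much time inference gets per frame."""

    def __init__(self, step_cost_estimate_s):
        self.tokens = []
        # Conservative estimate of one decode step's cost, used to decide
        # whether another step still fits in the remaining budget.
        self.step_cost_estimate_s = step_cost_estimate_s

    def _decode_one_token(self):
        # Stand-in for one real decode step (e.g. one token on the GPU).
        self.tokens.append(len(self.tokens))

    def run_for_budget(self, budget_s):
        """Decode until the next step would overrun the budget, then yield
        back to the host. Returns the number of tokens produced this call."""
        start = time.monotonic()
        produced = 0
        while (time.monotonic() - start) + self.step_cost_estimate_s < budget_s:
            self._decode_one_token()
            produced += 1
        return produced

# A 60 fps host might grant inference ~2 ms of its ~16.6 ms frame:
decoder = BudgetedDecoder(step_cost_estimate_s=0.0001)
per_frame = [decoder.run_for_budget(0.002) for _ in range(3)]
```

The host calls `run_for_budget` once per frame; generation makes progress across frames without ever stalling rendering, which is the cooperative behavior described above.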
Would love it if you checked it out.
Please let me know if you have any questions or ideas. Thanks!