These models are only so useful in a multi-turn conversation, but it's still interesting to see what you can pack into a <250 MB model.
I tried ONNX versions earlier, but running language models through them came with too many quirks, and the tokens per second (TPS) weren't impressive. Inspired by svenflow/webgpu-gemma, I put Codex and Claude to the task of writing WGSL to run inference on GGUF versions of these models.
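To give a feel for what those kernels involve, here's a minimal sketch (not this project's actual shader) of a WGSL matrix-vector product over Q8_0-quantized weights, embedded the usual way as a string in TypeScript. It assumes a hypothetical loader that has already repacked each Q8_0 block (an f16 scale plus 32 int8 weights) into a flat f32 scale array and int8s packed four to a u32; the dimensions and names here are made up.

```ts
// Hedged sketch: one possible WGSL matvec kernel over Q8_0 weights.
// Assumes the GGUF loader repacked each Q8_0 block (f16 scale + 32 int8
// weights) into a plain f32 `scales` array plus int8s packed 4-per-u32.
const matvecQ8 = /* wgsl */ `
  @group(0) @binding(0) var<storage, read> scales : array<f32>;   // one per 32-weight block
  @group(0) @binding(1) var<storage, read> quants : array<u32>;   // 4 x int8 per element
  @group(0) @binding(2) var<storage, read> x      : array<f32>;   // input activations
  @group(0) @binding(3) var<storage, read_write> y : array<f32>;  // output

  const N_ROWS : u32 = 2048u;  // hypothetical layer dimensions
  const N_COLS : u32 = 2048u;

  @compute @workgroup_size(64)
  fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
    let row = gid.x;
    if (row >= N_ROWS) { return; }
    var acc = 0.0;
    // Walk the row four weights at a time (one packed u32 per step).
    for (var col = 0u; col < N_COLS; col += 4u) {
      let idx = row * N_COLS + col;
      let scale = scales[idx / 32u];
      let packed = bitcast<i32>(quants[idx / 4u]);
      for (var k = 0u; k < 4u; k++) {
        // extractBits sign-extends the 8-bit weight.
        let w = f32(extractBits(packed, 8u * k, 8u));
        acc += scale * w * x[col + k];
      }
    }
    y[row] = acc;
  }
`;

// Dispatched as one thread per output row, e.g.:
//   pass.dispatchWorkgroups(Math.ceil(nRows / 64));
```

One thread per output row keeps the sketch readable; real kernels tile, vectorize, and fuse far more aggressively than this.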
Once you've loaded this website and a model, both should keep loading offline until your browser evicts the model from its cache.
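One standard way to get that behavior (not necessarily exactly what this site does) is the browser's Cache API: download the GGUF once, store the response, and serve it from the cache on later visits. Making the page shell itself load offline additionally takes a service worker. A minimal sketch of the pattern, with a hypothetical model URL and cache name:

```ts
// Hedged sketch: caching a model file with the browser Cache API so
// later visits work offline. URL and cache name are hypothetical.
const MODEL_URL = "https://example.com/models/tiny.gguf";

async function loadModelBytes(): Promise<ArrayBuffer> {
  const cache = await caches.open("model-cache-v1");

  // Serve from the cache when we have it (including fully offline).
  const hit = await cache.match(MODEL_URL);
  if (hit) return hit.arrayBuffer();

  // First visit: download once and store a copy for next time.
  // The browser may still evict this cache under storage pressure.
  const res = await fetch(MODEL_URL);
  await cache.put(MODEL_URL, res.clone());
  return res.arrayBuffer();
}
```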