Features:
- Single .zse file format (model + tokenizer + config embedded)
- Zero network calls on load; works completely offline
- Dual INT4 kernel backend (ZSE Kernel + ZSE bnb Kernel)
- Intelligent layer auto-selects optimal kernel for your hardware
- Fast cold starts for serverless deployments

Benchmarks (H200, Qwen 2.5):
ZSE Kernel:
  7B  → 5.67 GB VRAM, 37 tok/s, 5.7s cold start
  14B → 10.08 GB VRAM, 21 tok/s, 10.5s cold start
  32B → 19.47 GB VRAM, 11 tok/s, 20.4s cold start
  72B → 41.54 GB VRAM, 6 tok/s, 51.8s cold start

ZSE bnb Kernel:
  7B  → 6.57 GB VRAM, 46 tok/s, 6.0s cold start
  14B → 11.39 GB VRAM, 28 tok/s, 7.1s cold start
  32B → 22.27 GB VRAM, 20 tok/s, 20.8s cold start
  72B → 47.05 GB VRAM, 16 tok/s, 53.0s cold start
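
The two kernels trade VRAM for throughput, and the numbers above are enough to reason about that trade-off. The sketch below is only an illustration: it copies the benchmark figures into a table and picks the faster kernel whose reported VRAM fits a given budget. The names `BENCHMARKS`, `pick_kernel` and `tradeoff` are hypothetical, this is not how ZSE's own auto-selection works, and the budget check ignores any headroom needed beyond the reported figure.

```python
# Benchmark figures copied from the lists above (H200, Qwen 2.5).
# Each entry: (VRAM in GB, tokens per second, cold start in seconds).
BENCHMARKS = {
    "ZSE Kernel": {
        "7B": (5.67, 37, 5.7),
        "14B": (10.08, 21, 10.5),
        "32B": (19.47, 11, 20.4),
        "72B": (41.54, 6, 51.8),
    },
    "ZSE bnb Kernel": {
        "7B": (6.57, 46, 6.0),
        "14B": (11.39, 28, 7.1),
        "32B": (22.27, 20, 20.8),
        "72B": (47.05, 16, 53.0),
    },
}


def pick_kernel(model_size, vram_budget_gb):
    """Return the name of the fastest kernel whose reported VRAM fits, or None.

    Hypothetical helper for illustration; not part of ZSE.
    """
    candidates = []
    for name, rows in BENCHMARKS.items():
        vram_gb, tok_s, _cold_start_s = rows[model_size]
        if vram_gb <= vram_budget_gb:
            candidates.append((tok_s, name))
    # max() compares tokens per second first, so the fastest fitting kernel wins.
    return max(candidates)[1] if candidates else None


def tradeoff(model_size):
    """Return (extra VRAM %, extra throughput %) of the bnb kernel vs. the ZSE kernel."""
    zse_vram, zse_tps, _ = BENCHMARKS["ZSE Kernel"][model_size]
    bnb_vram, bnb_tps, _ = BENCHMARKS["ZSE bnb Kernel"][model_size]
    extra_vram_pct = (bnb_vram / zse_vram - 1.0) * 100.0
    extra_tps_pct = (bnb_tps / zse_tps - 1.0) * 100.0
    return extra_vram_pct, extra_tps_pct


for size in ("7B", "14B", "32B", "72B"):
    vram_pct, tps_pct = tradeoff(size)
    print(f"{size}: bnb kernel uses {vram_pct:.1f}% more VRAM for {tps_pct:.1f}% more tok/s")
```

With a 45 GB budget for the 72B model, for example, only the ZSE Kernel fits (41.54 GB), so `pick_kernel("72B", 45)` returns it even though the bnb kernel is faster.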

Usage:
  pip install zllm-zse
  zse convert Qwen/Qwen2.5-7B-Instruct -o model.zse
  zse serve model.zse --port 8000
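
To get a rough feel for startup time on your own hardware, one option is to time how long the server takes to accept TCP connections on the port given above. The sketch below uses only the Python standard library; `wait_for_port` is a hypothetical helper, not part of ZSE. It assumes the server is reachable on 127.0.0.1 and makes no assumption about the HTTP API it exposes. Whether the port opens before or after the model finishes loading is also an assumption to verify, so treat the result as an approximation rather than the cold start figure reported in the benchmarks.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=120.0, poll_s=0.1):
    """Block until a TCP connection to host:port succeeds.

    Returns the elapsed time in seconds, or raises TimeoutError
    if the port is still unreachable after timeout_s.
    """
    start = time.monotonic()
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=1.0):
                return time.monotonic() - start
        except OSError:
            # Connection refused or timed out: the server is not up yet.
            if time.monotonic() - start > timeout_s:
                raise TimeoutError(f"{host}:{port} not reachable after {timeout_s}s")
            time.sleep(poll_s)


# Example (run right after launching `zse serve model.zse --port 8000`):
#   elapsed = wait_for_port("127.0.0.1", 8000)
#   print(f"port 8000 accepted a connection after {elapsed:.1f}s")
```

Polling with a short sleep keeps the measurement granular to roughly `poll_s` seconds without busy-waiting.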