Show HN: ZSE – Open-source LLM inference engine with 3.9s cold starts

https://github.com/Zyora-Dev/zse

22•zyoralabs•3h ago

I've been building ZSE (Z Server Engine) for the past few weeks — an open-source LLM inference engine focused on two things nobody has fully solved together: memory efficiency and fast cold starts.

The problem I was trying to solve: Running a 32B model normally requires ~64 GB VRAM. Most developers don't have that. And even when quantization helps with memory, cold starts with bitsandbytes NF4 take 2+ minutes on first load and 45–120 seconds on warm restarts — which kills serverless and autoscaling use cases.

What ZSE does differently:

Fits 32B in 19.3 GB VRAM (70% reduction vs FP16) — runs on a single A100-40GB

Fits 7B in 5.2 GB VRAM (63% reduction) — runs on consumer GPUs

Native .zse pre-quantized format with memory-mapped weights: 3.9s cold start for 7B, 21.4s for 32B — vs 45s and 120s with bitsandbytes, ~30s for vLLM

All benchmarks verified on Modal A100-80GB (Feb 2026)

It ships with:

OpenAI-compatible API server (drop-in replacement)

Interactive CLI (zse serve, zse chat, zse convert, zse hardware)

Web dashboard with real-time GPU monitoring

Continuous batching (3.45× throughput)

GGUF support via llama.cpp

CPU fallback — works without a GPU

Rate limiting, audit logging, API key auth

Install:

----- pip install zllm-zse zse serve Qwen/Qwen2.5-7B-Instruct For fast cold starts (one-time conversion):

----- zse convert Qwen/Qwen2.5-Coder-7B-Instruct -o qwen-7b.zse zse serve qwen-7b.zse # 3.9s every time

The cold start improvement comes from the .zse format storing pre-quantized weights as memory-mapped safetensors — no quantization step at load time, no weight conversion, just mmap + GPU transfer. On NVMe SSDs this gets under 4 seconds for 7B. On spinning HDDs it'll be slower.

All code is real — no mock implementations. Built at Zyora Labs. Apache 2.0.

Happy to answer questions about the quantization approach, the .zse format design, or the memory efficiency techniques.

Comments

medi_naseri•1h ago

This is so freaking awesome, I am working on a project trying run 10 models on two GPUs, loading/off loading is the only solution I have in mind.

Will try getting this deployed.

Does cold start timings advertised for a condition where there is no other model loaded on GPUs?

Google API keys weren't secrets, but then Gemini changed the rules

Jimi Hendrix was a systems engineer

First Website (1992)

RAM now represents 35 percent of bill of materials for HP PCs

How will OpenAI compete?

Making MCP cheaper via CLI

The Pleasures and Pains of Coffee (1830)

Artist who “paints” portraits on glass by hitting it with a hammer

Windows 11 Notepad to support Markdown

Gauss's Weekday Algorithm, Visualized

Show HN: ZSE – Open-source LLM inference engine with 3.9s cold starts

Bus stop balancing is fast, cheap, and effective

PA bench: Evaluating web agents on real world personal assistant workflows

Show HN: Respectify – A comment moderator that teaches people to argue better

Large-Scale Online Deanonymization with LLMs

Tech companies shouldn't be bullied into doing surveillance

Self-improving software won't produce Skynet

The First Fully General Computer Action Model

The Om Programming Language

An autopsy of AI-generated 3D slop

Dissecting the CPU-memory relationship in garbage collection (OpenJDK 26)

Learnings from 4 months of Image-Video VAE experiments

Launch HN: TeamOut (YC W22) – AI agent for planning company retreats

Quasi-Zenith Satellite System

Show HN: OpenSwarm – Multi‑Agent Claude CLI Orchestrator for Linear/GitHub

GNU Texmacs

Show HN: I ported Tree-sitter to Go

The Hydrogen Truck Problem Isn't the Truck

Access to a Shared Unix Computer

The Misuses of the University

Show HN: ZSE – Open-source LLM inference engine with 3.9s cold starts

Comments

Google API keys weren't secrets, but then Gemini changed the rules

Jimi Hendrix was a systems engineer

First Website (1992)

RAM now represents 35 percent of bill of materials for HP PCs

How will OpenAI compete?

Making MCP cheaper via CLI

The Pleasures and Pains of Coffee (1830)

Artist who “paints” portraits on glass by hitting it with a hammer

Windows 11 Notepad to support Markdown

Gauss's Weekday Algorithm, Visualized

Show HN: ZSE – Open-source LLM inference engine with 3.9s cold starts

Bus stop balancing is fast, cheap, and effective

PA bench: Evaluating web agents on real world personal assistant workflows

Show HN: Respectify – A comment moderator that teaches people to argue better

Large-Scale Online Deanonymization with LLMs

Tech companies shouldn't be bullied into doing surveillance

Self-improving software won't produce Skynet

The First Fully General Computer Action Model

The Om Programming Language

An autopsy of AI-generated 3D slop

Dissecting the CPU-memory relationship in garbage collection (OpenJDK 26)

Learnings from 4 months of Image-Video VAE experiments

Launch HN: TeamOut (YC W22) – AI agent for planning company retreats

Quasi-Zenith Satellite System

Show HN: OpenSwarm – Multi‑Agent Claude CLI Orchestrator for Linear/GitHub

GNU Texmacs

Show HN: I ported Tree-sitter to Go

The Hydrogen Truck Problem Isn't the Truck

Access to a Shared Unix Computer

The Misuses of the University