Ask Claude Code to find the cheapest spot A100 across your own directory of provider APIs, dry-run provisioning across multiple clouds, compress and cache datasets to reduce egress costs, spin up NUMA-aware Kubernetes clusters, and deploy a GPU snapshot to InferX for fast cold starts, all in conversational language, all running locally with your API keys kept on your machine.
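The first step, picking the cheapest spot A100 from a local provider directory, can be sketched roughly as below. This is a minimal illustration, not the actual implementation: the provider names, the `key_env`/`list_spot_a100` directory fields, and the prices are all hypothetical placeholders standing in for real cloud APIs, and keys are read from local environment variables so they never leave your machine.

```python
import os
from dataclasses import dataclass


@dataclass
class Offer:
    provider: str
    region: str
    usd_per_hour: float


def fetch_offers(directory):
    """Call each provider's spot-price lister with its locally held key."""
    offers = []
    for name, entry in directory.items():
        key = os.environ.get(entry["key_env"], "")  # key stays local
        offers.extend(entry["list_spot_a100"](key))
    return offers


def cheapest(offers):
    """Return the lowest-priced offer by hourly cost."""
    return min(offers, key=lambda o: o.usd_per_hour)


# Stand-in clients with made-up quotes; a real directory would wrap
# each provider's pricing API behind the same callable interface.
directory = {
    "cloud-a": {
        "key_env": "CLOUD_A_KEY",
        "list_spot_a100": lambda key: [Offer("cloud-a", "us-east", 1.10)],
    },
    "cloud-b": {
        "key_env": "CLOUD_B_KEY",
        "list_spot_a100": lambda key: [Offer("cloud-b", "eu-west", 0.89)],
    },
}

best = cheapest(fetch_offers(directory))
print(f"{best.provider} {best.region} ${best.usd_per_hour:.2f}/hr")
```

Keeping the directory as plain local data means adding a provider is just one more entry, and no credentials or pricing queries are routed through a third party.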