frontpage.
newsnewestaskshowjobs

Made with ♥ by @iamnishanth

Open Source @Github

fp.

Open in hackernews

Show HN: ZSE – Open-source LLM inference engine with 3.9s cold starts

https://github.com/Zyora-Dev/zse
9•zyoralabs•1h ago
I've been building ZSE (Z Server Engine) for the past few weeks — an open-source LLM inference engine focused on two things nobody has fully solved together: memory efficiency and fast cold starts.

The problem I was trying to solve: Running a 32B model normally requires ~64 GB VRAM. Most developers don't have that. And even when quantization helps with memory, cold starts with bitsandbytes NF4 take 2+ minutes on first load and 45–120 seconds on warm restarts — which kills serverless and autoscaling use cases.

What ZSE does differently:

Fits 32B in 19.3 GB VRAM (70% reduction vs FP16) — runs on a single A100-40GB

Fits 7B in 5.2 GB VRAM (63% reduction) — runs on consumer GPUs

Native .zse pre-quantized format with memory-mapped weights: 3.9s cold start for 7B, 21.4s for 32B — vs 45s and 120s with bitsandbytes, ~30s for vLLM

All benchmarks verified on Modal A100-80GB (Feb 2026)

It ships with:

OpenAI-compatible API server (drop-in replacement)

Interactive CLI (zse serve, zse chat, zse convert, zse hardware)

Web dashboard with real-time GPU monitoring

Continuous batching (3.45× throughput)

GGUF support via llama.cpp

CPU fallback — works without a GPU

Rate limiting, audit logging, API key auth

Install:

----- pip install zllm-zse zse serve Qwen/Qwen2.5-7B-Instruct For fast cold starts (one-time conversion):

----- zse convert Qwen/Qwen2.5-Coder-7B-Instruct -o qwen-7b.zse zse serve qwen-7b.zse # 3.9s every time

The cold start improvement comes from the .zse format storing pre-quantized weights as memory-mapped safetensors — no quantization step at load time, no weight conversion, just mmap + GPU transfer. On NVMe SSDs this gets under 4 seconds for 7B. On spinning HDDs it'll be slower.

All code is real — no mock implementations. Built at Zyora Labs. Apache 2.0.

Happy to answer questions about the quantization approach, the .zse format design, or the memory efficiency techniques.

Comments

medi_naseri•31m ago
This is so freaking awesome, I am working on a project trying run 10 models on two GPUs, loading/off loading is the only solution I have in mind.

Will try getting this deployed.

Does cold start timings advertised for a condition where there is no other model loaded on GPUs?

Add dormant user re-engagement email and improve name parsing

1•nishiohiroshi•15s ago•0 comments

Harness engineering: leveraging Codex in an agent-first world

https://openai.com/index/harness-engineering/
1•fmihaila•3m ago•0 comments

The man building Team USA's Olympic bobsleds

https://www.adirondackexplorer.org/community-news/people/lake-placid-man-builds-team-usas-olympic...
1•wrsh07•4m ago•0 comments

Ask HN: Apache prefork under crawler load – main domains OK, subdomains fail

1•PhongSGC•5m ago•0 comments

Schedule Recurring Tasks in Cowork

https://support.claude.com/en/articles/13854387-schedule-recurring-tasks-in-cowork
1•ed•5m ago•0 comments

Apple Foundation Models SDK for Python

https://github.com/apple/python-apple-fm-sdk
1•gok•9m ago•0 comments

Americans Are Leaving the U.S. in Record Numbers

https://www.wsj.com/us-news/americans-leaving-the-us-migration-a5795bfa
3•reaperducer•15m ago•1 comments

What I learned from 14,000 AI agent sessions

https://coasty.ai:443/
1•nkov47as•17m ago•1 comments

Why is your Mac WiFi Slow?

https://whyfi.network/
1•jamesgresql•20m ago•2 comments

Disrupting the Gridtide Global Cyber Espionage Campaign

https://cloud.google.com/blog/topics/threat-intelligence/disrupting-gridtide-global-espionage-cam...
1•0in•20m ago•0 comments

Code Is Cheap – Now What?

https://www.zackaryia.com/blog/2026-02-25/code-is-cheap-now-what/
1•Zackaryia•20m ago•0 comments

How to Thrive as a Remote Worker

https://spectrum.ieee.org/remote-work
1•jnord•21m ago•0 comments

Terra would have collapsed regardless of Jane Street

https://fragileequilibrium.substack.com/p/algorithmic-stablecoins-are-provably
2•brandoncarl•23m ago•0 comments

Could a vaccine prevent dementia? Shingles shot data only getting stronger

https://arstechnica.com/health/2026/02/could-a-vaccine-prevent-dementia-shingles-shot-data-only-g...
4•jnord•23m ago•0 comments

RAM now represents 35 percent of bill of materials for HP PCs

https://arstechnica.com/gadgets/2026/02/ram-now-represents-35-percent-of-bill-of-materials-for-hp...
22•jnord•24m ago•2 comments

Show HN: Automatic Image Localization Pipeline

1•yomwolde•25m ago•0 comments

Building Governed AI Agents – A Practical Guide to Agentic Scaffolding

https://developers.openai.com/cookbook/examples/partners/agentic_governance_guide/agentic_governa...
2•mooreds•26m ago•0 comments

Trying to Access ZFS-Encrypted SSD Following a Kernel Update Failure (2025)

https://forums.linuxmint.com/
1•transpute•28m ago•0 comments

Abandoning Resend.com for Email

https://anukari.com/blog/devlog/abandoning-resend-com-for-email
1•humbledrone•28m ago•0 comments

I hate my company's AI initiatives

https://kilohertz.substack.com/p/i-hate-my-companys-ai-initiatives
3•calobher•29m ago•0 comments

Former U.S. Air Force Pilot Arrested for Providing Services to Chinese Military

https://www.justice.gov/opa/pr/former-us-air-force-pilot-arrested-providing-defense-services-chin...
2•737min•32m ago•0 comments

Mitchell Hashimoto's new way of writing code

https://newsletter.pragmaticengineer.com/p/mitchell-hashimoto
2•JSR_FDED•32m ago•0 comments

Solar Thermal Energy Conversion with a Multilevel Inverter Circuit

https://www.mdpi.com/2673-4591/124/1/27
1•PaulHoule•35m ago•0 comments

Show HN: Relocate.ai – AI reports for families moving abroad

http://relocate-ai.178.156.240.80.sslip.io
1•greenbelt_dev•36m ago•0 comments

Open-Source Ecosystem Whale Falls

https://nesbitt.io/2026/02/21/whale-fall.html
2•transpute•37m ago•0 comments

Show HN: Decision Tree Builder

https://unli.xyz/tools/decision-tree.html
1•xk3•37m ago•0 comments

Show HN: ImageCFN – Analog, Resolution-Independent Image Representation

https://web-demo-ten-navy.vercel.app
2•prof_garlic•40m ago•1 comments

Trump admin halts [$250M of] Medicaid payments to Minnesota over fraud claims

https://www.cnn.com/2026/02/25/politics/trump-vance-minnesota-medicaid
4•Tadpole9181•41m ago•0 comments

DeepSeek withholds latest AI model from Nvidia, AMD

https://www.reuters.com/world/china/deepseek-withholds-latest-ai-model-us-chipmakers-including-nv...
6•cmrdporcupine•42m ago•2 comments

Show HN: OpenSwarm – Multi‑Agent Claude CLI Orchestrator for Linear/GitHub

https://github.com/Intrect-io/OpenSwarm
3•unohee•48m ago•0 comments