Why this matters: NVIDIA's TRT-LLM explicitly blocks desktop Blackwell from FP4 — the error literally says "FP4 Gemm not supported before Blackwell, nor GeForce Blackwell." The RTX 5090, PRO 6000, and DGX Spark all use SM120 — same FP4 tensor cores as the B100/B200 datacenter chips (SM100). The lock is artificial product segmentation, not a hardware limitation.
CUTLASS 4.2+ already ships SM120 FP4 kernels. They're compiled into vLLM. The problem is purely dispatch logic — Python-level capability checks that only recognize SM100, not SM120.
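The shape of the fix is roughly the following. This is a hypothetical sketch, not vLLM's actual code — the real check lives inside vLLM's quantization backend selection and the actual PR may look different:

```python
# Hypothetical sketch of the dispatch fix (names are illustrative,
# not vLLM's real internals). vLLM gates FP4 on compute capability;
# SM120 just needs to be accepted alongside SM100.

def cutlass_fp4_supported(capability: tuple[int, int]) -> bool:
    """Return True if this GPU generation has FP4 tensor cores.

    `capability` is (major, minor), the format returned by
    torch.cuda.get_device_capability().
    """
    sm = capability[0] * 10 + capability[1]
    # Before the fix: `return sm == 100` -- datacenter Blackwell only.
    # After: also accept SM120 (GeForce / workstation Blackwell).
    return sm in (100, 120)

print(cutlass_fp4_supported((12, 0)))  # SM120: RTX 5090 / PRO 6000
print(cutlass_fp4_supported((9, 0)))   # SM90: Hopper, no FP4 tensor cores
```

The point is that nothing below this check has to change: the SM120 CUTLASS kernels are already compiled in.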
Setup (vLLM 0.17.0, stable pip install):
CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server \
  --model Sehyo/Qwen3.5-122B-A10B-NVFP4 \
  --port 8100 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.85 \
  --compilation-config '{"cudagraph_mode": "piecewise"}'
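Once the server reports ready, it speaks the standard OpenAI-compatible API on the port given above. A quick smoke test just POSTs a completion request (prompt and max_tokens here are arbitrary; the model field must match --model exactly):

```python
# Build the request body for the vLLM OpenAI-compatible endpoint.
# Send it with requests/curl once the server is up; this snippet
# only constructs and prints it.
import json

payload = {
    "model": "Sehyo/Qwen3.5-122B-A10B-NVFP4",  # must match --model
    "prompt": "Explain NVFP4 in one sentence.",
    "max_tokens": 128,
}
print("POST http://localhost:8100/v1/completions")
print(json.dumps(payload, indent=2))
```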
Key gotchas: (1) Do NOT pass the --quantization flag; the model uses the compressed-tensors format and vLLM auto-detects it. (2) Full CUDA graphs OOM; use piecewise mode (31 tok/s vs 12 tok/s eager). (3) Python 3.14 breaks numba; stick with 3.13.
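Gotcha (1) works because the quantization method is declared in the checkpoint's own config.json, which vLLM reads at load time. A minimal sketch of what the auto-detection keys off — the exact keys follow the Hugging Face quantization_config convention, so treat the field names as illustrative:

```python
# Sketch of why --quantization must be omitted: the method comes
# from the checkpoint's config.json, not from a CLI flag.
# Field names follow the HF quantization_config convention (illustrative).
import json

def detect_quant_method(config: dict) -> "str | None":
    qc = config.get("quantization_config")
    return qc.get("quant_method") if qc else None

# Minimal stand-in for an NVFP4 compressed-tensors config.json:
cfg = {"quantization_config": {"quant_method": "compressed-tensors"}}
print(detect_quant_method(cfg))  # compressed-tensors
print(detect_quant_method({}))   # None -> unquantized model
```

Passing --quantization with a conflicting value fights this auto-detection instead of helping it.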
Results: 31 tok/s on 1 GPU, vs 54 tok/s on 2 GPUs with Q8_0 llama.cpp. Half the hardware, ~60% of the speed, ~98% of the quality.
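Worth normalizing those numbers per card: the single-GPU FP4 setup actually delivers more throughput per GPU than the dual-GPU Q8_0 setup.

```python
# Per-GPU throughput from the numbers above.
fp4_single = 31 / 1  # vLLM NVFP4, one GPU
q8_dual = 54 / 2     # llama.cpp Q8_0, two GPUs
print(f"FP4, 1 GPU:  {fp4_single:.0f} tok/s per GPU")
print(f"Q8_0, 2 GPUs: {q8_dual:.0f} tok/s per GPU")
print(f"total-speed ratio: {31 / 54:.0%}")
```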
The broader point: SM120 and SM100 share the same FP4 tensor core architecture. CUTLASS has the kernels. The frameworks just need to route SM120 to them. A 122B MoE model on a single desktop GPU at 31 tok/s was datacenter-only six months ago.
Relevant issues: vLLM #33416, SGLang #18954, CUTLASS #2800. We're submitting a PR (~10 lines of Python).