Custom FP4 CUDA Kernel – 129 Tflops on DGX Spark with Pre-Quantized Weight Cache

https://forums.developer.nvidia.com/t/custom-fp4-cuda-kernel-129-tflops-on-dgx-spark-with-pre-quantized-weight-cache/361600
1•vkaufmann•1h ago

Comments

vkaufmann•1h ago
I went all in and wrote a custom FP4 GEMM kernel on top of CUTLASS 3.8. Along the way I discovered that FP4 doesn't actually help training, because there is no backward pass. What came out of it, though, is something I haven't seen anywhere else for consumer Blackwell: a standalone FP4 GEMM library with a pre-quantized weight cache that hits 85-129 TFLOPS on the Spark.

The approach: quantize weights once at model load, and quantize only the activations on the fly per call. Integrated into a full transformer (GPT-OSS-4.2B, 24 layers, 288 GEMM calls per forward pass), it runs 1.3-2.3x faster than BF16 at inference-relevant batch sizes with 4x memory savings. I tested it on both 4.2B and 20B models; the 20B drops from 43.4 GB to 4.0 GB with FP4 weights (10.8x compression). There is no dependency on vLLM, TRT-LLM, or sglang; it is just a library you can call from any Python code.
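To make the "quantize weights once, activations per call" idea concrete, here is a rough pure-Python sketch. It is not the library's API and not the CUDA kernel: the names `CachedLinear`, `quantize_vector`, and `dot_quantized`, the block size of 16, and the plain float per-block scale are all assumptions for illustration. The only hard fact it relies on is the set of magnitudes an FP4 E2M1 value can represent.

```python
# Hypothetical sketch of a pre-quantized weight cache; NOT the fp4-cuda-kernel API.

# Magnitudes representable in FP4 E2M1 (the sign bit is handled separately).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16  # assumed number of elements sharing one scale factor


def quantize_block(values):
    """Return (scale, codes); each code is a signed value from FP4_GRID."""
    amax = max(abs(v) for v in values)
    # Map the largest magnitude in the block onto the largest FP4 value.
    scale = amax / FP4_GRID[-1] if amax > 0 else 1.0
    codes = []
    for v in values:
        mag = abs(v) / scale
        nearest = min(FP4_GRID, key=lambda g: abs(g - mag))
        codes.append(nearest if v >= 0 else -nearest)
    return scale, codes


def quantize_vector(vec):
    """Split a vector into blocks and quantize each block independently."""
    return [quantize_block(vec[i:i + BLOCK]) for i in range(0, len(vec), BLOCK)]


def dot_quantized(qa, qb):
    """Dot product of two block-quantized vectors, rescaled per block."""
    total = 0.0
    for (sa, ca), (sb, cb) in zip(qa, qb):
        total += sa * sb * sum(x * y for x, y in zip(ca, cb))
    return total


class CachedLinear:
    """Weights are quantized once here; activations are quantized on every call."""

    def __init__(self, weight_rows):
        # The "pre-quantized weight cache": done once, at model load.
        self.q_rows = [quantize_vector(row) for row in weight_rows]

    def __call__(self, x):
        qx = quantize_vector(x)  # on-the-fly activation quantization
        return [dot_quantized(q_row, qx) for q_row in self.q_rows]


layer = CachedLinear([[1.0, -2.0, 3.0, 6.0]])
print(layer([6.0, 3.0, 1.5, 0.5]))
```

The example values are chosen to be exactly representable in FP4, so the result matches the unquantized dot product; real weights and activations would pick up rounding error. The point of the structure is that the per-call cost only includes quantizing `x`, since the cached `q_rows` are reused across calls.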

Full source is open: GitHub - VincentKaufmann/fp4-cuda-kernel: Custom FP4 GEMM kernel for DGX Spark / RTX 50 Series (SM120/SM121). 143 TFLOPS, 5-9x faster than BF16. Built on CUTLASS 3.8.

Why This Library Exists

No existing path gives you hardware FP4 on SM121 as a standalone library.

Find the complete post here: https://forums.developer.nvidia.com/t/custom-fp4-cuda-kernel...

Repo: https://github.com/VincentKaufmann/fp4-cuda-kernel

Chained Assignment in Python Bytecode

https://loriculus.org/blog/python-chained-assignment/
1•rbanffy•15s ago•0 comments

Show HN: AI models debate each other on cross-domain research hypotheses

https://www.aegismind.app/discoveries/2af7c10d-18f8-42d5-8c98-bb957af46086
1•aegismind_app•26s ago•0 comments

Inventing the Lisa User Interface

https://archive.org/details/Inventing_the_LISA_User_Interface
1•rbanffy•3m ago•0 comments

Ensuring Smartphones Have Not Been Tampered With

https://publishing.aip.org/publications/latest-content/ensuring-smartphones-have-not-been-tampere...
1•giuliomagnifico•3m ago•0 comments

Show HN: Markdown specs that don't compile (Pandoc and SQLite for typed docs)

https://github.com/SpecIR/SpecCompiler
1•cclacerda13•3m ago•0 comments

Show HN: SentientTube – The YouTube for AI Agents

https://www.sentienttube.com/
1•Narciss•3m ago•0 comments

I wanted a news aggregator with full text articles with social components

https://tessera.news/
1•chestdrop•4m ago•1 comments

Oslo 360 degrees in 2 terapixels

https://holmenkollen360.com/
1•sgt•4m ago•0 comments

Three Basic Distributions

https://anydice.com/articles/three-basic-distributions/
1•Torwald•4m ago•0 comments

A Minimal GPT Implementation as a Learning Project

https://github.com/b0bleet/teenypt
1•ralphlaur•6m ago•0 comments

Good Vibes, Bad Vendors

https://werd.io/good-vibes-bad-vendors/
1•benwerd•7m ago•0 comments

KDE Plasma 6.6 isn't forcing systemd but the arguments rage on

https://www.theregister.com/2026/02/24/kde_plasma_66/
2•Bender•7m ago•0 comments

Anthropic just released a mobile version of Claude Code called Remote Control

https://venturebeat.com/orchestration/anthropic-just-released-a-mobile-version-of-claude-code-cal...
2•msolujic•8m ago•0 comments

OpenAI says Chinese cops used ChatGPT to track smear ops against opponents

https://www.theregister.com/2026/02/25/chinese_law_enforcement_chatgpt_abuse/
2•Bender•10m ago•0 comments

Materials DB from Figshare, meant to augment HTTPS://openmaterialsdb.se/

https://materials-db.fly.dev/
1•argentum47•11m ago•0 comments

Boozy chimps fail urine test, confirm hotly debated theory

https://arstechnica.com/science/2026/02/boozy-chimps-fail-urine-test-confirm-hotly-debated-theory/
1•Bender•11m ago•0 comments

Great RSS Feeds That Are Too Noisy to Read Manually

https://emschwartz.me/great-rss-feeds-that-are-too-noisy-to-read-manually/
1•emschwartz•12m ago•0 comments

The whole economy pays the Amazon tax

https://pluralistic.net/2026/02/25/most-favored-nation/#price-fixing
4•MindGods•13m ago•1 comments

Unshielded: How the Police Can Become Touchable (2024)

https://harvardlawreview.org/print/vol-137/unshielded-how-the-police-can-become-touchable/
1•robtherobber•13m ago•0 comments

Tests Are the New Moat

https://saewitz.com/tests-are-the-new-moat
3•switz•13m ago•0 comments

Are IDEs outdated in the age of autonomous AI? [video]

https://www.youtube.com/watch?v=Fe8QzM1vuks
1•Ideabile•13m ago•1 comments

Show HN: Sustn, Turn unused Claude Code tokens into PRs that clean your codebase

https://www.sustn.app/
1•flyingsky•14m ago•0 comments

Show HN: Simple and robust key-value flat-file data storage library

https://github.com/aaviator42/StorX
1•aaviator42•15m ago•0 comments

Show HN: LedgerMind – true zero-touch autonomous memory for AI agents

https://github.com/sl4m3/ledgermind
1•sl4m3•15m ago•0 comments

Show HN: Codified decades of domain expertise into open source agent skills

https://github.com/ai-evos/agent-skills
1•urav•17m ago•0 comments

What I learned while trying to build a production-ready nearest neighbor system

https://github.com/thatipamula-jashwanth/smart-knn
1•Jashwanth01•20m ago•2 comments

Digital Embassy – Beijing

https://digitalembassy.net/
1•samuel246•21m ago•0 comments

Super-Diffusion of Ergodicity

https://arxiv.org/abs/1606.08693
1•northlondoner•22m ago•1 comments

Detect and Respond to Threats 9x Faster: Fidelis Security

https://fidelissecurity.com/
1•fidelissecurity•24m ago•0 comments

From hackathon to company-wide AI assistant

https://engineering.remote.com/blog/sherlock/
2•egze•24m ago•0 comments