I built a toolkit to fine-tune LLMs using LoRA + native 4-bit quantization on NVIDIA's new Blackwell GPUs (DGX Spark with GB10).
Key features:
- NVFP4 (4-bit) via Transformer Engine - fastest option
- MXFP8 (8-bit) for higher precision
- bitsandbytes FP4 fallback for any CUDA GPU (see the sketch after this list)
- ~240MB LoRA adapters instead of ~6GB full models
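For the fallback path, the flow is the standard bitsandbytes + PEFT recipe. Here's a rough sketch, not the toolkit's exact API; the checkpoint name and LoRA hyperparameters (r=16, attention projections) are illustrative placeholders:

```python
# Minimal sketch of the bitsandbytes FP4 fallback (works on any CUDA GPU).
# Checkpoint name and LoRA hyperparameters are illustrative, not the toolkit's defaults.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="fp4",            # plain FP4 (vs. "nf4")
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM3-3B",           # assumed checkpoint name
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # only the small LoRA adapter trains
```

Saving with `model.save_pretrained(...)` then writes only the adapter tensors, which is why the artifact lands in the hundreds of MB instead of the multi-GB full model.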
Tested on DGX Spark (128GB unified memory). Training SmolLM3-3B peaks at ~70GB with NVFP4.
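On Blackwell, the NVFP4/MXFP8 path goes through Transformer Engine's autocast + recipe mechanism. A minimal sketch below, assuming a recent TE 2.x release; the recipe class names (especially the NVFP4 one) vary across versions, so treat them as assumptions:

```python
# Minimal sketch of Transformer Engine block-scaled low-precision training.
# Recipe class names assume a recent TE 2.x release; check your installed version.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# TE layer (not torch.nn.Linear); dims must be FP8/FP4 GEMM-friendly
layer = te.Linear(2048, 2048, bias=False, params_dtype=torch.bfloat16).cuda()
x = torch.randn(32, 2048, device="cuda", dtype=torch.bfloat16, requires_grad=True)

fp8_recipe = recipe.MXFP8BlockScaling()   # 8-bit; on GB10-class hardware swap for
                                          # the NVFP4 recipe (assumed name:
                                          # recipe.NVFP4BlockScaling)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)                          # forward matmul runs in the low-bit format
y.sum().backward()                        # backward reuses the recipe's scaling
```

Note that the autocast only affects TE modules, so the quantized matmuls apply wherever `te.Linear` (or `te.TransformerLayer`) replaces the stock layers.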