I built a toolkit to fine-tune LLMs using LoRA + native 4-bit quantization on NVIDIA's new Blackwell GPUs (DGX Spark with GB10).
Key features:
- NVFP4 (4-bit) via Transformer Engine - fastest option
- MXFP8 (8-bit) for higher precision
- bitsandbytes FP4 fallback for any CUDA GPU (see the sketch after this list)
- ~240MB LoRA adapters instead of ~6GB full models
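For the fallback path, the flow is the standard bitsandbytes + PEFT recipe. Here's a rough sketch, not the toolkit's exact API; the checkpoint name and LoRA hyperparameters (r=16, attention projections) are illustrative placeholders:

```python
# Minimal sketch of the bitsandbytes FP4 fallback (works on any CUDA GPU).
# Checkpoint name and LoRA hyperparameters are illustrative, not the toolkit's defaults.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="fp4",            # plain FP4 (vs. "nf4")
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM3-3B",           # assumed checkpoint name
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # only the small LoRA adapter trains
```

Saving with `model.save_pretrained(...)` then writes only the adapter tensors, which is why the artifact lands in the hundreds of MB instead of the multi-GB full model.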
Tested on DGX Spark (128GB unified memory). Training SmolLM3-3B peaks at ~70GB with NVFP4.
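On Blackwell, the NVFP4/MXFP8 path goes through Transformer Engine's autocast + recipe mechanism. A minimal sketch below, assuming a recent TE 2.x release; the recipe class names (especially the NVFP4 one) vary across versions, so treat them as assumptions:

```python
# Minimal sketch of Transformer Engine block-scaled low-precision training.
# Recipe class names assume a recent TE 2.x release; check your installed version.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# TE layer (not torch.nn.Linear); dims must be FP8/FP4 GEMM-friendly
layer = te.Linear(2048, 2048, bias=False, params_dtype=torch.bfloat16).cuda()
x = torch.randn(32, 2048, device="cuda", dtype=torch.bfloat16, requires_grad=True)

fp8_recipe = recipe.MXFP8BlockScaling()   # 8-bit; on GB10-class hardware swap for
                                          # the NVFP4 recipe (assumed name:
                                          # recipe.NVFP4BlockScaling)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)                          # forward matmul runs in the low-bit format
y.sum().backward()                        # backward reuses the recipe's scaling
```

Note that the autocast only affects TE modules, so the quantized matmuls apply wherever `te.Linear` (or `te.TransformerLayer`) replaces the stock layers.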