[Technical Explainer] Project PRIMAL: Shadowless 4-Bit Training

I’ve spent the last few weeks trying to solve the "Shadow Weight Tax." Standard Quantization-Aware Training (QAT) is often a memory lie: it claims to be 4-bit, but it keeps a hidden FP32 master copy to accumulate gradients. That inflates VRAM requirements (an FP32 copy is 8x the size of the 4-bit weights it shadows) and kills the dream of training on consumer-grade legacy hardware like the GTX 1080 Ti.

Project PRIMAL is an attempt to delete shadow weights entirely and train a 0.1B-parameter model directly on a discrete integer grid.

1. The Core Hack: The Prime-Harmonic Grid

Instead of linear INT4, I’ve implemented a 13-value grid derived from prime reciprocals ($\pm 1, \pm 1/2, \pm 1/3 \dots$).

The Logic: The grid points cluster around zero, so precision is concentrated where most weights live. This gives a "natural" bell curve that mimics the weight distribution of dense models without the overhead of floating-point math.

Efficiency: Each weight is stored as a 4-bit nibble, which allows for high-precision "Fine" layers without the memory footprint of FP16.

2. The "Poltergeist" Optimizer

To solve Stochastic Thrashing (the oscillation caused by gradients being smaller than the discrete weight steps), I developed Decoupled Flipping:

Vote Buffering: Gradients cast a "Vote" into an int8 buffer rather than touching the weights.

Consensus: We only flip a discrete bit once the buffer shows a strong net signal (e.g., +4 net votes over multiple micro-batches).

Adaptive Probability: Weight magnitude determines "flip-resistance," stabilizing the model as it converges.

3. Telemetry on a GTX 1080 Ti (11 GB):

Throughput: ~5,800 to 6,000 tokens/sec.

VRAM Health: 10.37 GB (94% saturation) with zero leakage over 18+ hours.

Semantic Emergence: The model has progressed from random noise to forming complex concepts like "evaluate video system damages" at Step 11,650.

4. Process & AI Disclosure

I am a solo researcher using AI as a force multiplier. I collaborated with Gemini 3 Flash for CUDA kernel refinement, documentation structuring, and logic stress-testing.
The "Poltergeist" consensus logic and the Prime-Harmonic math are the core of the research, while AI assisted in accelerating the low-level implementation.
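
To make the grid from section 1 concrete, here is a minimal Python sketch. It is not the project's actual kernel. It assumes one possible reading of the 13 values, namely zero plus $\pm 1/d$ for $d \in \{1, 2, 3, 5, 7, 11\}$; the post only lists the first three magnitudes, so the remaining denominators are a guess. The names `build_grid`, `quantize`, `pack_nibbles`, and `unpack_nibbles` are hypothetical.

```python
# ASSUMPTION: the 13-value grid is 0 plus +/-1/d for these denominators.
# The post only shows +/-1, +/-1/2, +/-1/3; the rest is a guess.
DENOMINATORS = (1, 2, 3, 5, 7, 11)


def build_grid():
    """Return the sorted list of 13 grid values."""
    positives = [1.0 / d for d in DENOMINATORS]
    return sorted([-v for v in positives] + [0.0] + positives)


GRID = build_grid()


def quantize(w):
    """Return the 4-bit code (grid index) of the grid value nearest to w."""
    return min(range(len(GRID)), key=lambda i: abs(GRID[i] - w))


def pack_nibbles(codes):
    """Pack two 4-bit codes per byte, low nibble first."""
    codes = list(codes)
    if len(codes) % 2:
        codes.append(0)  # padding nibble, dropped again on unpack
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))


def unpack_nibbles(data, count):
    """Inverse of pack_nibbles: recover the first `count` codes."""
    out = []
    for byte in data:
        out.append(byte & 0x0F)
        out.append(byte >> 4)
    return out[:count]
```

Under this reading, 13 of the 16 available nibble codes are used, and the spacing between neighbouring values shrinks from 0.5 at the edges to about 0.05 next to zero.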
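
The vote-and-consensus idea from section 2 can be sketched in the same spirit. The details below are assumptions, not the Poltergeist implementation: a vote is taken to be the negative sign of the gradient, the buffer saturates at the int8 range, the threshold of 4 comes from the "+4 net votes" example, and a "flip" is modelled as moving one step along the grid. The `resistance` parameter is a hypothetical stand-in for the adaptive flip-resistance, which the post does not specify.

```python
import random

# Same assumed 13-value grid as before (0 plus +/-1/d); a guess, not confirmed.
GRID = sorted([s / d for d in (1, 2, 3, 5, 7, 11) for s in (-1.0, 1.0)] + [0.0])
THRESHOLD = 4  # net votes required before a weight may move


def vote_step(codes, votes, grads, resistance=0.0, rng=random):
    """Update grid codes in place from one micro-batch of gradients.

    codes: list of grid indices (the only copy of the weights)
    votes: list of ints kept within the int8 range [-127, 127]
    grads: list of float gradients, one per weight
    """
    for i, g in enumerate(grads):
        if g == 0.0:
            continue
        # Descent direction: a positive gradient votes to decrease the weight.
        direction = -1 if g > 0 else 1
        votes[i] = max(-127, min(127, votes[i] + direction))
        if abs(votes[i]) < THRESHOLD:
            continue  # no consensus yet, weight untouched
        # HYPOTHETICAL flip-resistance: larger |w| is less likely to move.
        p_flip = 1.0 - resistance * abs(GRID[codes[i]])
        if rng.random() < p_flip:
            step = 1 if votes[i] > 0 else -1
            codes[i] = max(0, min(len(GRID) - 1, codes[i] + step))
        votes[i] = 0  # reset after each consensus attempt
```

With this scheme, gradients that alternate in sign cancel inside the buffer and never reach the weights, which is the thrashing case the optimizer is meant to suppress. Only a consistent signal across several micro-batches moves a weight.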