NVIDIA has unveiled a pretraining methodology built on the NVFP4 format that allows large language models (LLMs) to be trained in 4-bit precision. This is a significant advance for AI infrastructure, since low-precision training has so far mostly relied on 8-bit floating-point (FP8) formats. The company's research reports the successful training of a 12-billion-parameter hybrid Mamba-Transformer on 10 trillion tokens, which it describes as the longest documented training run at this precision level.
The NVFP4 format, designed for NVIDIA’s Blackwell Tensor Cores, is a microscaling format: small blocks of low-precision elements share a single scale factor. This lets each block make efficient use of the narrow FP4 dynamic range and reduces quantization error, a common challenge over long token horizons. The research team reported that the resulting model scored 62.58% on the MMLU-Pro 5-shot benchmark, close behind the FP8 baseline's 62.62%.
Understanding NVFP4's Innovations
NVFP4 makes three core changes to the microscaling format. First, the block size is reduced from 32 to 16 elements, which narrows the dynamic range that each scale factor has to cover. Second, NVFP4 stores block scale factors in E4M3, trading exponent range for mantissa precision. As a result, the maximum value in each block can be mapped closer to the largest value representable in FP4. Third, NVFP4 adds a second scaling level, an FP32 per-tensor scale, which keeps the block scales within the range E4M3 can represent.
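The two-level scaling described above can be sketched in NumPy. This is a float simulation under stated assumptions, not the Blackwell hardware path: the helper names (`round_to_e4m3`, `round_to_fp4`, `nvfp4_quantize`, `nvfp4_dequantize`) are hypothetical, the scale recipe (block maximum mapped to the FP4 maximum of 6, largest block scale mapped to the E4M3 maximum of 448) is one plausible choice, and ties in FP4 rounding go to the smaller magnitude for simplicity.

```python
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes
FP4_MAX, E4M3_MAX, BLOCK = 6.0, 448.0, 16

def round_to_e4m3(x):
    """Round non-negative values to the nearest E4M3 value (simulated in float64)."""
    x = np.asarray(x, dtype=np.float64)
    _, exp = np.frexp(x)                 # x = m * 2**exp with m in [0.5, 1)
    e = np.clip(exp - 1, -6, 8)          # exponent range, subnormals share e = -6
    step = np.ldexp(1.0, e - 3)          # spacing given 3 mantissa bits
    return np.minimum(np.round(x / step) * step, E4M3_MAX)

def round_to_fp4(x):
    """Round to the nearest signed E2M1 value (ties go to the smaller magnitude)."""
    idx = np.abs(np.abs(x)[..., None] - FP4_GRID).argmin(axis=-1)
    return np.sign(x) * FP4_GRID[idx]

def nvfp4_quantize(x):
    """Quantize a tensor whose size is a multiple of 16.

    Returns (FP4 values stored as floats, E4M3 block scales, FP32 tensor scale).
    """
    blocks = np.asarray(x, dtype=np.float64).reshape(-1, BLOCK)
    tensor_amax = np.abs(blocks).max()
    # Second level: FP32 per-tensor scale, chosen so the largest block scale is 448.
    tensor_scale = np.float32(tensor_amax / (FP4_MAX * E4M3_MAX)) if tensor_amax > 0 else np.float32(1.0)
    # First level: per-block scale maps each block's maximum onto the FP4 maximum.
    block_amax = np.abs(blocks).max(axis=1, keepdims=True)
    block_scale = round_to_e4m3(block_amax / FP4_MAX / tensor_scale)
    decode = block_scale * tensor_scale
    safe = np.where(decode == 0, 1.0, decode)   # all-zero blocks stay zero
    q = round_to_fp4(np.clip(blocks / safe, -FP4_MAX, FP4_MAX))
    return q, block_scale, tensor_scale

def nvfp4_dequantize(q, block_scale, tensor_scale):
    """Reconstruct approximate values from FP4 codes and both scale levels."""
    return q * block_scale * tensor_scale

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
x[5] = 40.0                                      # a single outlier
q, block_scale, tensor_scale = nvfp4_quantize(x)
x_hat = nvfp4_dequantize(q, block_scale, tensor_scale).reshape(-1)
```

In this sketch the outlier only degrades the 16 values that share its block; the other three blocks get scale factors fitted to their own, much smaller, maxima.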
These changes let NVFP4 reach high throughput on NVIDIA Blackwell: up to 4× over BF16 on the GB200 and 6× on the GB300. That is roughly double the throughput of FP8, with about half the operand memory footprint.
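The memory claim can be checked with a little arithmetic. The sketch below assumes 4-bit elements plus one 8-bit E4M3 scale per 16-element block, ignores the single FP32 per-tensor scale (negligible for large tensors), and compares against FP8 without counting any FP8 scaling metadata; `bits_per_element` is a hypothetical helper.

```python
def bits_per_element(element_bits, scale_bits, block_size):
    """Average storage per element, amortizing one scale over each block."""
    return element_bits + scale_bits / block_size

nvfp4_bits = bits_per_element(4, 8, 16)   # 4.5 bits per element
fp8_bits = 8.0
bf16_bits = 16.0

ratio_vs_fp8 = nvfp4_bits / fp8_bits      # 0.5625
ratio_vs_bf16 = nvfp4_bits / bf16_bits    # 0.28125
```

Under these assumptions the footprint is slightly more than half that of FP8, because the block scales add half a bit per element.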
Training Methodology and Stability
NVIDIA also outlined the training methodology needed to use NVFP4 effectively. An initial attempt to quantize all linear-layer GEMMs to NVFP4 diverged early in training, and the team identified four key components that stabilize the process:
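- Selective High Precision: Keeping the linear layers in BF16 for the first two and final eight of the model's 62 blocks preserves enough dynamic range in the most sensitive layers.
- Random Hadamard Transforms (RHT): This technique spreads outliers in the weight-gradient computation across a block, making the distribution closer to Gaussian and easier to quantize, which improves convergence. The team reported good results with a transform size of d=16.
- Two-Dimensional (2D) Block Scaling: Weights are scaled in 16×16 blocks so that the forward and backward passes see the same quantized values.
- Stochastic Rounding: Gradients are rounded stochastically rather than to the nearest value, which avoids the systematic bias that deterministic rounding would introduce.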
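The Random Hadamard Transform from the list above can be illustrated with a short NumPy sketch. It assumes a Sylvester-constructed 16×16 Hadamard matrix with a random sign flip per row, applied along the reduction dimension of a matrix product; the function names are hypothetical, and a real kernel would apply the transform tile by tile rather than to a single 16-wide dimension.

```python
import numpy as np

def hadamard(d):
    """Normalized Hadamard matrix via the Sylvester construction (d a power of two)."""
    h = np.array([[1.0]])
    while h.shape[0] < d:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(d)

def random_hadamard(d, rng):
    """Hadamard matrix with a random sign per row; the result is orthogonal."""
    signs = rng.choice([-1.0, 1.0], size=d)
    return np.diag(signs) @ hadamard(d)

rng = np.random.default_rng(0)
d = 16
H = random_hadamard(d, rng)

a = rng.standard_normal((4, d))
a[0, 3] = 50.0                      # one large outlier in the first row
b = rng.standard_normal((d, 8))

# Rotate both operands along the shared dimension before quantizing them.
a_rot = a @ H
b_rot = H.T @ b

# Because H @ H.T is the identity, the product itself is unchanged.
product_preserved = np.allclose(a_rot @ b_rot, a @ b)

# The outlier's energy is spread over all 16 entries of the row.
max_before = np.abs(a[0]).max()
max_after = np.abs(a_rot[0]).max()
```

Since the rotation cancels in exact arithmetic, its only effect is on what the quantizer sees: a row with a smaller peak value, so the block scale no longer has to accommodate a single extreme entry.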
With these techniques, NVIDIA shows that a portion of the values in each NVFP4 block (at least the block maximum, which the E4M3 scale maps close to the top of the FP4 range) is represented at near-FP8 precision, while the rest remain in FP4. NVIDIA presents this balance as important for keeping training stable.
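The selective high-precision rule for the 62-block model can be written down as a simple per-block precision plan. This is only an illustration of the first-two and final-eight split quoted above; `precision_plan` and the string labels are hypothetical and do not correspond to any particular framework's API.

```python
def precision_plan(num_blocks=62, first_bf16=2, last_bf16=8):
    """Return the precision label for the linear layers of each block."""
    plan = []
    for i in range(num_blocks):
        keep_high = i < first_bf16 or i >= num_blocks - last_bf16
        plan.append("bf16" if keep_high else "nvfp4")
    return plan

plan = precision_plan()
num_bf16 = plan.count("bf16")     # 10 blocks stay in BF16
num_nvfp4 = plan.count("nvfp4")   # 52 blocks use NVFP4
```

With these defaults, 52 of the 62 blocks run their linear layers in NVFP4, so most of the GEMM work still benefits from the 4-bit format.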
Implications for the Future of AI
The introduction of NVFP4 and its training methodology not only improves the computational efficiency of training large language models but also sets a reference point for future research on quantization techniques. As AI models grow more complex, advances like these will be critical to ensuring that gains in capability do not come at the expense of resource efficiency. The impact of this research reaches beyond NVIDIA, potentially shaping the broader field of AI infrastructure and leading to more efficient training methodologies across the industry.



