NVIDIA Breaks New Ground With 4-Bit LLM Pretraining Methodology

NVIDIA's new NVFP4 methodology for training large language models at 4-bit precision achieves near-FP8 performance, setting a new standard in AI infrastructure.

NVIDIA has unveiled a pretraining methodology built on the NVFP4 format that trains large language models (LLMs) at 4-bit precision. This is a notable step for AI infrastructure, since low-precision training has so far relied mostly on 8-bit floating-point formats. The company's research reports training a 12-billion-parameter hybrid Mamba-Transformer on 10 trillion tokens, marking the longest documented training run at this precision level.

The NVFP4 format, designed for NVIDIA’s Blackwell Tensor Cores, is a microscaling format: low-precision elements are grouped into small blocks, and each block shares a single scale factor. Scaling per block lets 4-bit elements cover a wide dynamic range while limiting quantization error, a common challenge over long token horizons. The research team reported that the resulting model scored 62.58% on the MMLU-Pro 5-shot benchmark, close to the FP8 baseline's 62.62%.
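To make the shared-scale idea concrete, here is a minimal pure-Python sketch. It is not NVIDIA's implementation: the helper names (`round_to_fp4`, `fake_quantize`) and the test data are illustrative assumptions, and the scale is kept as an ordinary float rather than being stored in a low-precision format. It quantizes the same data once with a single scale for the whole tensor and once with one scale per 16-element block.

```python
import random

# Magnitudes representable by FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_MAX = 6.0

def round_to_fp4(x):
    """Round x to the nearest FP4 magnitude (ties go to the smaller one)."""
    mag = min(FP4_GRID, key=lambda g: abs(abs(x) - g))
    return -mag if x < 0 else mag

def fake_quantize(values, block_size):
    """Quantize then dequantize, sharing one scale per block (hypothetical helper)."""
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        scale = amax / FP4_MAX if amax > 0 else 1.0  # maps the block max to 6.0
        out.extend(round_to_fp4(v / scale) * scale for v in block)
    return out

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(256)]
data[7] = 200.0  # a single outlier, made up for illustration

per_tensor = fake_quantize(data, block_size=len(data))  # one scale for everything
per_block = fake_quantize(data, block_size=16)          # one scale per 16 elements

err_tensor = mean_abs_error(data, per_tensor)
err_block = mean_abs_error(data, per_block)
print(f"per-tensor error: {err_tensor:.3f}, per-block error: {err_block:.3f}")
```

With a single scale, the outlier forces every ordinary value to round to zero. With per-block scales, only the block that contains the outlier pays that cost.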

Understanding NVFP4's Innovations

NVFP4 makes three core changes to the microscaling format. First, the block size is reduced from 32 to 16 elements, so each scale factor has to cover a narrower dynamic range. Second, NVFP4 stores block scale factors in E4M3, trading exponent range for mantissa precision. The more precise scale lets the maximum value in each block map closer to the largest value representable in FP4. Third, NVFP4 adds a second scaling level, an FP32 per-tensor scale, which keeps the block scales within range.
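The three changes can be sketched together in a few lines of Python. This is an illustration under stated assumptions, not NVIDIA's kernel: the function names are hypothetical, Python floats stand in for FP32, the E4M3 rounding is a simplified emulation (maximum 448, three mantissa bits), and the choice of tensor scale (tensor maximum divided by 6 × 448) is one plausible way to keep block scales inside E4M3's range.

```python
import math
import random

FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # FP4 E2M1 magnitudes
FP4_MAX, E4M3_MAX, BLOCK = 6.0, 448.0, 16

def round_to_fp4(x):
    """Nearest FP4 value; anything beyond 6 saturates to 6."""
    mag = min(FP4_GRID, key=lambda g: abs(abs(x) - g))
    return -mag if x < 0 else mag

def round_to_e4m3(x):
    """Simplified emulation of rounding a non-negative float to FP8 E4M3."""
    if x <= 0.0:
        return 0.0
    x = min(x, E4M3_MAX)
    exp = max(math.floor(math.log2(x)), -6)  # -6 is the smallest normal exponent
    step = 2.0 ** (exp - 3)                  # spacing given 3 mantissa bits
    return min(round(x / step) * step, E4M3_MAX)

def nvfp4_fake_quantize(values):
    """Quantize then dequantize with a per-tensor scale and per-block E4M3 scales."""
    amax_tensor = max(abs(v) for v in values)
    # Second-level scale (FP32 in hardware, a Python float here).
    tensor_scale = amax_tensor / (FP4_MAX * E4M3_MAX) if amax_tensor > 0 else 1.0
    out = []
    for start in range(0, len(values), BLOCK):
        block = values[start:start + BLOCK]
        amax_block = max(abs(v) for v in block)
        # First-level scale, stored in E4M3 relative to the tensor scale.
        block_scale = round_to_e4m3((amax_block / FP4_MAX) / tensor_scale)
        effective = block_scale * tensor_scale
        if effective == 0.0:
            out.extend(0.0 for _ in block)
        else:
            out.extend(round_to_fp4(v / effective) * effective for v in block)
    return out, tensor_scale

random.seed(1)
x = [random.gauss(0.0, 1.0) for _ in range(64)]
x[5] = 50.0  # made-up outlier
y, tensor_scale = nvfp4_fake_quantize(x)
```

Because the block scale keeps three mantissa bits, the largest value in each block is reconstructed to within a few percent in this sketch, which is the effect the E4M3 choice is meant to achieve.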

These changes enable high throughput on NVIDIA Blackwell: up to 4× over BF16 on the GB200 and up to 6× on the GB300. That is roughly double the speed of FP8, with about half the operand memory footprint.
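The memory claim can be checked with simple arithmetic. The sketch below assumes each block carries one 8-bit scale factor (E4M3 is an 8-bit format) and ignores the per-tensor FP32 scale, which is negligible for large tensors. The function name is an illustrative assumption.

```python
def bits_per_element(element_bits, scale_bits, block_size):
    """Average storage per element once the shared block scale is amortized."""
    return element_bits + scale_bits / block_size

nvfp4_bits = bits_per_element(4, 8, 16)  # 4.5 bits per element
mx32_bits = bits_per_element(4, 8, 32)   # 4.25 bits with the older 32-element blocks
fp8_bits = 8.0
bf16_bits = 16.0

print(nvfp4_bits / fp8_bits)   # 0.5625, i.e. a little over half of FP8
print(nvfp4_bits / bf16_bits)  # 0.28125
```

So "halving" is a close approximation: the smaller 16-element blocks cost an extra quarter bit per element compared with 32-element blocks.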

Training Methodology and Stability

NVIDIA also outlined a four-part training methodology needed to use NVFP4 effectively. An initial attempt to quantize all linear-layer GEMMs to NVFP4 diverged early in training. The team identified four components that stabilize the training process:

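  1. Selective High Precision: Keeping the linear layers in the first two and final eight of the model's 62 blocks in BF16 ensures sufficient dynamic range during training.
  2. Random Hadamard Transforms (RHT): This technique spreads outliers in the weight-gradient computation across many elements, producing a more Gaussian-like distribution that quantizes with less error and improves convergence. NVIDIA found that a transform size of d=16 worked well.
  3. Two-Dimensional Block Scaling: Weights are scaled in 16×16 blocks so that the forward and backward passes see the same quantized values.
  4. Stochastic Rounding: Gradients are rounded stochastically rather than to the nearest value, which removes systematic rounding bias.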
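The first two components in the list above can be sketched in plain Python. Everything here is illustrative: the function names are hypothetical, the gradient vector is made up, and the transform convention (random sign flips followed by a normalized 16×16 Hadamard matrix) is one common formulation rather than a confirmed detail of NVIDIA's kernels.

```python
import random

def bf16_block_indices(num_blocks=62, first=2, last=8):
    """Blocks whose linear layers stay in BF16: the first 2 and last 8 of 62."""
    return set(range(first)) | set(range(num_blocks - last, num_blocks))

def hadamard(d):
    """Sylvester construction of a d x d Hadamard matrix (d a power of two)."""
    h = [[1]]
    while len(h) < d:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h

def random_hadamard_transform(vec, signs, d=16):
    """Apply (1/sqrt(d)) * H * diag(signs) to each length-d chunk of vec."""
    h = hadamard(d)
    norm = d ** -0.5
    out = []
    for start in range(0, len(vec), d):
        chunk = [s * v for s, v in zip(signs, vec[start:start + d])]
        out.extend(norm * sum(h[i][j] * chunk[j] for j in range(d)) for i in range(d))
    return out

random.seed(0)
signs = [random.choice([-1, 1]) for _ in range(16)]

grad = [0.1] * 16
grad[3] = 16.0  # one large outlier in an otherwise small block
spread = random_hadamard_transform(grad, signs)

print(sorted(bf16_block_indices()))
print(max(abs(v) for v in grad), max(abs(v) for v in spread))
```

The transform is orthogonal, so applying it to both operands of a dot product leaves the result unchanged while the outlier's magnitude is shared across all 16 positions. That makes the block easier to represent on the coarse FP4 grid.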

NVIDIA also notes that the NVFP4 design represents at least the largest value in each 16-element block (about 6.25% of values) at near-FP8 precision, while the remaining values stay in FP4. This balance helps maintain stability throughout the training process.

Implications for the Future of AI

NVFP4 and its training methodology improve the computational efficiency of training large language models and give future quantization research a reference point. As AI models grow more complex, advances like these will matter for ensuring that performance does not come at the expense of resource efficiency. The impact of this research reaches beyond NVIDIA and could lead to more efficient training methodologies across the industry.

GPUBeat Desk
