
NVIDIA’s GB200 NVL72 Achieves Exascale Performance with New Job Scheduling Techniques

NVIDIA's GB200 NVL72 leverages topology-aware job scheduling to maximize GPU performance, achieving substantial gains in AI model training and inference.

GPUBeat Desk

Desk · GPUBeat Media

Published

May 21 · 18:04 ET



NVIDIA's GB200 NVL72 delivers exascale-class performance within a single rack, and job scheduling techniques tailored to its architecture help shared clusters put that performance to use. The platform targets real-time inference on trillion-parameter models, and topology-aware scheduling improves the efficiency and speed of the AI workloads that run on it.

The Power of GB200 NVL72

Central to this achievement is the GB200 NVL72, which connects 72 NVIDIA Blackwell GPUs over the high-speed NVLink fabric. This configuration provides 130 terabytes per second of low-latency GPU-to-GPU communication bandwidth for artificial intelligence and high-performance computing tasks. Recent benchmarks show a training-performance improvement of more than 2.6 times over its predecessors, and the system handles a range of AI workloads, including real-time inference and reasoning applications.
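The 130 TB/s figure is consistent with per-GPU NVLink bandwidth summed across the rack. The following is a quick back-of-the-envelope check, assuming 1.8 TB/s of NVLink bandwidth per Blackwell GPU. That per-GPU number is NVIDIA's published fifth-generation NVLink figure and is not stated in this article, so treat it as an assumption.

```python
# Back-of-the-envelope check of aggregate NVLink bandwidth in one NVL72 rack.
GPUS_PER_RACK = 72          # from the article
NVLINK_TBPS_PER_GPU = 1.8   # assumption: per-GPU NVLink bandwidth in TB/s


def aggregate_nvlink_tbps(gpus: int = GPUS_PER_RACK,
                          per_gpu_tbps: float = NVLINK_TBPS_PER_GPU) -> float:
    """Return the sum of per-GPU NVLink bandwidth across the rack, in TB/s."""
    return gpus * per_gpu_tbps


if __name__ == "__main__":
    total = aggregate_nvlink_tbps()
    print(f"{total:.1f} TB/s aggregate, roughly {round(total)} TB/s")
```

Under that assumption the product is 129.6 TB/s, which rounds to the 130 TB/s quoted above.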

Importance of Topology-aware Scheduling

To maximize the GB200 NVL72's capabilities, effective job scheduling is key. Traditional scheduling methods often fragment resources when many jobs share a cluster, leaving a single job's nodes scattered across racks. Topology-aware scheduling tackles this problem by allocating resources according to the physical network layout of the cluster. This method keeps a workload within one NVLink domain wherever possible, so that its GPU-to-GPU traffic benefits from the NVLink bandwidth available inside the rack.
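The idea of keeping a job inside one NVLink domain can be illustrated with a small placement sketch. This is not Slurm's algorithm; it is a hypothetical best-fit heuristic, and the function name `place_job`, the domain names, and the node names are all invented for illustration.

```python
from typing import Dict, List, Optional


def place_job(free_nodes: Dict[str, List[str]],
              nodes_needed: int) -> Optional[List[str]]:
    """Pick nodes for a job, preferring a single NVLink domain.

    free_nodes maps a (hypothetical) domain name to its idle nodes.
    Returns the chosen node names, or None if the job cannot fit at all.
    """
    # Best fit: choose the smallest domain that can hold the whole job,
    # so larger contiguous domains stay free for larger jobs.
    fitting = [d for d, nodes in free_nodes.items()
               if len(nodes) >= nodes_needed]
    if fitting:
        best = min(fitting, key=lambda d: (len(free_nodes[d]), d))
        return free_nodes[best][:nodes_needed]

    # Fallback: the job must span domains. Fill from the domains with the
    # most idle nodes first so that it crosses as few boundaries as possible.
    if sum(len(nodes) for nodes in free_nodes.values()) < nodes_needed:
        return None
    chosen: List[str] = []
    for d in sorted(free_nodes, key=lambda d: (-len(free_nodes[d]), d)):
        chosen.extend(free_nodes[d][:nodes_needed - len(chosen)])
        if len(chosen) == nodes_needed:
            break
    return chosen


if __name__ == "__main__":
    free = {
        "rack-a": ["a01", "a02"],
        "rack-b": ["b01", "b02", "b03", "b04"],
        "rack-c": ["c01", "c02", "c03"],
    }
    print(place_job(free, 3))  # fits entirely inside rack-c
    print(place_job(free, 6))  # has to span rack-b and rack-c
```

In this sketch a three-node job lands entirely in `rack-c`, leaving the larger `rack-b` intact, while a six-node job is forced to cross a domain boundary. A production scheduler weighs many more factors, such as priorities, backfill, and reservations.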

The longstanding Slurm topology/tree plugin provided basic scheduling capabilities, but it frequently led to job fragmentation across network switches. While this was somewhat manageable with legacy InfiniBand systems, it fell short for modern rack-scale architectures like the GB200 NVL72. To resolve this, NVIDIA, in partnership with SchedMD, has introduced a new topology/block plugin in Slurm 23.11, specifically designed for high-performance systems.

New Scheduling Strategies

The topology/block plugin significantly improves job scheduling by offering detailed information about node groupings within the same NVL72 domain. This enhancement allows for better job alignment with domain boundaries, effectively reducing resource fragmentation and increasing overall system efficiency. With this capability, Slurm can accommodate the varying bandwidth demands of multiple concurrent training jobs, making it an essential tool for optimizing GPU occupancy in shared environments.
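As a rough illustration of how those node groupings might be described to Slurm, the sketch below generates `topology.conf` lines that map each rack to one block. It assumes 18 compute nodes per NVL72 rack (four GPUs each), sequential hostnames with a hypothetical `node` prefix, and the `BlockName`, `Nodes`, and `BlockSizes` keywords as understood from Slurm's `topology.conf` documentation, with `TopologyPlugin=topology/block` set in `slurm.conf`. Verify the exact syntax against the documentation for the installed Slurm version before using it.

```python
def render_block_topology(racks: int,
                          nodes_per_rack: int = 18,
                          prefix: str = "node") -> str:
    """Render topology.conf text with one block per NVL72 rack.

    Assumptions (not from the article): 18 nodes per rack, hostnames of the
    form <prefix>NNN numbered sequentially, one Slurm block per rack.
    """
    lines = []
    for i in range(racks):
        first = i * nodes_per_rack + 1
        last = first + nodes_per_rack - 1
        lines.append(
            f"BlockName=block{i + 1:02d} "
            f"Nodes={prefix}[{first:03d}-{last:03d}]"
        )
    # The base block size matches the number of nodes in one NVLink domain.
    lines.append(f"BlockSizes={nodes_per_rack}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_block_topology(racks=2))
```

With one block per rack, the scheduler has the information it needs to align allocations with NVL72 domain boundaries rather than treating all nodes as interchangeable.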

As AI models grow larger and more complex, advanced scheduling techniques such as those used with the GB200 NVL72 will be critical for efficient resource utilization. The ongoing collaboration between hardware manufacturers like NVIDIA and software developers like SchedMD is aimed at ensuring that future computing resources can keep up with those demands.

This shift toward topology-aware scheduling not only enhances performance across current workloads but also lays the groundwork for more ambitious AI projects ahead. With infrastructure like the GB200 NVL72 and its supporting scheduling tools, the AI community can expect significant advancements in both research and application domains, paving the way for innovative solutions that were once out of reach.
