NVIDIA has unveiled a new benchmarking tool that addresses a long-standing gap in inference performance metrics for coding agents. Traditional benchmarks primarily measure a single user's interaction with a dedicated endpoint, often leading to inflated performance figures that do not accurately represent real-world conditions. In practical applications, particularly in coding environments, multiple requests occur simultaneously, competing for resources like memory bandwidth and GPU cycles. This new approach aims to evaluate how systems perform under load, providing insights essential for developers who rely on coding agents.
Understanding the Coding Agent Workload
The benchmark emphasizes the distinctive characteristics of coding agent requests. Each request carries extensive context, including the file being edited and prior conversation history. These long inputs, typically between 45,000 and 200,000 tokens, combined with the need for quick responses, make coding agents particularly sensitive to latency. Outputs are short by comparison, typically around 450 tokens, and the system must generate them efficiently even as the number of concurrent users increases.
As more developers use the coding agent at the same time, the KV cache fills, which increases scheduling pressure and reduces per-user throughput. Time to first token (TTFT), the delay from request submission to the arrival of the first output token, becomes a critical metric: any added delay can undermine user trust in the system's responsiveness.
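To make the TTFT definition concrete, here is a minimal Python sketch of how a client could time a single streamed response. It is not taken from the benchmark itself: the function name `measure_stream` and the injectable `clock` parameter are hypothetical, and the token stream stands in for whatever streaming client an engine exposes.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], int, float]:
    """Time one streamed response (hypothetical helper, not the benchmark's code).

    Returns (ttft_seconds, token_count, total_seconds). ttft_seconds is None
    if the stream produced no tokens.
    """
    start = clock()  # treated as the moment the request is submitted
    ttft: Optional[float] = None
    count = 0
    for _ in tokens:
        now = clock()
        if ttft is None:
            ttft = now - start  # delay until the first output token arrives
        count += 1
    total = clock() - start
    return ttft, count, total


def fake_clock(readings):
    """Deterministic clock for demos: returns the given readings in order."""
    it = iter(readings)
    return lambda: next(it)


# Simulated stream: first token after 0.8 s, last token at 1.0 s.
demo = measure_stream(["def", " f", "():"], fake_clock([0.0, 0.8, 0.9, 1.0, 1.0]))
print(demo)
```

Passing the clock as a parameter keeps the example deterministic; against a real endpoint the default `time.perf_counter` would be used instead.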
Benchmark Methodology
NVIDIA’s benchmarking methodology incorporates several variables to simulate realistic conditions. Each engine runs on four NVIDIA B200 GPUs and is driven with high-concurrency traffic that mirrors the request patterns seen in production coding workflows. The benchmark reports tokens per minute (TPM), tokens per second (TPS) per user, and TTFT, which together show how different engines handle increased load and what that means for user experience.
The benchmark also examines the impact of output shape. Coding agents generate short outputs such as functions rather than long-form text, which shifts throughput dynamics compared to other AI workloads. Because inputs are long and outputs are short, prefill pressure rather than sustained decode pressure is the key to understanding how these systems perform under stress.
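As a rough illustration of how the three metrics relate, the Python sketch below aggregates per-request records into TPM, per-user TPS, and median TTFT. The article does not give exact formulas, so the definitions here are assumptions: TPM counts input plus output tokens over the measurement window, and per-user TPS is output tokens divided by decode time (total time minus TTFT). The names `RequestRecord` and `summarize` are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, median
from typing import Dict, List, Optional


@dataclass
class RequestRecord:
    input_tokens: int
    output_tokens: int
    ttft_s: float   # submission -> first output token
    total_s: float  # submission -> last output token


def summarize(records: List[RequestRecord], window_s: float) -> Dict[str, Optional[float]]:
    """Aggregate metrics under the assumed definitions described above."""
    # Assumption: TPM counts both prompt and generated tokens.
    total_tokens = sum(r.input_tokens + r.output_tokens for r in records)
    tpm = total_tokens / window_s * 60.0

    # Assumption: per-user TPS is the decode rate of each request.
    decode_rates = [
        r.output_tokens / (r.total_s - r.ttft_s)
        for r in records
        if r.total_s > r.ttft_s
    ]
    return {
        "tpm": tpm,
        "tps_per_user": mean(decode_rates) if decode_rates else None,
        "median_ttft_s": median(r.ttft_s for r in records) if records else None,
    }


records = [
    RequestRecord(input_tokens=100_000, output_tokens=450, ttft_s=0.5, total_s=5.0),
    RequestRecord(input_tokens=50_000, output_tokens=450, ttft_s=1.5, total_s=10.5),
]
print(summarize(records, window_s=60.0))
```

A real harness would likely report percentiles (for example p50 and p99 TTFT) rather than a single median; the median is used here only to keep the sketch short.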
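The weight of prefill in this workload can be seen with simple arithmetic on the token counts quoted earlier (45,000 to 200,000 input tokens, about 450 output tokens). The Python sketch below computes the share of each request's tokens that are input; `input_share` is a hypothetical helper, and token share is only a proxy, since prefill and decode tokens do not cost the same amount of compute.

```python
def input_share(input_tokens: int, output_tokens: int) -> float:
    """Fraction of a request's tokens that are input (processed during prefill)."""
    return input_tokens / (input_tokens + output_tokens)


OUTPUT_TOKENS = 450  # typical output length quoted in the article

for n_in in (45_000, 200_000):
    share = input_share(n_in, OUTPUT_TOKENS)
    print(f"{n_in:>7} input tokens -> {share:.1%} of the request's tokens are input")
```

At both ends of the quoted range, roughly 99% or more of the tokens in a request are input.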
Results and Implications
Initial results from the benchmark show that the Together Inference Engine, developed by Together AI, outperforms the other engines tested. At a workload of 2.5 million tokens per minute (TPM), Together delivers 31% more TPS than the TensorRT-LLM engine and is the only system that keeps TTFT under one second. At traffic levels where all engines degrade, Together’s TTFT is about twice as good as TensorRT-LLM’s and three times as good as SGLang’s, indicating that it handles high demand with less loss of responsiveness.
The benchmark also highlights the cost-effectiveness of the Together Inference Engine, since higher efficiency translates into lower spend for teams. For instance, a 30-person engineering team using the platform at an average workload can save approximately $440,000 annually compared to using Claude Opus 4.6.
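For scale, the quoted figure can be broken down per engineer. The short Python sketch below uses only the two numbers stated above; the even split across the team and across months is an assumption made for illustration, not something the benchmark reports.

```python
ANNUAL_SAVINGS_USD = 440_000  # figure quoted in the article
TEAM_SIZE = 30                # figure quoted in the article

# Assumption: savings are spread evenly across engineers and months.
per_engineer_year = ANNUAL_SAVINGS_USD / TEAM_SIZE
per_engineer_month = per_engineer_year / 12

print(f"Per engineer per year:  ${per_engineer_year:,.2f}")
print(f"Per engineer per month: ${per_engineer_month:,.2f}")
```

Under that assumption the savings come to roughly $14,667 per engineer per year, or about $1,222 per month.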
Future Directions
NVIDIA’s release of this benchmark reflects a commitment to transparency and ongoing improvement. The company plans to update the benchmarks regularly, offering a clear record of performance enhancements and optimizations. This initiative encourages organizations running coding agents at scale to consider the implications of these findings on their operations.
This initial version marks a significant moment for AI inference benchmarks, establishing a standard for meaningful metrics based on real-world workloads. As the field evolves, the focus will remain on ensuring that benchmarks accurately reflect the complexities of actual usage scenarios, allowing developers to make informed decisions as they integrate AI into their workflows.



