
ZFLOW AI Achieves 1.54× Throughput Boost on NVIDIA B300 for DeepSeek V4-Pro

ZFLOW AI's recent optimization on NVIDIA's B300 platform has led to a 1.54× increase in throughput for DeepSeek V4-Pro, showcasing advanced simulation techniques in AI deployment.

GPUBeat Desk

Desk · GPUBeat Media

Published

May 22 · 16:49 ET


ZFLOW AI reports a 1.54× increase in throughput for the DeepSeek V4-Pro model deployed on PaleBlueDot AI's 8×NVIDIA B300 platform. The company used a simulation-guided approach to develop an optimized serving configuration on the SGLang stack, described as the first publicly reported result of its kind for this model on NVIDIA's production hardware.

ZFLOW AI is positioning itself as a key player in AI infrastructure by creating a neutral optimization and control layer that functions between serving runtimes and decision-making processes. This framework helps infrastructure teams identify the most cost-effective and high-performance methods for executing workloads across various hardware setups.

Optimizing AI Workloads with Advanced Simulation

In this instance, ZFLOW AI concentrated on the DeepSeek V4-Pro model, using SGLang and EAGLE speculative decoding. The analysis examined serving architecture tradeoffs, throughput under high concurrency, and latency. The optimized configuration reached a peak throughput of 826 tokens per second, well ahead of the traditional monolithic setup. The disaggregated configuration excelled under high concurrency, while the monolithic path kept its advantage for single-stream workloads that required extensive context processing.
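To put the two headline numbers side by side, the short Python sketch below back-computes the baseline throughput they imply. It assumes the 1.54× ratio compares the 826 tokens-per-second peak directly against the monolithic setup, which the article does not state explicitly; the function name `implied_baseline` is illustrative only.

```python
def implied_baseline(peak_tokens_per_s: float, speedup: float) -> float:
    """Return the baseline throughput implied by a peak figure and a speedup ratio."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return peak_tokens_per_s / speedup


peak = 826.0    # tokens/s, as reported in the article
speedup = 1.54  # throughput ratio, as reported in the article

# Assumption: both figures describe the same benchmark run and baseline.
baseline = implied_baseline(peak, speedup)
gain = peak - baseline

print(f"implied baseline: {baseline:.1f} tokens/s")
print(f"absolute gain:    {gain:.1f} tokens/s")
```

If the ratio was instead measured at a different concurrency level than the peak, the implied baseline would not apply, so the result is best read as a rough consistency check rather than a reported measurement.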

Dr. Zhibin Xiao, Founder and CEO of ZFLOW AI, commented on the evolution of inference optimization, stating, “Modern inference optimization is moving beyond manual tuning of individual runtime knobs.” The remark points to a shift toward a more integrated approach that combines real workload execution with hardware simulation and optimization strategies.
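The general idea of combining simulation with real execution can be sketched in a few lines of Python. This is not ZFLOW AI's method, which the article does not detail; it is a minimal illustration in which a cheap simulator ranks candidate serving configurations and only the top few are measured on hardware. The function names, configuration keys, and scoring functions are all hypothetical, and the toy scores are placeholders rather than measurements.

```python
from typing import Callable, Dict, List

Config = Dict[str, object]


def simulation_guided_search(
    candidates: List[Config],
    simulate: Callable[[Config], float],
    measure: Callable[[Config], float],
    top_k: int = 2,
) -> Config:
    """Rank candidates with a cheap simulator, then measure only the top_k for real."""
    if not candidates:
        raise ValueError("no candidates")
    shortlisted = sorted(candidates, key=simulate, reverse=True)[:top_k]
    return max(shortlisted, key=measure)


# Hypothetical candidate space: serving mode and speculative decoding on/off.
candidates: List[Config] = [
    {"mode": "monolithic", "speculative": False},
    {"mode": "monolithic", "speculative": True},
    {"mode": "disaggregated", "speculative": False},
    {"mode": "disaggregated", "speculative": True},
]

measured: List[Config] = []  # records which configs reached the "real" run


def toy_simulate(cfg: Config) -> float:
    # Placeholder score, NOT a real performance model.
    base = 2.0 if cfg["mode"] == "disaggregated" else 1.0
    return base + (0.5 if cfg["speculative"] else 0.0)


def toy_measure(cfg: Config) -> float:
    # Stand-in for an expensive benchmark on actual hardware.
    measured.append(cfg)
    return toy_simulate(cfg) * 0.9


best = simulation_guided_search(candidates, toy_simulate, toy_measure, top_k=2)
print(best, "measured on hardware:", len(measured))
```

The design point is that the expensive `measure` step runs on only `top_k` configurations instead of the whole search space, which is what makes simulation useful as a filter.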

Implications for Future Deployments

ZFLOW AI's findings suggest that a two-node B300 configuration could be a viable option for future production deployments. The next step is to validate that configuration on actual hardware. The team is developing full closed-loop auto-optimization capabilities for DeepSeek V4-Pro on the B300 and plans to release a detailed Technical Insights blog covering its findings, particularly MTP/EAGLE optimization and multi-node deployment strategies.

For organizations interested in the capabilities of DeepSeek V4-Pro or other advanced models on the B300 or similar GPU platforms, ZFLOW AI is open to discussions about optimizing specific workloads.

Looking Ahead

As AI infrastructure continues to evolve, ZFLOW AI's work points to a future in which serving optimization is increasingly automated, letting teams get more performance from their hardware without being tied to a specific vendor.


