A recent optimization effort by ZFLOW AI delivered a 1.54× increase in throughput for the DeepSeek V4-Pro model deployed on PaleBlueDot AI's 8×NVIDIA B300 platform. The gain came from a simulation-guided approach to developing an optimized serving configuration on the SGLang stack, and it is described as the first publicly reported result of its kind for this model on NVIDIA's production hardware.
ZFLOW AI is positioning itself as a key player in AI infrastructure by building a neutral optimization and control layer that sits between serving runtimes and decision-making processes. The layer helps infrastructure teams identify the most cost-effective, highest-performing way to run workloads across different hardware setups.
Optimizing AI Workloads with Advanced Simulation
In this case, ZFLOW AI focused on DeepSeek V4-Pro served with SGLang and EAGLE speculative decoding. The analysis examined serving-architecture tradeoffs (disaggregated versus monolithic), throughput under high concurrency, and latency. The optimized configuration reached a peak throughput of 826 tokens per second, outperforming the traditional monolithic setup. The disaggregated configuration excelled under high concurrency, while the monolithic path kept its advantage for single-stream workloads that require extensive context processing.
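To put the two reported numbers side by side, here is a minimal back-of-the-envelope sketch. It assumes the 1.54× figure is measured against the monolithic baseline at the same operating point as the 826 tokens per second peak, which the announcement does not state explicitly, so the implied baseline is an estimate rather than a published measurement. The function name is hypothetical.

```python
def implied_baseline_tps(optimized_tps: float, speedup: float) -> float:
    """Back out the baseline throughput implied by an optimized result and a speedup ratio."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return optimized_tps / speedup


# Figures reported in the announcement.
PEAK_TPS = 826.0   # peak throughput of the optimized configuration, tokens/s
SPEEDUP = 1.54     # reported throughput increase

# ASSUMPTION: both figures refer to the same comparison point.
baseline_tps = implied_baseline_tps(PEAK_TPS, SPEEDUP)
gain_tps = PEAK_TPS - baseline_tps

print(f"implied baseline: {baseline_tps:.0f} tokens/s")
print(f"implied gain:     {gain_tps:.0f} tokens/s ({(SPEEDUP - 1) * 100:.0f}% increase)")
```

Under that assumption the baseline works out to roughly 536 tokens per second. If the 1.54× figure was instead measured at a different concurrency level, this estimate does not apply.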
Dr. Zhibin Xiao, Founder and CEO of ZFLOW AI, commented on the evolution of inference optimization: “Modern inference optimization is moving beyond manual tuning of individual runtime knobs.” The remark points to a shift toward a more integrated approach that combines real workload execution with hardware simulation and optimization strategies.
Implications for Future Deployments
ZFLOW AI's findings suggest that a two-node B300 configuration could be a viable option for future production deployments. The next step is to validate that configuration on actual hardware. The team is also developing full closed-loop auto-optimization for DeepSeek V4-Pro on the B300 and plans to publish a detailed Technical Insights blog covering its findings, particularly on MTP/EAGLE optimization and multi-node deployment strategies.
For organizations interested in the capabilities of DeepSeek V4-Pro or other advanced models on the B300 or similar GPU platforms, ZFLOW AI is open to discussions about optimizing specific workloads.
Looking Ahead
As AI infrastructure continues to evolve, ZFLOW AI's work points to a future in which optimization is increasingly automated, letting teams get the most out of their hardware without being tied to a specific vendor. The company expects continued development of these capabilities to make deploying AI models on advanced hardware more efficient.