
AMD’s Two-Phase Initialization Technique Dramatically Enhances LLM Inference

AMD's innovative two-phase deferred initialization technique significantly cuts down LLM inference startup time, achieving a reduction of up to 10× on its Ryzen AI processors.

GPUBeat Desk

Desk · GPUBeat Media

Published May 22 · 06:53 ET



AMD's recent work on on-device processing targets a specific pain point in large language model (LLM) inference: startup time. A newly developed two-phase deferred initialization technique cuts model initialization from around 10 seconds to about 1 second while preserving inference correctness, a gain that matters most for real-time applications.

AMD’s Ryzen™ AI processors use a hybrid execution model for LLM inference, splitting work between the neural processing unit (NPU) and the integrated GPU (iGPU). The NPU handles the compute-intensive prefill phase and is rated at up to 50 TOPS (trillions of operations per second), while the iGPU handles the memory-bound decoding phase. This division of labor is key to performance, but it also complicates initialization at model load time.
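The split described above can be pictured as a simple routing rule. The following is a minimal sketch only: `Device`, `Step`, `pick_device`, and `plan_request` are hypothetical names invented for illustration, not part of any AMD runtime, and a real scheduler would consider far more than the phase label.

```python
from dataclasses import dataclass
from enum import Enum


class Device(Enum):
    NPU = "npu"    # assumed target for compute-bound prefill
    IGPU = "igpu"  # assumed target for memory-bound decode


@dataclass
class Step:
    phase: str       # "prefill" or "decode"
    num_tokens: int  # tokens processed in this step


def pick_device(step: Step) -> Device:
    """Hypothetical routing rule: prefill goes to the NPU, decode to the iGPU."""
    if step.phase == "prefill":
        return Device.NPU
    if step.phase == "decode":
        return Device.IGPU
    raise ValueError(f"unknown phase: {step.phase!r}")


def plan_request(prompt_tokens: int, decode_steps: int) -> list[tuple[Device, Step]]:
    """One prefill step over the whole prompt, then single-token decode steps."""
    steps = [Step("prefill", prompt_tokens)]
    steps += [Step("decode", 1) for _ in range(decode_steps)]
    return [(pick_device(s), s) for s in steps]


plan = plan_request(prompt_tokens=128, decode_steps=3)
for device, step in plan:
    print(device.value, step.phase, step.num_tokens)
```

Because both devices take part in every request, operators for both have to be ready before the first prompt can be served, which is why load-time setup becomes more involved.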

Understanding the Initialization Bottleneck

Before token generation can begin, the inference runtime must create and configure a series of custom NPU operators, such as layer normalization, quantized matrix multiplication, and multi-head attention. Setting up each operator requires reading the model graph, allocating device memory, and transferring model weights, and in the original implementation all of this ran sequentially on a single thread. Interleaving these unrelated kinds of work slows loading and pollutes the CPU cache: data cached for one kind of work is repeatedly evicted and reloaded as the thread switches to the other.
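To make the interleaving concrete, here is a rough sketch of a single-threaded loader. It does no real I/O or device work; the step names and the `domain_switches` helper are assumptions made for illustration, and the switch count is only a stand-in for cache pressure, not a measurement of it.

```python
OPERATORS = ["layer_norm", "quantized_matmul", "multi_head_attention"]

# Assumption for this sketch: graph reading counts as "model-reading" work,
# while allocation and weight transfer count as "device-setup" work.
READ_STEPS = {"read_graph"}


def init_single_threaded(operators):
    """Run all three setup steps for each operator, one operator at a time."""
    log = []
    for name in operators:
        log.append(("read_graph", name))        # model-reading work
        log.append(("alloc_device_mem", name))  # device-setup work
        log.append(("transfer_weights", name))  # device-setup work
    return log


def domain_switches(log):
    """Count how often consecutive steps jump between the two kinds of work."""
    domains = ["read" if step in READ_STEPS else "device" for step, _ in log]
    return sum(a != b for a, b in zip(domains, domains[1:]))


baseline_log = init_single_threaded(OPERATORS)
print(domain_switches(baseline_log))
```

With N operators, this loop alternates between the two kinds of work 2N - 1 times, whereas doing all reads first and all device setup afterwards would switch only once.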

The Two-Phase Solution

AMD's two-phase deferred initialization strategy tackles these inefficiencies directly. It decouples model-reading tasks from device-setup operations and runs the two on separate threads. This separation prevents cross-domain CPU cache pollution and shortens model load time without affecting inference correctness. For instance, with the Qwen3-4B model on AMD Ryzen™ AI processors, the technique has shown a tenfold improvement in initialization time.
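One way to picture the decoupling is a producer-consumer pair: a reader thread parses model data, and a second thread performs the deferred device setup. This is a structural sketch under stated assumptions, not AMD's implementation. All function names are hypothetical, the queue hand-off is one possible design among several, and the steps only record log entries instead of touching files or hardware.

```python
import queue
import threading

OPERATORS = ["layer_norm", "quantized_matmul", "multi_head_attention"]
_DONE = object()  # sentinel marking the end of the reader's output


def read_model(operators, out_q, log):
    """Phase 1 (reader thread): parse model data only, touch no device state."""
    for name in operators:
        log.append(("read_graph", name, threading.current_thread().name))
        out_q.put({"op": name})  # placeholder for parsed graph metadata
    out_q.put(_DONE)


def setup_device(in_q, log, ready):
    """Phase 2 (device thread): deferred allocation and weight transfer."""
    while (spec := in_q.get()) is not _DONE:
        worker = threading.current_thread().name
        log.append(("alloc_device_mem", spec["op"], worker))
        log.append(("transfer_weights", spec["op"], worker))
        ready.append(spec["op"])


def init_two_phase(operators):
    q, read_log, device_log, ready = queue.Queue(), [], [], []
    reader = threading.Thread(
        target=read_model, args=(operators, q, read_log), name="reader")
    device = threading.Thread(
        target=setup_device, args=(q, device_log, ready), name="device")
    reader.start()
    device.start()
    reader.join()
    device.join()
    return ready, read_log, device_log


ready, read_log, device_log = init_two_phase(OPERATORS)
print(ready)
```

Each thread stays within one kind of work, which is the property the article credits with avoiding cross-domain cache pollution. Python threads do not run CPU-bound code in parallel, so this sketch shows only the structure and says nothing about the reported speedup.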

Implications for Future AI Applications

As more applications use LLMs, efficient inference and initialization become increasingly critical. AMD's approach is aimed at applications that demand rapid responses, such as conversational agents and real-time data analysis. By shortening initialization, and with it the time to first token (TTFT) after a model is loaded, AMD is positioning itself for a market that expects near-instant responses.

Looking forward, successful deployment of this two-phase initialization technique could lead to further developments in AI infrastructure, particularly as demand for LLMs grows. Optimizing model loading improves the user experience and expands what AI-driven applications can do, encouraging further exploration and innovation in the field.

AMD's initiative reflects a wider industry trend aimed at enhancing efficiency in AI workloads, highlighting the critical interplay between hardware and software. As companies continue to invest in AI technologies, AMD's advancements may serve as a model for future innovations, emphasizing the importance of both speed and accuracy in machine learning applications.

