In an impressive display of ingenuity, a Reddit user has managed to run a 1-trillion-parameter language model locally by using Intel's discontinued Optane Persistent Memory. The build, a workstation with a single Xeon processor and a consumer GPU, demonstrates the potential of cost-effective memory solutions in AI applications.
The Redditor, known as APFrisco, shared this accomplishment on the Local LLaMA subreddit, explaining how they used six Optane PMem DIMMs, each with a capacity of 128GB, for a total of 768GB of memory. This setup enabled them to run the Kimi K2.5 model at approximately four tokens per second. Although Optane's latency is higher than that of traditional DRAM, it is far lower than that of SSDs, and that combination of low cost and relatively low latency makes it an appealing choice for local inference of large language models.
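The capacity arithmetic above can be sketched in a few lines of Python. This is a rough back-of-envelope check, not a description of APFrisco's actual configuration: the bits-per-weight values are illustrative assumptions (the post as summarized here does not state which quantization was used), the function names are hypothetical, and the estimate ignores the KV cache and other runtime overhead.

```python
# Back-of-envelope capacity check for holding model weights in Optane PMem.
# ASSUMPTIONS: bits-per-weight values are illustrative only; KV cache and
# runtime overhead are ignored; decimal gigabytes are used for simplicity.

GB = 10**9  # bytes per decimal gigabyte


def pmem_capacity_gb(dimm_count: int, dimm_size_gb: int) -> int:
    """Total capacity of identical memory modules, in GB."""
    return dimm_count * dimm_size_gb


def weights_size_gb(param_count: int, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB."""
    return param_count * bits_per_weight / 8 / GB


def fits(param_count: int, bits_per_weight: float, capacity_gb: float) -> bool:
    """True if the weights alone fit within the given capacity."""
    return weights_size_gb(param_count, bits_per_weight) <= capacity_gb


if __name__ == "__main__":
    capacity = pmem_capacity_gb(6, 128)  # six 128GB modules, as in the build
    params = 10**12                      # one trillion parameters
    for bits in (16, 8, 4):              # hypothetical precisions
        size = weights_size_gb(params, bits)
        verdict = "fits" if fits(params, bits, capacity) else "does not fit"
        print(f"{bits:>2} bits/weight: {size:,.0f} GB, {verdict} in {capacity} GB")
```

Under these assumptions, a trillion-parameter model only fits in 768GB once the weights are compressed to well under 8 bits each, which suggests why aggressive quantization matters for a build like this.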
Optane memory was designed to bridge the gap between DRAM and SSDs, providing a unique solution for high-performance computing tasks. While production has ceased, the second-hand market offers these memory modules at a significantly lower price compared to equivalent DRAM capacities. APFrisco noted that the total cost for their build was “much less than what the equivalent DRAM capacity would cost,” making it a practical option for those on tight budgets.
Hardware Configuration and Performance
APFrisco's workstation was built around an Intel Xeon Gold 6246 CPU, an Asus Dual GeForce RTX 3060 GPU, and a Tyan motherboard. Six Optane PMem modules were paired with six Samsung DDR4 ECC DRAM modules in a hybrid caching setup, in which the faster DRAM sits in front of the larger, slower Optane capacity. Inference ran on llama.cpp, which let APFrisco make efficient use of the GPU's limited memory.
Roughly four tokens per second is slow by data-center standards, but running a trillion-parameter model locally at all is no small feat, especially on a limited hardware budget. The Redditor expressed pride in the system's output, stating, “Given the fact that this is a trillion-parameter frontier-class model running on such a limited hardware budget, I would consider it to be a great success.” The result suggests that similar setups could make very large models accessible to hobbyists and small teams.
Looking Ahead: Bridging Memory Gaps
The broader implications of this achievement extend beyond individual hardware builds. With the increasing demand for memory solutions tailored to AI workloads, there is a pressing need for products that can fill the space between DRAM and SSDs. The industry is closely monitoring the development of the CXL (Compute Express Link) standard, which promises to deliver large pools of affordable, byte-addressable memory, potentially transforming AI infrastructure.
As researchers and developers strive to push the limits of AI models, configurations like APFrisco's could pave the way for more affordable and accessible local inference solutions. The success of such builds may spark further interest in alternative memory technologies, encouraging innovation and exploration in the AI sector.
The deployment of a 1-trillion-parameter model on a budget-friendly Intel Optane setup not only showcases the promise of second-hand technology but also highlights a growing need for memory solutions designed to meet the demands of modern AI applications.



