Norway’s National Library Leverages 2 PB of Huawei Storage for LLM Training

Norway's National Library is utilising 2 PB of Huawei flash storage to develop a sovereign LLM tailored to the Norwegian language, addressing cultural and historical gaps in AI training.

Norway is embarking on an ambitious project to develop a large language model (LLM) that understands and processes the Norwegian language, utilizing 2 petabytes of Huawei OceanStor Dorado flash storage. This initiative is led by the National Library of Norway, which houses the largest digital collection of Norwegian literature and cultural content. Marius Husnes, the Head of IT Platform at the library, emphasized the necessity of a local language model, stating that existing commercial LLMs do not address the unique linguistic and cultural nuances of Norway.

The project is driven by Norway's Ministry of Culture, which tasked the library with creating a sovereign AI model. Husnes noted that any nation without a dedicated LLM for its language risks losing cultural representation in an AI-driven world. The library’s extensive archives, built up since its digitization efforts began in 2005, include 20 PB of unique data, showcasing a wide range of Norwegian books, newspapers, and web content. With a legal mandate to preserve the nation's cultural heritage, the library is well-equipped to undertake this significant task.

A key feature of this project is the collaboration with Norwegian newspapers, which allows the library to train the LLM on copyrighted content, an advantage that private companies lack. The library has built a data pipeline covering ingestion, cleaning, deduplication, and validation to ensure high-quality input for the LLM. The pipeline runs on an Nvidia DGX H200 system and several Huawei OceanStor all-flash arrays, which provide the low-latency storage that efficient data pipelines need.
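The four stages named above can be sketched in a few lines of Python. This is an illustrative outline only, not the library's actual pipeline: the record layout (`id`, `text`), the exact-match hashing used for deduplication, and the `min_chars` threshold are all assumptions made for the example.

```python
import hashlib
import re
import unicodedata
from typing import Iterable, Iterator


def ingest(raw_records: Iterable[dict]) -> Iterator[dict]:
    """Ingestion: keep only records that carry an id and a text string."""
    for record in raw_records:
        if record.get("id") and isinstance(record.get("text"), str):
            yield {"id": record["id"], "text": record["text"]}


def clean(records: Iterable[dict]) -> Iterator[dict]:
    """Cleaning: normalise Unicode (NFC) and collapse runs of whitespace."""
    for record in records:
        text = unicodedata.normalize("NFC", record["text"])
        text = re.sub(r"\s+", " ", text).strip()
        yield {**record, "text": text}


def deduplicate(records: Iterable[dict]) -> Iterator[dict]:
    """Deduplication: drop records whose case-folded text was already seen.

    Exact-match hashing is an assumption; a production pipeline might use
    near-duplicate detection instead.
    """
    seen = set()
    for record in records:
        digest = hashlib.sha256(record["text"].casefold().encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield record


def validate(records: Iterable[dict], min_chars: int = 20) -> Iterator[dict]:
    """Validation: keep records with at least min_chars characters (hypothetical rule)."""
    for record in records:
        if len(record["text"]) >= min_chars:
            yield record


def run_pipeline(raw_records: Iterable[dict], min_chars: int = 20) -> list:
    """Chain the four stages in the order the article lists them."""
    return list(validate(deduplicate(clean(ingest(raw_records))), min_chars))
```

Because each stage is a generator, records stream through one at a time rather than being held in memory all at once, which matters once the input is measured in petabytes.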

Technical Infrastructure and Challenges

The LLM training takes place on Norway’s national supercomputer, the Sigma2 Olivia system. This HPE Cray Supercomputing EX system features 448 GPUs and 64,512 CPU cores, paired with a 5.3 PB Cray ClusterStor E1000 storage system. However, transitioning from the library’s 60 PB preservation system to the AI pipeline has presented significant challenges. The preservation system prioritizes durability and cost over speed, resulting in high read latency that complicates the rapid data access needed for AI training.

Husnes has pointed out that the logistics of moving petabyte-scale datasets out of archives and through AI pipelines are rarely discussed. His team has had to work through these problems on its own, learning how to integrate the various systems so that they can support LLM training.
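A rough back-of-the-envelope calculation shows why moving data at this scale is a problem in its own right. The sketch below estimates how long it takes to stream a dataset at a sustained read rate. The 20 PB figure comes from the article, while the throughput values are purely hypothetical and do not describe the library's actual systems.

```python
def transfer_days(dataset_pb: float, throughput_gb_per_s: float) -> float:
    """Days needed to stream dataset_pb petabytes at a sustained rate.

    Uses decimal units (1 PB = 1,000,000 GB) and assumes the rate is
    sustained for the whole transfer, which real systems rarely manage.
    """
    if throughput_gb_per_s <= 0:
        raise ValueError("throughput must be positive")
    dataset_gb = dataset_pb * 1_000_000
    seconds = dataset_gb / throughput_gb_per_s
    return seconds / 86_400  # seconds per day


if __name__ == "__main__":
    # Hypothetical sustained read rates, chosen only for illustration.
    for rate in (1.0, 10.0, 50.0):
        days = transfer_days(20, rate)
        print(f"20 PB at {rate:g} GB/s: about {days:.1f} days")
```

Even at an assumed 10 GB/s sustained, a single pass over 20 PB takes more than three weeks, so a high-latency archive tier can easily become the bottleneck ahead of the GPUs.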

Ongoing Learning and Future Implications

As the project advances, the team is continuously learning about several critical aspects of developing a sovereign Norwegian LLM. Evaluation remains a significant concern, as Husnes highlighted the absence of standardized tools to measure the model's performance. Given the complexities of the Norwegian language, which has two written forms and multiple dialects, the library is in the process of developing its own evaluation metrics.

Governance also poses important questions. Issues regarding who controls access to the sovereign LLM and its applications have yet to be addressed. These considerations underscore the institutional and political factors that accompany the development of such technology.

Coordinating the three systems (the preservation archive, the on-premises AI environment, and the national supercomputer) so that they interoperate smoothly remains an ongoing effort. Husnes believes that Norway's initiatives can serve as a model for other non-English-speaking nations facing similar challenges in developing AI that accurately reflects their languages, cultures, and histories.

As Husnes aptly stated, “AI needs custodians, not just builders.” This perspective encapsulates the broader responsibility nations have in shaping the narrative of artificial intelligence, making sure it embodies local realities and cultural contexts. Norway's approach to creating a sovereign LLM not only addresses its own needs but also provides valuable insights for countries worldwide grappling with the complexities of language representation in AI.

Quick answers

What is the main goal of Norway’s National Library in this project?

The primary objective is to develop a sovereign large language model (LLM) that understands the Norwegian language and reflects the country's culture and history.

How is the data for the LLM being processed?

Data passes through a pipeline of ingestion, cleaning, deduplication, and validation, running on an Nvidia DGX H200 system and Huawei OceanStor all-flash arrays.

What challenges is the library facing in LLM development?

Challenges include transferring large datasets from the preservation system to the AI pipeline, ensuring data quality, and addressing governance issues regarding LLM access.

GPUBeat Desk

Desk · joined 2026

GPUBeat Desk covers AI infrastructure — chips, foundation models, inference economics, datacenter buildouts, and the geopolitics of compute.