Open AI Models Show Mixed Progress Amid Rising Benchmark Gaps

New evaluations from CAISI highlight the challenges faced by open AI models, showing significant performance gaps compared to their closed counterparts. Key updates from various projects offer a glimpse into the state of open AI development.

The latest evaluations of open AI models reveal a concerning trend: the performance gap between open and closed models is widening. A recent report by the Center for AI Standards and Innovation (CAISI) highlights the difficulties that open frontier models face in matching their closed counterparts, raising questions about the future of open AI development.

CAISI's findings show that open models, such as DeepSeek V4, are struggling to compete against established closed models across various benchmarks. The evaluation used an Elo scoring system based on Item Response Theory to compare models under different testing conditions. This month, several new models were released by developers including DeepSeek, Kimi, and GLM, all of which were examined for their performance.
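The article does not describe CAISI's scoring procedure in detail, so the following is only a minimal sketch of how an IRT-based ability estimate can be placed on an Elo-like scale. It assumes a one-parameter (Rasch) model with known item difficulties; the function names, the anchor rating of 1000, and the sample data are illustrative assumptions, not taken from the report.

```python
import math

# Assumption: a one-parameter (Rasch) IRT model. The report's actual model
# is not specified in the article; this only illustrates the general idea.
ELO_PER_LOGIT = 400 / math.log(10)  # about 173.7 Elo points per logit


def p_solve(theta: float, difficulty: float) -> float:
    """Rasch model: probability that ability `theta` solves an item."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def fit_ability(results: list[tuple[float, bool]], steps: int = 50) -> float:
    """Maximum-likelihood ability from (difficulty, solved) pairs.

    Uses Newton's method with a clamped step. The estimate is only finite
    when the results contain at least one success and one failure.
    """
    theta = 0.0
    for _ in range(steps):
        probs = [p_solve(theta, b) for b, _ in results]
        # Gradient of the log-likelihood: observed minus expected successes.
        grad = sum(int(solved) - p for (_, solved), p in zip(results, probs))
        # Fisher information (negative second derivative).
        info = sum(p * (1.0 - p) for p in probs)
        if info < 1e-12:
            break
        step = max(-1.0, min(1.0, grad / info))
        theta += step
    return theta


def to_elo(theta: float, anchor: float = 1000.0) -> float:
    """Map a logit-scale ability to an Elo-like rating (hypothetical anchor)."""
    return anchor + ELO_PER_LOGIT * theta


def elo_expected(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** (-(rating_a - rating_b) / 400))


# Illustrative data: four items of increasing difficulty, two solved.
sample = [(-1.0, True), (0.0, True), (1.0, False), (2.0, False)]
ability = fit_ability(sample)
rating = to_elo(ability)
```

The scale factor works because the Elo expected-score formula is a logistic curve in base 10, so a one-logit difference in ability corresponds to roughly 174 Elo points, and treating an item's difficulty as an opponent's rating reproduces the same solve probability on either scale.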

DeepSeek V4 scored notably low on critical benchmarks such as CTF-Archive-Diamond and PortBench. Differences in benchmarking methods, along with a limited data subset, may have contributed to the size of the measured gap. Overall, CAISI's analysis suggests that open models have consistently trailed their closed counterparts by an estimated three to seven months, counted from the initial release of the respective models.

CAISI's analysis and Epoch AI's ECI both offer some insight, but both face criticism for relying on simplified evaluation setups that often fail to reflect the complexity of real-world coding tasks. For example, the coding benchmarks do not use advanced tools and harnesses, which may understate what the models can actually do. Florian, a key contributor to the report, argues that open models are closer in performance to closed alternatives than the data indicates. Nathan takes a more critical view, arguing that closed models hold a larger advantage.

Several noteworthy projects have released updates this month. Xiaomi's ShareMiMo V2.5 Pro has shown significant progress, now competing with flagship models like Kimi K2.6 and GLM-5.1. Google’s Gemma 4 series has introduced multiple model sizes, including a new 26B-A4B MoE version, and has adopted an Apache 2.0 license, which may reduce some legal uncertainties regarding its use.

Kimi K2.6 continues to impress with improvements in performance and task longevity, while Laguna-XS.2 from Poolside AI marks the company's entry into coding-focused models. DeepSeek's V4 Flash, featuring impressive specifications, has generated interest, although early feedback suggests its Pro version may not meet expectations.

As the AI field evolves, recent developments underscore the need to refine benchmarking methods to better capture model capabilities. The ongoing dialogue among researchers and developers about the evaluation of open versus closed models will likely influence future advancements in the sector.

The mixed results from recent evaluations illustrate the challenges and potential of open AI models. As these projects progress, improving benchmark methodologies could be key to bridging the performance gap that currently exists between open and closed models.

Quick answers

What is the main finding of the CAISI report?

The report indicates that open models are increasingly lagging behind closed models in performance benchmarks.

Which new models were released this month?

New models include DeepSeek V4, Kimi K2.6, ShareMiMo V2.5 Pro, and Gemma 4.

What improvements have been made in the Kimi model?

Kimi K2.6 has shown stronger performance across various tasks, particularly in long-duration tasks.

GPUBeat Desk

GPUBeat Desk covers AI infrastructure — chips, foundation models, inference economics, datacenter buildouts, and the geopolitics of compute.