The latest evaluations of open AI models reveal a concerning trend: the performance gap between open and closed models is widening. A recent report by the Center for AI Standards and Innovation (CAISI) highlights the difficulties that open frontier models face in matching their closed counterparts, raising questions about the future of open AI development.
CAISI's findings show that open models, such as DeepSeek V4, are struggling to compete against established closed models across various benchmarks. The evaluation used an Elo scoring system based on Item Response Theory to compare models under different testing conditions. This month also brought several new releases, including models in the DeepSeek, Kimi, and GLM families, all of which were examined for their performance.
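To make the idea of an Elo score derived from Item Response Theory more concrete, here is a minimal sketch. It assumes the simplest IRT variant, a one-parameter (Rasch) model, fitted by plain gradient ascent; the article does not say which IRT model or fitting procedure CAISI used, so this should not be read as their method. The model names, the pass/fail matrix, the ridge penalty, and the 1000-point anchor are all made up for illustration.

```python
import math

# Elo uses base-10 odds on a 400-point scale, so one logit equals 400 / ln(10) points.
ELO_PER_LOGIT = 400.0 / math.log(10)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_rasch(results, steps=2000, lr=0.05, ridge=0.1):
    """Fit P(model m solves item i) = sigmoid(ability[m] - difficulty[i]).

    results[m][i] is 1 if model m solved item i, else 0.
    The small ridge penalty (an assumption of this sketch) keeps estimates
    finite when a model solves every item or an item is solved by no one.
    """
    n_models, n_items = len(results), len(results[0])
    ability = [0.0] * n_models
    difficulty = [0.0] * n_items
    for _ in range(steps):
        grad_a = [0.0] * n_models
        grad_d = [0.0] * n_items
        for m in range(n_models):
            for i in range(n_items):
                err = results[m][i] - sigmoid(ability[m] - difficulty[i])
                grad_a[m] += err   # solving more than predicted raises ability
                grad_d[i] -= err   # being solved more than predicted lowers difficulty
        ability = [a + lr * (g - ridge * a) for a, g in zip(ability, grad_a)]
        difficulty = [d + lr * (g - ridge * d) for d, g in zip(difficulty, grad_d)]
    return ability, difficulty


def to_elo(ability, anchor=1000.0):
    """Rescale logit abilities to Elo points, centred on an arbitrary anchor."""
    mean = sum(ability) / len(ability)
    return [anchor + ELO_PER_LOGIT * (a - mean) for a in ability]


def expected_score(rating_a, rating_b):
    """Standard Elo expectation that A outperforms B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Hypothetical pass/fail results: rows are models, columns are benchmark items.
NAMES = ["model-a", "model-b", "model-c"]
RESULTS = [
    [1, 1, 1, 1, 0, 1],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0],
]

abilities, difficulties = fit_rasch(RESULTS)
elo = to_elo(abilities)
for name, rating in zip(NAMES, elo):
    print(f"{name}: {rating:.0f}")
```

The appeal of this kind of setup is that models do not all need to be scored on identical items or conditions: each item gets its own difficulty, and abilities are placed on one shared scale. The Elo conversion is only a change of units, so the rating difference between two models maps directly to the odds that one outperforms the other on an item.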
DeepSeek V4 scored notably low on critical benchmarks such as CTF-Archive-Diamond and PortBench. Differences in benchmarking methods, along with the limited subset of data used, have contributed to the size of the measured gap. CAISI's analysis estimates that open models trail their closed counterparts by three to seven months, counted from the initial release of the respective models, and suggests that this lag has remained consistent.
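One way to read a gap expressed in months is as the time between an open model's release and the earliest release of a closed model that had already reached the same score. The sketch below illustrates that reading only; the article does not describe how CAISI computes its estimate, and the dates and scores are hypothetical.

```python
from datetime import date


def months_between(earlier, later):
    """Whole calendar months from `earlier` to `later`, ignoring the day of month."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)


def lag_in_months(open_release, open_score, closed_history):
    """Months since a closed model first matched the open model's score.

    closed_history is a list of (release_date, score) pairs for closed models.
    Returns None if no closed model released on or before open_release
    had reached open_score.
    """
    matched = [
        released
        for released, score in closed_history
        if score >= open_score and released <= open_release
    ]
    if not matched:
        return None
    return months_between(min(matched), open_release)


# Hypothetical closed-model releases and scores, for illustration only.
CLOSED_HISTORY = [
    (date(2025, 1, 15), 1200),
    (date(2025, 6, 1), 1350),
    (date(2025, 11, 1), 1500),
]

lag = lag_in_months(date(2025, 12, 10), 1340, CLOSED_HISTORY)
print(f"open model trails by {lag} months")
```

Under this definition the result depends heavily on which benchmark supplies the scores, which is one reason different analyses can report different gaps for the same models.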
While the analyses from CAISI and Epoch AI's ECI offer some insight, both face criticism for relying on simplified evaluation setups that often fail to reflect the complexity of real-world coding tasks. For example, the benchmarks that assess coding capabilities do not use advanced tools and harnesses, which may understate what the models can actually do. Florian, a key contributor to the report, argues that open models are closer in performance to closed alternatives than the data indicates. Nathan is more critical, arguing that closed models hold a larger advantage.
Several noteworthy projects have released updates this month. Xiaomi's ShareMiMo V2.5 Pro has shown significant progress, now competing with flagship models like Kimi K2.6 and GLM-5.1. Google’s Gemma 4 series has introduced multiple model sizes, including a new 26B-A4B MoE version, and has adopted an Apache 2.0 license, which may reduce some legal uncertainties regarding its use.
Kimi K2.6 continues to impress with improvements in performance and in how long it can sustain a task, while Laguna-XS.2 from Poolside AI marks the company's entry into coding-focused models. DeepSeek's V4 Flash has generated interest for its specifications, although early feedback suggests the Pro version of V4 may not meet expectations.
As the AI field evolves, recent developments underscore the need to refine benchmarking methods to better capture model capabilities. The ongoing dialogue among researchers and developers about the evaluation of open versus closed models will likely influence future advancements in the sector.
The mixed results from recent evaluations illustrate the challenges and potential of open AI models. As these projects progress, improving benchmark methodologies could be key to bridging the performance gap that currently exists between open and closed models.
Quick answers
What is the main finding of the CAISI report?
The report indicates that open models are increasingly lagging behind closed models in performance benchmarks.
Which new models were released this month?
New models include DeepSeek V4, Kimi K2.6, ShareMiMo V2.5 Pro, and Gemma 4.
How does the benchmarking method affect model evaluation?
Simplified benchmarking methods may not accurately reflect real-world performance, leading to misleading conclusions about a model's capabilities.
What improvements have been made in the Kimi model?
Kimi K2.6 has shown stronger performance across various tasks, particularly in long-duration tasks.



