Frontier Models · May 17

Open AI Models Show Mixed Progress Amid Rising Benchmark Gaps

New evaluations from CAISI highlight the challenges faced by open AI models, showing significant performance gaps compared to their closed counterparts. Key updates from various projects offer a glimpse into the state of open AI development.

GPUBeat Desk

Desk · GPUBeat Media

Published May 17 · 01:37 ET


Image: Evaluation of open AI models, including DeepSeek V4. Source: GPUBeat

The latest evaluations of open AI models reveal a concerning trend: the performance gap between open and closed models is widening. A recent report by the Center for AI Standards and Innovation (CAISI) highlights the difficulties that open frontier models face in matching their closed counterparts, raising questions about the future of open AI development.

CAISI's findings show that open models, such as DeepSeek V4, are struggling to compete against established closed models across various benchmarks. The evaluation used an Elo scoring system based on Item Response Theory to compare models under different testing conditions. Several new models were released this month, including entries in the DeepSeek, Kimi, and GLM families, all of which were examined for their performance.
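To make the scoring idea concrete, here is a minimal sketch of how an Item Response Theory ability estimate can be placed on an Elo-like scale. It is not CAISI's published method: it assumes the simplest IRT variant (a one-parameter Rasch model), and the item difficulties, pass/fail outcomes, anchor value of 1000, and function names are all hypothetical. The one fixed relationship is the conversion factor, since a logistic curve in natural-log odds matches the Elo expected-score curve when one logit equals 400 / ln(10) Elo points.

```python
import math

# One logit (natural-log odds unit) expressed in Elo points: 400 / ln(10).
ELO_PER_LOGIT = 400 / math.log(10)

def p_solve(theta: float, difficulty: float) -> float:
    """Rasch (1PL) probability that a model with ability theta solves an item."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

def fit_ability(outcomes, difficulties, steps=500, lr=0.1):
    """Maximum-likelihood ability via gradient ascent, with item difficulties held fixed."""
    theta = 0.0
    for _ in range(steps):
        # Gradient of the log-likelihood: observed minus expected solves.
        gradient = sum(y - p_solve(theta, b) for y, b in zip(outcomes, difficulties))
        theta += lr * gradient
    return theta

def to_elo(theta: float, anchor: float = 1000.0) -> float:
    """Map an ability in logits onto an Elo-like scale (anchor is an arbitrary assumption)."""
    return anchor + ELO_PER_LOGIT * theta

# Hypothetical benchmark: four items of increasing difficulty, two models.
difficulties = [-1.0, 0.0, 0.5, 1.5]
model_a = [1, 1, 1, 0]  # 1 = solved, 0 = failed
model_b = [1, 1, 0, 0]

elo_a = to_elo(fit_ability(model_a, difficulties))
elo_b = to_elo(fit_ability(model_b, difficulties))
print(f"Elo-like gap between model A and model B: {elo_a - elo_b:.0f}")
```

A real evaluation would also estimate the item difficulties jointly across many models. Note that this sketch has no finite answer for a model that solves every item or none, since the likelihood keeps improving as the ability grows without bound.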

DeepSeek V4 scored notably low on critical benchmarks such as CTF-Archive-Diamond and PortBench. Differences in benchmarking methods, along with a limited data subset, have contributed to the reported gap. CAISI's analysis suggests that the lag between open models and their closed counterparts has remained consistent, at an estimated three to seven months from the initial release of the respective models.

While the analyses from CAISI and Epoch AI's ECI offer some insights, both face criticism for relying on simplified evaluation setups that often fail to reflect the complexity of real-world coding tasks. For example, the benchmarks assessing coding capabilities do not use advanced tools and harnesses, which may misrepresent the true potential of the models. Florian, a key contributor to the report, argues that open models are closer in performance to closed alternatives than the data indicates. In contrast, Nathan takes a more critical view, claiming that closed models hold a larger advantage.

Several noteworthy projects have released updates this month. Xiaomi's ShareMiMo V2.5 Pro has shown significant progress, now competing with flagship models like Kimi K2.6 and GLM-5.1. Google’s Gemma 4 series has introduced multiple model sizes, including a new 26B-A4B MoE version, and has adopted an Apache 2.0 license, which may reduce some legal uncertainties regarding its use.

https://x.com/Designarena/status/2054776484833952000

Kimi K2.6 continues to impress with improvements in performance and task longevity, while Laguna-XS.2 from Poolside AI marks the company's entry into coding-focused models. DeepSeek's V4 Flash, featuring impressive specifications, has generated interest, although early feedback suggests the Pro version of V4 may not meet expectations.

As the AI field evolves, recent developments underscore the need to refine benchmarking methods to better capture model capabilities. The ongoing dialogue among researchers and developers about the evaluation of open versus closed models will likely influence future advancements in the sector.

The mixed results from recent evaluations illustrate both the challenges and the potential of open AI models. As these projects progress, better benchmark methodologies could be key to measuring the performance gap between open and closed models accurately.

Quick answers

What is the main finding of the CAISI report?

The report indicates that open models are increasingly lagging behind closed models in performance benchmarks.

Which new models were released this month?

New models include DeepSeek V4, Kimi K2.6, ShareMiMo V2.5 Pro, and Gemma 4.

How does the benchmarking method affect model evaluation?

Simplified benchmarking methods may not accurately reflect real-world performance, leading to misleading conclusions about a model's capabilities.

What improvements have been made in the Kimi model?

Kimi K2.6 has shown stronger performance across various tasks, particularly in long-duration tasks.

