New Benchmark Reveals AI Healthcare Agents’ Struggles with Workflows

A new benchmark shows that top AI agents from OpenAI and Anthropic fail at least 72% of clinical workflows, raising concerns about their readiness for real-world healthcare use.

A recent benchmark study found that even the best-performing AI agent failed 72% of clinical workflows. Conducted by actAVA.ai, the CHI-Bench benchmark tested 30 AI agents from companies including OpenAI and Anthropic across 75 healthcare workflows.

The findings show that even the top-performing agent, Anthropic's Claude Code, passed only 28% of the tasks, while OpenAI's Codex followed at 21%. These results indicate that many AI systems, despite being marketed as capable of managing extensive workflows, struggle significantly with the complexities of real-world healthcare operations.

CHI-Bench was developed in collaboration with over 20 institutions, including prestigious health systems like Johns Hopkins and Yale, as well as universities such as Stanford and Oxford. This benchmark marks a significant advancement in evaluating AI agents’ capabilities in healthcare, emphasizing their ability to handle multi-step processes that span various roles and departments, rather than focusing solely on narrow clinical knowledge.

Each trial in CHI-Bench required agents to navigate between 60 and 80 steps across four to six clinical stages. This involved interfacing with 21 healthcare applications and utilizing over 200 MCP tools, along with a comprehensive 1,279-document operations handbook. The evaluation criteria were stringent, employing deterministic unit tests and an LLM judge to assess evidence grounding, consent, and cross-stage consistency.
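As a rough illustration of how deterministic checks and an LLM judge could gate a trial together, here is a minimal Python sketch. It assumes an all-or-nothing rule (every unit check and every judged dimension must pass); the article does not say how CHI-Bench actually combines the two, and `TrialResult`, `trial_passes`, and the check names are hypothetical.

```python
from dataclasses import dataclass

# Judge dimensions named in the article. How CHI-Bench scores or weights
# them is not stated; treating each as pass/fail is an assumption.
JUDGE_DIMENSIONS = ("evidence_grounding", "consent", "cross_stage_consistency")


@dataclass
class TrialResult:
    """Hypothetical container for one agent trial."""
    unit_checks: dict[str, bool]     # deterministic check name -> passed?
    judge_verdicts: dict[str, bool]  # judge dimension -> passed?


def trial_passes(result: TrialResult) -> bool:
    """Assumed rule: a trial passes only if nothing fails anywhere."""
    # Require at least one deterministic check so an empty run cannot pass.
    checks_ok = bool(result.unit_checks) and all(result.unit_checks.values())
    # A dimension missing from the judge output counts as a failure.
    judge_ok = all(result.judge_verdicts.get(dim, False) for dim in JUDGE_DIMENSIONS)
    return checks_ok and judge_ok


clean = TrialResult(
    unit_checks={"letter_generated": True, "case_routed": True},
    judge_verdicts={dim: True for dim in JUDGE_DIMENSIONS},
)
ungrounded = TrialResult(
    unit_checks={"letter_generated": True, "case_routed": True},
    judge_verdicts={"evidence_grounding": False, "consent": True,
                    "cross_stage_consistency": True},
)
print(trial_passes(clean), trial_passes(ungrounded))
```

Under a rule like this, one failed dimension anywhere in a 60-to-80-step trial is enough to fail the whole trial, which would help explain low pass rates on long workflows.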

Despite the substantial resources invested in these AI systems, reliability emerged as a major concern. When the same case was rerun three times, no agent achieved a success rate above 20%. In endurance testing, where agents handled 25 cases in a single session, the leading system completed fewer than 4% of tasks successfully. In scenarios where one AI submitted a prior authorization request while another acted as the reviewer, not a single task was completed successfully.
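The article does not define how the three-rerun figure is calculated. One common reading is that a case counts as a success only if every rerun passes. The Python sketch below, using made-up data and hypothetical function names, shows how that stricter measure can sit well below the average per-run pass rate.

```python
def average_pass_rate(runs_by_case: dict[str, list[bool]]) -> float:
    """Fraction of all individual runs that passed."""
    all_runs = [ok for runs in runs_by_case.values() for ok in runs]
    if not all_runs:
        raise ValueError("no runs recorded")
    return sum(all_runs) / len(all_runs)


def all_runs_pass_rate(runs_by_case: dict[str, list[bool]]) -> float:
    """Fraction of cases in which every rerun passed (assumed definition)."""
    if not runs_by_case:
        raise ValueError("no cases recorded")
    consistent = sum(1 for runs in runs_by_case.values() if runs and all(runs))
    return consistent / len(runs_by_case)


# Illustrative data only: four cases, three reruns each.
example_runs = {
    "case-a": [True, True, True],
    "case-b": [True, False, True],
    "case-c": [False, False, False],
    "case-d": [True, True, False],
}

print(f"average per-run pass rate: {average_pass_rate(example_runs):.2f}")
print(f"all-three-runs pass rate:  {all_runs_pass_rate(example_runs):.2f}")
```

In this toy data, seven of twelve runs pass but only one of four cases passes all three times, so an agent that looks moderately capable run by run can still look unreliable case by case.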

This benchmark highlights the gap between AI capabilities in controlled environments and the demands placed on these systems in real healthcare settings. As Haolin Chen, lead author of the benchmark, noted, “These workflows are long, role-composed, and gated by policy. An agent has to play intake clerk, nurse reviewer, and medical director across sixty-plus steps where one wrong site-of-service flip cascades into multiple failures.”

The implications of these findings are significant. Healthcare operations require agents to read clinical notes, apply specific medical policies, generate compliant determination letters, and route outcomes effectively—tasks that demand a level of reliability that current AI systems do not yet provide.

Weiran Yao, Chief AI Officer at actAVA, emphasized the benchmark's importance, stating, “We need to know whether an agent can carry a real case end-to-end without error.” The CHI-Bench initiative seeks to address this critical question, offering an open-source benchmark under Apache 2.0 on GitHub, where community submissions are now encouraged.

As the healthcare sector increasingly turns to AI for automation and efficiency, the findings from CHI-Bench are a stark reminder of the work that lies ahead. Even as AI technology advances, rigorous testing and validation in real-world applications remain essential. The healthcare industry must balance the promise of AI against its current limitations and keep patient care the top priority when adopting these technologies.

Quick answers

Which AI agents were tested in CHI-Bench?

The benchmark tested agents from Anthropic, OpenAI, Google, xAI, DeepSeek, and Z.ai.

What were the success rates of the top AI agents?

Anthropic's Claude Code achieved a 28% success rate, while OpenAI's Codex reached 21%.

What are the implications of the CHI-Bench findings?

The findings highlight significant reliability issues in AI agents, indicating they are not yet ready to handle complex healthcare workflows effectively.

GPUBeat Desk

GPUBeat Desk covers AI infrastructure — chips, foundation models, inference economics, datacenter buildouts, and the geopolitics of compute.