ExploitBench, a new benchmark developed by researchers at Carnegie Mellon University, reveals stark differences in how well AI agents exploit vulnerabilities in V8, Google's JavaScript engine. Anthropic's Claude Mythos significantly outperformed OpenAI's GPT-5.5 and reached advanced exploit tiers far more often, though at a considerably higher cost.
Rather than only checking whether a bug can be triggered, the benchmark measures progress across five distinct tiers of exploitation, culminating in arbitrary code execution. V8 powers Chrome, Edge, Node.js, and Cloudflare Workers, among other platforms, so the findings bear directly on real-world software.
In the latest tests, Mythos averaged 9.90 out of 16 points and reached the highest tier on 21 of 41 vulnerabilities. GPT-5.5 averaged only 5.51 points and reached the top tier just twice. The gap widened when both models ran fully autonomously: Mythos scored 9.55, while GPT-5.5, using Codex, scored 4.30.
The cost of running these tests is as noteworthy as the results. The full Mythos run spanned 122 episodes and cost approximately $36,428, compared with around $3,075 for GPT-5.5's 123 episodes. This roughly twelvefold difference shows how much more expensive Mythos is to run, despite its superior performance. The UK's AI Safety Institute corroborated these findings, noting that Mythos outperforms GPT-5.5 but at a significantly higher cost. The price gap suggests that OpenAI might be able to improve GPT-5.5's performance by spending more compute.
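To make the "roughly twelvefold" figure concrete, the reported totals can be broken down per episode. The short Python sketch below uses only the dollar amounts and episode counts quoted above. The variable and function names are illustrative, and it assumes cost is spread evenly across episodes, which the reported figures do not confirm.

```python
# Figures as quoted in the article; all names here are illustrative.
MYTHOS_COST_USD = 36_428
MYTHOS_EPISODES = 122
GPT55_COST_USD = 3_075
GPT55_EPISODES = 123


def cost_per_episode(total_cost_usd: float, episodes: int) -> float:
    """Average cost of one episode in US dollars (assumes an even spread)."""
    return total_cost_usd / episodes


mythos_per_episode = cost_per_episode(MYTHOS_COST_USD, MYTHOS_EPISODES)
gpt55_per_episode = cost_per_episode(GPT55_COST_USD, GPT55_EPISODES)

# Ratio of the two totals, and of the two per-episode averages.
total_ratio = MYTHOS_COST_USD / GPT55_COST_USD
per_episode_ratio = mythos_per_episode / gpt55_per_episode

print(f"Mythos:  ${mythos_per_episode:,.2f} per episode")
print(f"GPT-5.5: ${gpt55_per_episode:,.2f} per episode")
print(f"Total cost ratio:  {total_ratio:.2f}x")
print(f"Per-episode ratio: {per_episode_ratio:.2f}x")
```

On these numbers, a Mythos episode averages about $299 against $25 for GPT-5.5, so the gap is just under twelvefold whether measured by total or per episode.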
ExploitBench co-author Seunghyun Lee, a security researcher who has reported more than 20 browser vulnerabilities, reviewed the Mythos transcripts. He concluded that the model works much like a skilled browser security researcher. In one case, Mythos devised an exploit technique that human experts had dismissed as too intricate, successfully exploiting a vulnerability (CVE-2024-0519) that had stumped human researchers for over a year.
The researchers acknowledge that the tested vulnerabilities are publicly known and could appear in the models' training data, but they note that the dataset also includes vulnerabilities with no public exploits or detailed reports. The benchmark does not currently assess whether a model can discover new flaws or fully weaponize an exploit for real attacks.
The benchmark's repository is available on GitHub, and the research paper is on arXiv. Both Anthropic and OpenAI provided API credits for the study, but the authors say their analysis was conducted independently.
Quick answers
What is the main finding of the Carnegie Mellon benchmark?
The benchmark shows that Anthropic's Claude Mythos significantly outperforms OpenAI's GPT-5.5 at exploiting vulnerabilities in Google's V8 JavaScript engine.
How do the costs of running the tests compare?
Mythos tests cost approximately $36,428 for 122 episodes, while GPT-5.5 tests cost around $3,075 for 123 episodes.
What tiers does the benchmark assess?
The benchmark evaluates progress across five tiers, culminating in arbitrary code execution.
Can the AI models discover new vulnerabilities?
The current benchmark does not measure the ability to find new flaws or fully weaponize an exploit for actual attacks.



