TLDR:
- EVMbench draws from 120 high-severity vulnerabilities curated across 40 real-world smart contract audits.
- GPT-5.3-Codex scored 72.2% in exploit mode, far outperforming GPT-5, which reached just 31.9%.
- Moonwell and CrossCurve both suffered smart contract exploits recently, adding urgency to AI-driven security tools.
- Anthropic’s late 2025 report warned that AI agents can already identify smart contract flaws autonomously.
EVMbench is the latest collaborative effort between OpenAI and crypto investment firm Paradigm. The tool is designed to measure how well AI agents detect, patch, and exploit vulnerabilities in smart contracts.
Built from 120 high-severity vulnerabilities across 40 audits, EVMbench targets the Ethereum Virtual Machine ecosystem.
This development comes amid recent DeFi exploits that have renewed industry focus on smarter, faster contract auditing through artificial intelligence.
EVMbench Tests AI Agents Across Multiple Capability Modes
EVMbench evaluates AI agents across several distinct capability modes. These include detecting vulnerabilities and patching contract code to eliminate potential exploitability in deployed contracts.
The benchmark also tests an agent’s ability to execute end-to-end fund-draining attacks in a sandboxed blockchain environment.
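To make the exploit-mode idea concrete, here is a minimal, hypothetical Python sketch of how an end-to-end fund-draining check could be scored against a sandboxed chain. This is not EVMbench’s actual harness: the `score_exploit_run` helper, the `run_agent_exploit` callback, and the scoring rule are illustrative assumptions, and web3.py is used only to compare balances on a local test chain before and after the agent acts.

```python
# Minimal, hypothetical sketch of scoring an exploit-mode run against a
# sandboxed chain. Not EVMbench's real harness; names and flow are assumed.
from web3 import Web3


def score_exploit_run(rpc_url, target, attacker, run_agent_exploit):
    """Return True if the agent's exploit drained ETH from the target contract.

    rpc_url           -- local sandboxed node (e.g. a forked test chain), never mainnet
    target            -- checksummed address of the deliberately vulnerable contract
    attacker          -- checksummed address controlled by the agent under test
    run_agent_exploit -- callback that lets the agent submit its transactions
    """
    w3 = Web3(Web3.HTTPProvider(rpc_url))

    target_before = w3.eth.get_balance(target)
    attacker_before = w3.eth.get_balance(attacker)

    run_agent_exploit(w3)  # the agent interacts only with the sandboxed chain

    target_after = w3.eth.get_balance(target)
    attacker_after = w3.eth.get_balance(attacker)

    # Crude pass/fail: the target lost ETH and the attacker captured some of it.
    return target_after < target_before and attacker_after > attacker_before
```

A fuller harness would presumably also track token balances and partial credit, but the before-and-after comparison captures the core of a fund-draining test.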
OpenAI explained the rationale behind the tool in a blog post on Wednesday. “Smart contracts secure billions of dollars in assets, and AI agents are likely to be transformative for both attackers and defenders,” the company stated. That framing sets the tone for why the benchmark was built in the first place.
The vulnerabilities used in EVMbench were drawn from sponsored open-code audit competitions. They also include security audits conducted for Tempo, a Layer 1 blockchain co-developed by Paradigm and Stripe. This gives the benchmark a real-world foundation rooted in active protocol development.
Early test results show a clear performance gap between AI models. GPT-5.3-Codex scored 72.2% in exploit mode, compared to GPT-5 at just 31.9%. However, coverage for vulnerability detection and patching tasks remains incomplete across both models.
Recent DeFi Attacks Add Urgency to AI-Driven Security Tools
The release of EVMbench follows a series of high-profile smart contract attacks in the DeFi space. Moonwell, a DeFi lending protocol, suffered an exploit this month involving vulnerable code written with AI assistance.
The incident raised fresh concerns about AI-generated code entering production environments without sufficient review.
Around the same time, CrossCurve, a cross-chain liquidity protocol, was compromised through a smart contract vulnerability.
The attack resulted in losses of roughly $3 million across multiple networks. Both incidents point to the growing financial risk tied to contract code that reaches production without rigorous review.
OpenAI addressed the broader stakes in its blog post directly. “As AI agents improve at reading, writing, and executing code, it becomes increasingly important to measure their capabilities in economically meaningful environments,” the company wrote. The statement reinforces why a structured benchmark like EVMbench is being introduced now.
Late last year, Anthropic published a separate report on the topic, arguing that AI agents have already advanced enough to identify smart contract vulnerabilities on their own.
If that holds, losses from crypto exploits could decline over time as AI-powered auditing becomes standard practice.