OpenZeppelin Uncovers Data Contamination in OpenAI EVMbench

What to Know
- At least four high-severity vulnerabilities in EVMbench were found to be non-exploitable, according to OpenZeppelin's audit
- Training data contamination undermined the benchmark because top-scoring AI agents had likely seen vulnerability reports during pretraining
- EVMbench launched in mid-February with 120 curated audits from 2024 to mid-2025 to test AI agents on smart contract security
- OpenZeppelin stressed that flawed benchmarks risk misrepresenting AI capabilities in blockchain security
OpenZeppelin has exposed critical methodological flaws in OpenAI's EVMbench, the artificial intelligence benchmark built to evaluate how well AI models handle blockchain security tasks. The security firm announced on Monday that its independent audit uncovered training data contamination and improperly classified vulnerabilities, raising concerns about the benchmark's reliability.
OpenZeppelin Audit Reveals Flawed Methodology
OpenZeppelin stated in an X post on Monday that it had put EVMbench through the same scrutiny it applies to the protocols it secures, including decentralized finance heavyweights Aave, Lido, and Uniswap. The firm said it welcomed the initiative but identified two core problems: training data contamination and invalid vulnerability classifications.
EVMbench was introduced in mid-February as a joint effort between OpenAI and crypto investment firm Paradigm. The benchmark measures how effectively AI models can identify, patch, and exploit smart contract vulnerabilities, drawing from 120 audits conducted between 2024 and mid-2025.
What Data Contamination Did OpenZeppelin Find?
The most critical flaw involves training data leakage. OpenZeppelin emphasized that the most important capability in AI security is finding novel vulnerabilities in code a model has never seen. However, the top-scoring agents on EVMbench, including Anthropic's Claude Open 4.6, OpenAI's OC-GPT-5.2, and Google's Gemini 3 Pro, had knowledge cutoffs extending into mid-2025.
Because the benchmark dataset drew from audits conducted between 2024 and mid-2025, OpenZeppelin concluded these top performers had likely been exposed to the vulnerability reports during pretraining. Internet access was disabled during testing to prevent agents from looking up solutions, but the models may have already stored the answers in their parameters.
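The reasoning above reduces to a date comparison: if an audit was published before a model's knowledge cutoff, its report may already sit in the model's training data. A minimal sketch, using hypothetical model names and cutoff dates purely for illustration (real cutoffs are published by each vendor and may differ):

```python
from datetime import date

# Hypothetical cutoffs for illustration only -- not the actual
# cutoffs of any named model.
MODEL_CUTOFFS = {
    "model-a": date(2025, 6, 30),
    "model-b": date(2025, 3, 31),
}

def contamination_risk(audit_date: date, model: str) -> bool:
    """True if the audit predates the model's knowledge cutoff,
    meaning its vulnerability report may have been seen in pretraining."""
    return audit_date <= MODEL_CUTOFFS[model]

# An audit from late 2024 falls inside both hypothetical training windows,
# so disabling internet access at test time would not remove the risk.
print(contamination_risk(date(2024, 11, 1), "model-a"))  # True
print(contamination_risk(date(2025, 8, 1), "model-b"))   # False
```

This is why OpenZeppelin's objection cannot be answered by sandboxing alone: the check concerns when the data existed, not whether the agent can fetch it during evaluation.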
The firm cautioned that the dataset's limited scope amplifies these risks. "While this does not necessarily enable the model to identify the issue immediately, it reduces the quality of the test. The dataset's limited size further narrows the evaluation surface, making these contamination concerns more significant," OpenZeppelin said in a statement.
We reviewed the dataset and identified methodological flaws and invalid vulnerability classifications, including at least four issues labeled high severity that are not exploitable in practice.
— OpenZeppelin, via official statement
Invalid Vulnerability Classifications Undermine Results
Beyond data contamination, OpenZeppelin flagged factual errors in how EVMbench categorized vulnerabilities. The firm assessed at least four vulnerabilities labeled high severity and found that none of them functions as described. Despite this, EVMbench had been awarding AI agents credit for identifying these non-exploitable flaws.
OpenZeppelin was explicit that these discrepancies go beyond subjective severity disagreements. "These aren't subjective severity disagreements; they are findings where the described exploit doesn't work," the firm stated. The benchmark was effectively rewarding false positives, skewing performance scores. Paradigm co-developed EVMbench alongside OpenAI, and the findings raise questions about the review process before launch.
What This Means for AI in Blockchain Security
OpenZeppelin reiterated that artificial intelligence will play a transformative role in strengthening blockchain security. However, the firm stressed the technology must be evaluated with proper methodology to deliver on its promise. Flawed benchmarks risk overstating AI agent capabilities and could mislead developers protecting smart contracts holding billions in user funds.
The audit highlights a growing challenge: ensuring evaluations reflect genuine capability rather than memorized training data. As AI-driven security tools become more prevalent in decentralized finance, the industry will need benchmarks built on datasets outside model training windows with verified vulnerability classifications.
About the Author
Senior Crypto Journalist
Kevin Giorgin is a senior crypto journalist with over five years of experience covering Bitcoin, DeFi, and blockchain technology at Bitcoinomist.