MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use
About
MCP standardizes how LLMs interact with external systems, forming the foundation for general agents. However, existing MCP benchmarks remain narrow in scope: they focus on read-heavy tasks or tasks with limited interaction depth, and fail to capture the complexity and realism of real-world workflows. To address this gap, we propose MCPMark, a benchmark designed to evaluate MCP use in a more realistic and comprehensive manner. It consists of 127 high-quality tasks collaboratively created by domain experts and AI agents. Each task begins with a curated initial state and includes a programmatic script for automatic verification. These tasks demand richer and more diverse interactions with the environment, involving a broad range of create, read, update, and delete (CRUD) operations. We conduct a comprehensive evaluation of cutting-edge LLMs using a minimal agent framework that operates in a tool-calling loop. Empirical results show that the best-performing model, gpt-5-medium, reaches only 52.56% pass@1 and 33.86% pass^4, while other widely regarded strong models, including claude-sonnet-4 and o3, fall below 30% pass@1 and 15% pass^4. On average, LLMs require 16.2 execution turns and 17.4 tool calls per task, significantly surpassing those in previous MCP benchmarks and highlighting the stress-testing nature of MCPMark.
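The abstract reports pass@1 and pass^4 without defining them. The sketch below shows one common reading, stated here as an assumption rather than the authors' exact procedure: pass@1 is the per-run success rate averaged over tasks, and pass^k counts a task as solved only if all k independent runs succeed. The function names, the `results` layout, and the sample data are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def pass_at_1(results: Dict[str, List[bool]]) -> float:
    """Average, over tasks, of the fraction of runs that succeeded."""
    per_task = [sum(runs) / len(runs) for runs in results.values()]
    return sum(per_task) / len(per_task)


def pass_hat_k(results: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks for which the first k runs ALL succeeded."""
    for task, runs in results.items():
        if len(runs) < k:
            raise ValueError(f"task {task!r} has fewer than {k} runs")
    solved = [all(runs[:k]) for runs in results.values()]
    return sum(solved) / len(solved)


# Hypothetical outcomes: four verification results per task.
results = {
    "task_a": [True, True, True, True],      # stable success
    "task_b": [True, False, True, True],     # flaky: counts toward pass@1 only
    "task_c": [False, False, False, False],  # stable failure
}

print(f"pass@1 = {pass_at_1(results):.4f}")
print(f"pass^4 = {pass_hat_k(results, 4):.4f}")
```

Under this reading, pass^4 can never exceed pass@1, so the gap between the two (for example 52.56% versus 33.86% for gpt-5-medium) indicates how many tasks a model solves only inconsistently.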
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Static Replay Attack Detection | 100 Targeted Seed Episodes (redcode_style profile) | AMR 80.43 | 5 |
| Static Replay Attack Detection | 100 Targeted Seed Episodes profile (orig) | AMR 2.17 | 5 |
| Malicious Action Detection | stress test 100-seed redcode_style profile | AMR 80.43 | 4 |
| Malicious Action Detection | 100-seed stress orig profile (test) | AMR 2.17 | 4 |