SWE-Dev: Building Software Engineering Agents with Training and Inference Scaling
About
Large language models (LLMs) have advanced rapidly from conversational problem solving to addressing real-world tasks involving tool use, such as software engineering (SWE). Recent LLM-powered toolkits, such as OpenAI Codex and Cursor, offer end-to-end automation of the software development process. However, building effective SWE agents remains challenging due to the lack of high-quality training data and effective test cases. To address these issues, we present SWE-Dev, an SWE agent built upon open-source LLMs. First, we develop a robust pipeline to synthesize test cases for patch evaluation. Second, we scale up agent trajectories to construct the training data for SWE-Dev. Experiments on the SWE-bench Verified benchmark show that the SWE-Dev models achieve top performance among all open SWE agents: the 7B and 32B parameter models reach success rates of 23.4% and 36.6%, respectively, outperforming state-of-the-art open-source models. All code, models, and datasets are publicly available at https://github.com/THUDM/SWE-Dev.
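The first pipeline step, synthesizing test cases for patch evaluation, typically relies on a fail-to-pass check: a synthesized test is kept only if it fails on the unpatched repository and passes once the reference patch is applied. The sketch below illustrates that filter with a hypothetical `run_test` runner (the names and interface are illustrative assumptions, not the actual SWE-Dev API):

```python
# Minimal sketch of fail-to-pass filtering for synthesized test cases.
# `run_test` and `filter_fail_to_pass` are hypothetical names; the actual
# SWE-Dev pipeline may differ in interface and implementation details.

def filter_fail_to_pass(tests, run_test):
    """Keep only tests that fail on the buggy code and pass on the patched code.

    run_test(test, patched) -> bool is an assumed runner: it executes `test`
    against the repository with (patched=True) or without (patched=False)
    the reference patch applied, returning True if the test passes.
    """
    return [
        t for t in tests
        if not run_test(t, patched=False) and run_test(t, patched=True)
    ]

# Toy usage: simulate a runner where only test "a" is genuinely fail-to-pass;
# "b" passes even without the patch, so it cannot discriminate patches.
outcomes = {
    ("a", False): False, ("a", True): True,
    ("b", False): True,  ("b", True): True,
}
runner = lambda t, patched: outcomes[(t, patched)]
print(filter_fail_to_pass(["a", "b"], runner))  # ['a']
```

Tests that survive this filter can then score candidate patches produced during trajectory collection: a candidate succeeds only if it makes the retained tests pass.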
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Software Engineering | SWE-bench Verified | Success Rate | 36.6 | 29 |
| Medical Agent Task Execution | MedAgentBench | Success Rate | 14.2 | 24 |
| Software Engineering Tool Use | SWE-bench Verified | Success Rate | 19.5 | 12 |
| General Deep Research Tool Use | HLE | Success Rate | 6.9 | 12 |
| Domain Deep Research Tool Use | FinSearchComp Global-T2 | Success Rate | 30.5 | 12 |
| Financial Specialist Tool Use | Finance Agent Benchmark | Success Rate | 3 | 12 |
| General Deep Research Tool Use | GAIA | Success Rate | 23.2 | 12 |
| General Deep Research Tool Use | Browsecomp | Success Rate | 1.6 | 12 |
| General Deep Research Tool Use | xbench DeepSearch | Success Rate | 31.6 | 12 |
| In-distribution Tool Use | DIVE-Eval | Success Rate | 13.8 | 12 |