SWE-Dev: Building Software Engineering Agents with Training and Inference Scaling
About
Large language models (LLMs) have advanced rapidly from conversational problem solving to addressing real-world tasks involving tool use, such as software engineering (SWE). Recent LLM-powered toolkits, such as OpenAI Codex and Cursor, offer end-to-end automation of the software development process. However, building effective SWE agents remains challenging due to the lack of high-quality training data and effective test cases. To address these issues, we present SWE-Dev, an SWE agent built upon open-source LLMs. First, we develop a robust pipeline to synthesize test cases for patch evaluation. Second, we scale up agent trajectories to construct the training data for SWE-Dev. Experiments on the SWE-bench Verified benchmark show that the SWE-Dev models achieve top performance among all open SWE agents: the 7B and 32B parameter models reach success rates of 23.4% and 36.6%, respectively, outperforming state-of-the-art open-source models. All code, models, and datasets are publicly available at https://github.com/THUDM/SWE-Dev.
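The first pipeline step, synthesizing test cases for patch evaluation, typically relies on a fail-to-pass check: a synthesized test is kept only if it fails on the unpatched repository and passes once the reference patch is applied. The sketch below illustrates that filter with a hypothetical `run_test` runner (the names and interface are illustrative assumptions, not the actual SWE-Dev API):

```python
# Minimal sketch of fail-to-pass filtering for synthesized test cases.
# `run_test` and `filter_fail_to_pass` are hypothetical names; the actual
# SWE-Dev pipeline may differ in interface and implementation details.

def filter_fail_to_pass(tests, run_test):
    """Keep only tests that fail on the buggy code and pass on the patched code.

    run_test(test, patched) -> bool is an assumed runner: it executes `test`
    against the repository with (patched=True) or without (patched=False)
    the reference patch applied, returning True if the test passes.
    """
    return [
        t for t in tests
        if not run_test(t, patched=False) and run_test(t, patched=True)
    ]

# Toy usage: simulate a runner where only test "a" is genuinely fail-to-pass;
# "b" passes even without the patch, so it cannot discriminate patches.
outcomes = {
    ("a", False): False, ("a", True): True,
    ("b", False): True,  ("b", True): True,
}
runner = lambda t, patched: outcomes[(t, patched)]
print(filter_fail_to_pass(["a", "b"], runner))  # ['a']
```

Tests that survive this filter can then score candidate patches produced during trajectory collection: a candidate succeeds only if it makes the retained tests pass.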
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Software Engineering | SWE-bench Verified | Success Rate | 36.6 | 29 |
| Medical Agent Task Execution | MedAgentBench | Success Rate | 14.2 | 24 |
| Software Engineering Tool Use | SWE-bench Verified | Success Rate | 19.5 | 12 |
| General Deep Research Tool Use | HLE | Success Rate | 6.9 | 12 |
| Domain Deep Research Tool Use | FinSearchComp Global-T2 | Success Rate | 30.5 | 12 |
| Financial Specialist Tool Use | Finance Agent Benchmark | Success Rate | 3 | 12 |
| General Deep Research Tool Use | GAIA | Success Rate | 23.2 | 12 |
| General Deep Research Tool Use | Browsecomp | Success Rate | 1.6 | 12 |
| General Deep Research Tool Use | xbench DeepSearch | Success Rate | 31.6 | 12 |
| In-distribution Tool Use | DIVE-Eval | Success Rate | 13.8 | 12 |