
SWE-Dev: Building Software Engineering Agents with Training and Inference Scaling

About

Large language models (LLMs) have advanced rapidly from conversational problem solving to addressing real-world tasks involving tool use, such as software engineering (SWE). Recent LLM-powered toolkits, such as OpenAI Codex and Cursor, offer end-to-end automation of the software development process. However, building effective SWE agents remains challenging due to the lack of high-quality training data and effective test cases. To address these issues, we present SWE-Dev, an SWE agent built upon open-source LLMs. First, we develop a robust pipeline to synthesize test cases for patch evaluation. Second, we scale up agent trajectories to construct the training data for building SWE-Dev. Experiments on the SWE-bench-Verified benchmark show that the SWE-Dev models achieve top performance among all open SWE agents. Specifically, the success rates of the SWE-Dev 7B and 32B parameter models reach 23.4% and 36.6%, respectively, outperforming state-of-the-art open-source models. All code, models, and datasets are publicly available at https://github.com/THUDM/SWE-Dev.
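The synthesized test cases serve as the pass/fail signal for candidate patches. A minimal sketch of the SWE-bench-style resolution criterion such tests feed into (function and parameter names here are illustrative, not SWE-Dev's actual API): a patch resolves an issue only if every previously failing test now passes and no previously passing test regresses.

```python
def patch_resolves_issue(fail_to_pass: list[bool], pass_to_pass: list[bool]) -> bool:
    """Judge a candidate patch from its test outcomes (illustrative sketch).

    fail_to_pass: outcomes of tests that failed before the patch and must
                  now pass (SWE-bench's FAIL_TO_PASS set).
    pass_to_pass: outcomes of tests that passed before the patch and must
                  keep passing (the PASS_TO_PASS regression set).
    """
    # Success requires fixing the issue AND introducing no regressions.
    return all(fail_to_pass) and all(pass_to_pass)


# Example: the fix works and nothing regresses -> resolved.
print(patch_resolves_issue([True, True], [True, True, True]))   # True
# Example: one regression test breaks -> not resolved.
print(patch_resolves_issue([True, True], [True, False, True]))  # False
```

The success rates reported below are simply the fraction of benchmark issues for which this criterion holds.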

Haoran Wang, Zhenyu Hou, Yao Wei, Jie Tang, Yuxiao Dong • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Software Engineering | SWE-bench Verified | Success Rate 36.6 | 29 |
| Medical Agent Task Execution | MedAgentBench | Success Rate 14.2 | 24 |
| Software Engineering Tool Use | SWE-bench Verified | Success Rate 19.5 | 12 |
| General Deep Research Tool Use | HLE | Success Rate 6.9 | 12 |
| Domain Deep Research Tool Use | FinSearchComp Global-T2 | Success Rate 30.5 | 12 |
| Financial Specialist Tool Use | Finance Agent Benchmark | Success Rate 3 | 12 |
| General Deep Research Tool Use | GAIA | Success Rate 23.2 | 12 |
| General Deep Research Tool Use | BrowseComp | Success Rate 1.6 | 12 |
| General Deep Research Tool Use | xbench DeepSearch | Success Rate 31.6 | 12 |
| In-distribution Tool Use | DIVE-Eval | Success Rate 13.8 | 12 |

(10 of 12 rows shown.)
