SWE

Benchmarks

Task Name	Dataset Name	Metric	SOTA Result	
Software Engineering	SWE Lite	Throughput (tok/s)	10,538	30
Shallow Water Equations Simulation	SWE	Base Loss	0	29
Model Learning from Noisy Data	SWE (Shallow Water Equations) system	Full-field Avg Relative Error	4.27	18
Software Engineering	SWE Verified	Resolution Rate	77.2	17
Software Engineering Repair	SWE Multi	SWE Average Score	40	10
Sensor-Space Imputation	SWE Synthetic PDE	RMSE	0.0462	9
Global field reconstruction	SWE Synthetic PDE	RMSE	0.1799	9
Role clarity	SWE (dev total)	Total Role Clarity Score	90.79	8
Role clarity	SWE hard (dev)	Role Clarity Score	90.76	8
Role clarity	SWE easy (dev)	Role Clarity Score	0.9081	8
Code	SWE Verified Agentless	pass@1	57.6	8
Software Engineering	SWE	MTP Acceptance Rate	81.9	5
Software Engineering Automation	SWE Multilingual	Resolved	70.2	5
Role Consistency	SWE dev full set (test)	Total Overstepping Rate (<INFO>)	8.4	4
Role Consistency	SWE Dev hard (test)	Overstepping Rate (<INFO>)	6.8	4
Role Consistency	SWE easy subset dev (test)	Overstepping Rate (<INFO>)	10	4
Multi-Agent Collaboration Role Overstepping	SWE total full set (dev)	Overstepping Rate (<INFO>)	0.2	4
Multi-Agent Collaboration Role Overstepping	SWE hard subset (dev)	Overstepping Rate (<INFO>)	0	4
Multi-Agent Collaboration Role Overstepping	SWE easy (dev)	Overstepping Rate (<INFO>)	0.4	4
Watermark Detection	SWE (test)	Delta Q (Δ̂q)	0.71	4
Agent Trajectory Performance	SWE (test)	Pass@1 Accuracy (%)	12.7	4
Historical normalization	swe historical normalization (test)	Accuracy	0.579	4
Language Modeling	SWE	Perplexity	1.216	3
Solution Prediction	SWE	Relative L2 Error (Data)	2.15	3
Task Generation	SWE OOD (held-out repositories)	Utility	19.59	2

Showing 25 of 29 rows