
SWE

Benchmarks

Task Name | Dataset Name | SOTA Result | Trend
Model Learning from Noisy Data | SWE (Shallow Water Equations) system | Full-field Avg Relative Error: 4.27 | 18
Software Engineering | SWE Verified | Resolution Rate: 77.2 | 17
Role clarity | SWE (dev total) | Total Role Clarity Score: 90.79 | 8
Role clarity | SWE hard (dev) | Role Clarity Score: 90.76 | 8
Role clarity | SWE easy (dev) | Role Clarity Score: 0.9081 | 8
Code | SWE Verified Agentless | pass@1: 57.6 | 8
Software Engineering Automation | SWE Multilingual | Resolved: 70.2 | 5
Role Consistency | SWE dev full set (test) | Total Overstepping Rate (<INFO>): 8.4 | 4
Role Consistency | SWE Dev hard (test) | Overstepping Rate (<INFO>): 6.8 | 4
Role Consistency | SWE easy subset dev (test) | Overstepping Rate (<INFO>): 10 | 4
Multi-Agent Collaboration Role Overstepping | SWE total full set (dev) | Overstepping Rate (<INFO>): 0.2 | 4
Multi-Agent Collaboration Role Overstepping | SWE hard subset (dev) | Overstepping Rate (<INFO>): 0 | 4
Multi-Agent Collaboration Role Overstepping | SWE easy (dev) | Overstepping Rate (<INFO>): 0.4 | 4
Watermark Detection | SWE (test) | Delta Q (Δ̂q): 0.71 | 4
Agent Trajectory Performance | SWE (test) | Pass@1 Accuracy (%): 12.7 | 4
Historical normalization | swe historical normalization (test) | Accuracy: 0.579 | 4
Solution Prediction | SWE | Relative L2 Error (Data): 2.15 | 3
Software Engineering | SWE Verified MEDIUM reasoning | Overall Score: 53.3 | 2
Learning PDE Dynamics | SWE | Relative L2 Error: 0.005 | 2
State Rollout | SWE | Metric: - | 0