| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Model Learning from Noisy Data | SWE (Shallow Water Equations) system | Full-field Avg Relative Error | 4.27 | 18 |
| Software Engineering | SWE Verified | Resolution Rate | 77.2 | 17 |
| Role clarity | SWE (dev total) | Total Role Clarity Score | 90.79 | 8 |
| Role clarity | SWE hard (dev) | Role Clarity Score | 90.76 | 8 |
| Role clarity | SWE easy (dev) | Role Clarity Score | 0.9081 | 8 |
| Code | SWE Verified Agentless | pass@1 | 57.6 | 8 |
| Software Engineering Automation | SWE Multilingual | Resolved | 70.2 | 5 |
| Role Consistency | SWE dev full set (test) | Total Overstepping Rate (`<INFO>`) | 8.4 | 4 |
| Role Consistency | SWE Dev hard (test) | Overstepping Rate (`<INFO>`) | 6.8 | 4 |
| Role Consistency | SWE easy subset dev (test) | Overstepping Rate (`<INFO>`) | 10 | 4 |
| Multi-Agent Collaboration Role Overstepping | SWE total full set (dev) | Overstepping Rate (`<INFO>`) | 0.2 | 4 |
| Multi-Agent Collaboration Role Overstepping | SWE hard subset (dev) | Overstepping Rate (`<INFO>`) | 0 | 4 |
| Multi-Agent Collaboration Role Overstepping | SWE easy (dev) | Overstepping Rate (`<INFO>`) | 0.4 | 4 |
| Watermark Detection | SWE (test) | Delta Q (Δ̂q) | 0.71 | 4 |
| Agent Trajectory Performance | SWE (test) | Pass@1 Accuracy (%) | 12.7 | 4 |
| Historical normalization | swe historical normalization (test) | Accuracy | 0.579 | 4 |
| Solution Prediction | SWE | Relative L2 Error (Data) | 2.15 | 3 |
| Software Engineering | SWE Verified MEDIUM reasoning | Overall Score | 53.3 | 2 |
| Learning PDE Dynamics | SWE | Relative L2 Error | 0.005 | 2 |
| State Rollout | SWE | Metric | - | 0 |