Role Consistency on SWE Dev hard (test)
[Chart: Overstepping Rate (<INFO>) by date. Data point: ChatDev (Joint FT), 6.8, Apr 3, 2026.]
Updated 13d ago
Evaluation Results
| Method | Configuration | Date | Overstepping Rate (<INFO>) | Overstepping Rate (INFO) | Δ Overstepping Rate (%) (<INFO>) | Δ Overstepping Rate (%) (INFO) |
|---|---|---|---|---|---|---|
| ChatDev (Joint FT) | CEO agent training=FT,... | 2026.04 | 6.8 | 7.6 | -42.9 | -60.6 |
| ChatDev (Base CEO, FT CPO) | CEO agent training=Bas... | 2026.04 | 16.8 | 30.4 | -32.9 | -37.8 |
| ChatDev (FT CEO, Base CPO) | CEO agent training=FT,... | 2026.04 | 44 | 44 | -5.7 | -24.2 |
| ChatDev (Base-Base) | CEO agent training=Bas... | 2026.04 | 49.7 | 68.2 | - | - |
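The page does not define the Δ columns. The numbers are consistent with a simple difference, in percentage points, between each method's overstepping rate and that of the ChatDev (Base-Base) row, for example 6.8 - 49.7 = -42.9. The sketch below checks that reading against every row. It is an inference from the reported values, not a documented formula, and the names `rates`, `BASELINE`, and `deltas_vs_baseline` are illustrative only.

```python
# Overstepping rates copied from the results table.
# Each tuple is (value in the "(<INFO>)" column, value in the "(INFO)" column).
rates = {
    "ChatDev (Joint FT)": (6.8, 7.6),
    "ChatDev (Base CEO, FT CPO)": (16.8, 30.4),
    "ChatDev (FT CEO, Base CPO)": (44.0, 44.0),
    "ChatDev (Base-Base)": (49.7, 68.2),
}

# Assumption: the Base-Base row is the reference the deltas are measured against.
BASELINE = "ChatDev (Base-Base)"


def deltas_vs_baseline(rates, baseline):
    """Return the per-column difference (method minus baseline), rounded to 0.1."""
    base = rates[baseline]
    return {
        name: tuple(round(value - ref, 1) for value, ref in zip(values, base))
        for name, values in rates.items()
        if name != baseline
    }


deltas = deltas_vs_baseline(rates, BASELINE)
for name, (d_first, d_second) in deltas.items():
    print(f"{name}: {d_first:+.1f} / {d_second:+.1f}")
```

Under this reading, the computed differences match the Δ values reported for all three non-baseline rows, and the baseline row has no Δ because it would be compared with itself.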