Quality Evaluation on Software Reasoning Tasks
[Chart: Functional Correctness score over time; top result 35.2 (Force Strong) as of Jan 27, 2026. Selectable metrics: Functional Correctness, Robustness & Security, Engineering Quality, Code Style.]
Updated 3mo ago
Evaluation Results
| Method | Method (details) | Links | Functional Correctness | Robustness & Security | Engineering Quality | Code Style |
|---|---|---|---|---|---|---|
| Force Strong | Routing Strategy=Force... | 2026.01 | 35.2 | 26 | 17.8 | 8.5 |
| CASTER | Routing Strategy=CASTER | 2026.01 | 34.5 | 25.2 | 16.1 | 9.2 |
| Force Weak | Routing Strategy=Force... | 2026.01 | 31.8 | 25.2 | 16.5 | 8.5 |