Quality Evaluation on Data reasoning tasks
[Chart: Correctness. Highlighted point: Force Weak, 37.9 (Jan 27, 2026). Metrics available: Correctness, Code Style & Vis, Robustness & Safety, Efficiency.]
Updated 3mo ago
Evaluation Results
| Method | Method (details) | Date | Correctness | Code Style & Vis | Robustness & Safety | Efficiency |
|---|---|---|---|---|---|---|
| Force Weak | Routing Strategy=Force... | 2026.01 | 37.9 | 27.6 | 17.8 | 9.9 |
| Force Strong | Routing Strategy=Force... | 2026.01 | 37.4 | 27.1 | 16.8 | 10 |
| CASTER | Routing Strategy=CASTER | 2026.01 | 37.4 | 27.4 | 17.4 | 9.9 |