Share your thoughts, 1 month free Claude Pro on us
See more
Home
/
Benchmarks
Reasoning/Planning on Today
Loading...
81.7
Accuracy
ANCHOR
57.364
63.682
70
76.318
May 11, 2026
Accuracy
Updated 22d ago
Evaluation Results
Method
Method
Links
Accuracy
ANCHOR
Backbone=Qwen2.5-72B,...
2026.05
81.7
CoT
Backbone=DeepSeek-V3-6...
2026.05
80.3
CoT
Backbone=DeepSeek-V3-6...
2026.05
78.3
BIRD
Backbone=Qwen2.5-72B,...
2026.05
75.4
CoT
Backbone=Qwen2.5-72B,...
2026.05
71.3
Vanilla
Backbone=DeepSeek-V3-6...
2026.05
71
Vanilla
Backbone=DeepSeek-V3-6...
2026.05
64.7
Vanilla
Backbone=Qwen2.5-72B,...
2026.05
59.7
CoT
Backbone=Qwen2.5-72B,...
2026.05
59.3
Vanilla
Backbone=Qwen2.5-72B,...
2026.05
58.3
Feedback
Search any
task
Search any
task