Long-horizon agentic task on ResearchRubrics
[Chart: Performance (y-axis) by date; top result 49.36, AggAgent, Apr 13, 2026]
Updated 5d ago
Evaluation Results
| Method | Configuration | Date | Performance |
|---|---|---|---|
| AggAgent | Backbone=Qwen3.5-122B,... | 2026.04 | 49.36 |
| AggAgent | Backbone=MiniMax-M2.5,... | 2026.04 | 45.42 |
| AggAgent | Backbone=GLM-4.7-Flash... | 2026.04 | 45.31 |
| Solution Aggregation | Backbone=MiniMax-M2.5,... | 2026.04 | 44 |
| Best-of-N | Backbone=Qwen3.5-122B,... | 2026.04 | 42.37 |
| Solution Aggregation | Backbone=Qwen3.5-122B,... | 2026.04 | 42.1 |
| Pass@1 | Backbone=Qwen3.5-122B,... | 2026.04 | 40.5 |
| Summary Aggregation | Backbone=MiniMax-M2.5,... | 2026.04 | 40.29 |
| Pass@1 | Backbone=MiniMax-M2.5,... | 2026.04 | 39.97 |
| Fewest Tool Calls | Backbone=Qwen3.5-122B,... | 2026.04 | 39.58 |
| Best-of-N | Backbone=MiniMax-M2.5,... | 2026.04 | 39 |
| Fewest Tool Calls | Backbone=MiniMax-M2.5,... | 2026.04 | 38.44 |
| Best-of-N | Backbone=GLM-4.7-Flash... | 2026.04 | 37.7 |
| Pass@1 | Backbone=GLM-4.7-Flash... | 2026.04 | 37.47 |
| Summary Aggregation | Backbone=Qwen3.5-122B,... | 2026.04 | 37.47 |
| Solution Aggregation | Backbone=GLM-4.7-Flash... | 2026.04 | 36.84 |
| Fewest Tool Calls | Backbone=GLM-4.7-Flash... | 2026.04 | 35.21 |
| Summary Aggregation | Backbone=GLM-4.7-Flash... | 2026.04 | 31.72 |