Long-horizon agentic task on DeepSearchQA
[Figure: Performance chart. Highlighted point: AggAgent, 66, Apr 13, 2026.]
Updated 5d ago
Evaluation Results
Method                 Details                     Date      Performance
AggAgent               Backbone=Qwen3.5-122B,...   2026.04   66
AggAgent               Backbone=MiniMax-M2.5,...   2026.04   65.33
Summary Aggregation    Backbone=Qwen3.5-122B,...   2026.04   64
Best-of-N              Backbone=MiniMax-M2.5,...   2026.04   64
Solution Aggregation   Backbone=Qwen3.5-122B,...   2026.04   62.67
Solution Aggregation   Backbone=MiniMax-M2.5,...   2026.04   62
Summary Aggregation    Backbone=MiniMax-M2.5,...   2026.04   61.33
Best-of-N              Backbone=Qwen3.5-122B,...   2026.04   57.33
Fewest Tool Calls      Backbone=MiniMax-M2.5,...   2026.04   56
Fewest Tool Calls      Backbone=Qwen3.5-122B,...   2026.04   54.67
Pass@1                 Backbone=MiniMax-M2.5,...   2026.04   54.42
AggAgent               Backbone=GLM-4.7-Flash...   2026.04   49.33
Pass@1                 Backbone=Qwen3.5-122B,...   2026.04   49.25
Summary Aggregation    Backbone=GLM-4.7-Flash...   2026.04   47.33
Solution Aggregation   Backbone=GLM-4.7-Flash...   2026.04   46
Best-of-N              Backbone=GLM-4.7-Flash...   2026.04   35.33
Fewest Tool Calls      Backbone=GLM-4.7-Flash...   2026.04   33.33
Pass@1                 Backbone=GLM-4.7-Flash...   2026.04   32.42