Long-horizon agentic task on HealthBench Hard
[Chart: Performance over time. Leading entry: AggAgent, 28.06 (Apr 13, 2026).]
Updated 5d ago
Evaluation Results
Method                 Configuration                Date      Performance
AggAgent               Backbone=Qwen3.5-122B,...    2026.04   28.06
AggAgent               Backbone=GLM-4.7-Flash...    2026.04   27.99
Solution Aggregation   Backbone=Qwen3.5-122B,...    2026.04   26.3
AggAgent               Backbone=MiniMax-M2.5,...    2026.04   24.46
Summary Aggregation    Backbone=Qwen3.5-122B,...    2026.04   23
Solution Aggregation   Backbone=MiniMax-M2.5,...    2026.04   21.84
Summary Aggregation    Backbone=MiniMax-M2.5,...    2026.04   16.92
Solution Aggregation   Backbone=GLM-4.7-Flash...    2026.04   15.72
Best-of-N              Backbone=Qwen3.5-122B,...    2026.04   13.01
Pass@1                 Backbone=Qwen3.5-122B,...    2026.04   12.87
Fewest Tool Calls      Backbone=Qwen3.5-122B,...    2026.04   12.83
Best-of-N              Backbone=GLM-4.7-Flash...    2026.04   9.91
Pass@1                 Backbone=MiniMax-M2.5,...    2026.04   9.67
Fewest Tool Calls      Backbone=GLM-4.7-Flash...    2026.04   8.9
Best-of-N              Backbone=MiniMax-M2.5,...    2026.04   8.79
Pass@1                 Backbone=GLM-4.7-Flash...    2026.04   8.67
Summary Aggregation    Backbone=GLM-4.7-Flash...    2026.04   7.35
Fewest Tool Calls      Backbone=MiniMax-M2.5,...    2026.04   5.34