Multi-task Evaluation on Aggregate (HealthQA, ARC-C, PopQA, Squad1, ASQA)
[Chart: Average Score over time. Best result: 62.8 (AMATA 8B), May 17, 2026.]
Updated 15d ago
Evaluation Results
Method          Details                    Date     Average Score
AMATA 8B        Model Scale=8B, Backbo...  2026.05  62.8
GiGPO 8B        Model Scale=8B, Backbo...  2026.05  60.1
SPA-RL 8B       Model Scale=8B, Backbo...  2026.05  59.76
SMART 8B        Model Scale=8B, Backbo...  2026.05  59.29
SelfRag 8B      Model Scale=8B, Backbo...  2026.05  51.81
GPT4o           Backbone Architecture=...  2026.05  49.17
Llama-3-Ins.8B  Model Scale=8B, Backbo...  2026.05  35.25
RADIT 8B        Model Scale=8B, Backbo...  2026.05  35.2