General Reasoning Average on Aggregate (OBQA, CSQA, SIQA, ARC, MMLU, GSM8K-MC, AQUA)
Top result: 86.21 Average Accuracy (IoT)
[Chart: Average Accuracy by date; data point at Mar 15, 2026]
Updated 10d ago
Evaluation Results
| Method | Model | Links | Average Accuracy |
|--------|-------|-------|------------------|
| IoT | GPT-4o mini | 2026.03 | 86.21 |
| CoT | GPT-4o mini | 2026.03 | 85.23 |
| IoT | Olmo-2-13B | 2026.03 | 80.69 |
| SC | Olmo-2-13B | 2026.03 | 80.04 |
| CoT | Olmo-2-13B | 2026.03 | 78.75 |
| IoT | Olmo-2-7B | 2026.03 | 77.35 |
| EoT | Olmo-2-13B | 2026.03 | 76.13 |
| CoT | Olmo-2-7B | 2026.03 | 75.38 |
| IoT | Llama-3.3-8B | 2026.03 | 75.22 |
| SC | Llama-3.3-8B | 2026.03 | 74.89 |
| SC | Olmo-2-7B | 2026.03 | 74.81 |
| EoT | Llama-3.3-8B | 2026.03 | 72.38 |
| CoT | Llama-3.3-8B | 2026.03 | 72.28 |
| EoT | Olmo-2-7B | 2026.03 | 71.93 |