Compositional Reasoning on Harder-set
Qwen2.5-7B: 51.3 Strict Accuracy (May 26, 2026)
[Chart omitted. Available metrics: Strict Accuracy, Raw Accuracy, Semantic Accuracy, Adjusted Resolution Accuracy]
Updated 7d ago
Evaluation Results
| Method | Parameters | Date | Strict Accuracy | Raw Accuracy | Semantic Accuracy | Adjusted Resolution Accuracy |
|---|---|---|---|---|---|---|
| Qwen2.5-7B | 7B | 2026.05 | 51.3 | 47.6 | 50.4 | 50.4 |
| Mistral-7B | 7B | 2026.05 | 47.9 | 100 | 47.9 | 47.9 |
| DeepHermes-3-8B | 8B | 2026.05 | 46.7 | 71.4 | 46.7 | 40.2 |
| Qwen3-8B | 8B | 2026.05 | 3.4 | 14.3 | 2.8 | 3.4 |
| Llama2-7B-Chat | 7B | 2026.05 | 0 | - | - | 0 |
| Llama2-13B-Chat | 13B | 2026.05 | 0 | 76 | 76 | 0 |