Compositional Reasoning on D4 V2 (test)
[Chart: Stability (%). Highlighted result: Qwen2.5-7B-Inst, 89.9, May 26, 2026.]
Metrics: Stability (%); Residual Composition Failure Rate (d=2, d=4, d=6, d=8)
Updated 7d ago
Evaluation Results
Method           Details                    Date     Stability (%)  Residual Composition Failure Rate
                                                                    d=2    d=4    d=6    d=8
Qwen2.5-7B-Inst  Post-training recipe=R...  2026.05  89.9           12     69.8   78.9   81.1
DeepHermes-3     Post-training recipe=S...  2026.05  88             47.4   61.9   73.1   100
Qwen3-8B         Post-training recipe=R...  2026.05  87.7           0      55.3   78.9   100
Mistral-7B-Inst  Post-training recipe=R...  2026.05  78.8           25     55.6   66.7   -