Scientific-fact composition and temporal reasoning on E3 Cross-domain pilot
[Chart: Example Stability Count by model; highlighted value 0.438 (Qwen2.5-7B); May 26, 2026. Selectable metrics: Example Stability Count, Consistency Stable Count, Gate Score, Consistency Failure Rate.]
Updated 7d ago
Evaluation Results
Method           Parameters  Date     Example Stability Count  Consistency Stable Count  Gate Score  Consistency Failure Rate
Qwen2.5-7B       7B          2026.05  0.438                    0.6983                    98          26
Qwen3-8B         8B          2026.05  0.3674                   0.7129                    100         21
DeepHermes-3-8B  8B          2026.05  0.3212                   0.6764                    88          21
Mistral-7B       7B          2026.05  0.2774                   0.6107                    77          25