Linguistically Diverse Reasoning on ProverQA
[Chart: Accuracy (Easy) over time, with selectable series Accuracy (Easy), Accuracy (Medium), and Accuracy (Hard). Best Accuracy (Easy): 94 (AR), Oct 21, 2025.]
Evaluation Results
| Method | Details | Date | Accuracy (Easy) | Accuracy (Medium) | Accuracy (Hard) |
|---|---|---|---|---|---|
| AR | Model Backbone=Gemma2 9B | 2025.10 | 94 | 91.4 | 69.6 |
| AR | Model Backbone=Llama3.... | 2025.10 | 92.8 | 91 | 70.8 |
| GPT-4o | | 2025.10 | 81 | 65.4 | 46.4 |
| Llama3.1 70B it | Model Backbone=Llama3.... | 2025.10 | 74.8 | 58.8 | 41 |
| Gemma2 27B it | Model Backbone=Gemma2... | 2025.10 | 74.8 | 69 | 46.8 |
| DeepSeek-R1-8B | Model Backbone=DeepSee... | 2025.10 | 65.6 | 58.6 | 44.2 |
| Llama3.1 8B | Model Backbone=Llama3.... | 2025.10 | 43.6 | 33.6 | 36.8 |
| Gemma2 9B | Model Backbone=Gemma2 9B | 2025.10 | 39.4 | 29.8 | 25.8 |