Scientific Reasoning on TheoremQA
Top result: TAB, 42.3 Accuracy (Apr 6, 2026)
[Chart: Accuracy and Token Count]
Updated 11d ago
Evaluation Results
Method                Configuration              Date     Accuracy  Token Count
TAB                   Budget (B)=10k             2026.04  42.3      6,566
Static                Budget=4096                2026.04  42.1      10,754
TAB                   Budget (B)=8k              2026.04  41.5      4,987
TAB                   Budget (B)=5k              2026.04  40.9      3,141
LLM-Judge Multi-Turn  Selection Strategy=Mul...  2026.04  40.8      5,375
Static                Budget=2048                2026.04  40.3      5,988
Static                Budget=1024                2026.04  40        3,975
TAB                   Budget (B)=3k              2026.04  39.9      2,320
LLM-Judge Individual  Selection Strategy=Ind...  2026.04  39.8      5,916
Static                Budget=512                 2026.04  39.5      2,578
Static                Budget=256                 2026.04  38.1      1,778