Research Solution Evaluation on ICLR problems 2026 (test)
[Chart: Feasibility Win% by date. Point shown: Mistral-24B, 56 (Oct 6, 2025).]
Updated 1mo ago
Evaluation Results
| Method | Comparison Target | Date | Feasibility Win% | Feasibility p-value | Problem Solving Win% | Problem Solving p-value | Originality Win% | Originality p-value |
|---|---|---|---|---|---|---|---|---|
| Mistral-24B | Human | 2025.10 | 56 | 0.48 | 70.7 | 0.01 | 58.7 | 0.3 |
| Combined LLM | Human | 2025.10 | 52.6 | 0.68 | 73.5 | 0 | 65.2 | 0.006 |
| GPT-OSS-120B | Mist... | 2025.10 | 50 | 1 | 45.5 | 0.73 | 67.9 | 0.09 |
| GPT-OSS-120B | Human | 2025.10 | 48.9 | 1 | 76.2 | 0.0009 | 72.1 | 0.005 |