Reinforcement Learning on Tomato
[Chart: best True Score 6.28 (ORPO), Apr 13, 2026. Series: True Score, Proxy Score, Worst Score, Occurrence, Worst* Score.]
Updated 4d ago
Evaluation Results
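| Method  | Links                             | True Score | Proxy Score | Worst Score | Occurrence | Worst* Score |
|---------|-----------------------------------|------------|-------------|-------------|------------|--------------|
| ORPO    | Training Reward=Proxy... (2026.04) | 6.28       | 6.83        | -1.51       | 0.0003     | -1.51        |
| Max-Min | Training Reward=Proxy... (2026.04) | 4.56       | 4.68        | -1.37       | 0          | -1.37        |
| ORPO*   | Training Reward=Proxy... (2026.04) | 4          | 3.98        | -1.09       | 0          | -1.09        |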