Best-of-N evaluation on RMB
[Chart: Accuracy. Best result: PC2-based LLM-as-a-Judge, 59.69 (May 10, 2025).]
Evaluation Results
Method                       Evaluation Method   Date      Accuracy
PC2-based LLM-as-a-Judge     ours                2025.05   59.69
Naive Pointwise Evaluation   naive               2025.05   40.68