Best-of-N evaluation on RewardBench v2
[Chart: Accuracy over time. Top result: PC2-based LLM-as-a-Judge, 58.69 accuracy, May 10, 2025. Updated 1mo ago.]
Evaluation Results
Method                       Evaluation Method   Date      Accuracy
PC2-based LLM-as-a-Judge     ours                2025.05   58.69
Naive Pointwise Evaluation   naive               2025.05   55.51