
LLMBar

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Robustness Evaluation | LLMBar | Accuracy: 83.07 | 8 |
| LLM-as-a-Judge Calibration | LLMBar (test) | Test Risk (MSE): 0.194 | 7 |
| Reward Modeling | LLMBar (test) | Test MSE (Table): 0.2039 | 5 |