Share your thoughts, 1 month free Claude Pro on usSee more
WorkDL logo mark

AgentRewardBench

Benchmarks

Task NameDataset NameSOTA ResultTrend
Reward ModelingAgentRewardBench
Precision83.8
12
Web agent meta-evaluationAGENTREWARDBENCH
Overall Precision89.21
8
Reward VerificationAgentRewardBench VisualWebArena
Precision100
7
Showing 3 of 3 rows