Reward Verification on AgentRewardBench VisualWebArena
[Chart: Precision over time. Leading entry: Rule-based (VWA Oracle), Precision 100, Jul 15, 2025.]
Updated 1mo ago
Evaluation Results
| Method | Configuration | Date | Precision |
|---|---|---|---|
| Rule-based (VWA Oracle) | version=ours | 2025.07 | 100 |
| SGV | backbone=GPT-o4 | 2025.07 | 86 |
| Rule-based (VWA Oracle) | version=original | 2025.07 | 85 |
| SGV | backbone=Gemini 2.5, m... | 2025.07 | 80 |
| No-SGV Baseline | backbone=GPT-o4 | 2025.07 | 80 |
| WebJudge | backbone=GPT-o4, statu... | 2025.07 | 75 |
| No-SGV Baseline | backbone=Gemini 2.5, m... | 2025.07 | 73 |