CUAVerifierBench

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Agreement with process human labels | CUAVerifierBench Browserbase OM2W | Accuracy: 78 | 8 |
| Agreement with outcome human labels | CUAVerifierBench Browserbase OM2W (n=106) | Accuracy: 88 | 8 |
| Agreement with process human labels | CUAVerifierBench (Internal Dataset (n=140)) | Accuracy: 81 | 8 |
| Agreement with outcome human labels | CUAVerifierBench Internal Dataset | Accuracy: 81 | 8 |