| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|---|
| Agreement with process human labels | CUAVerifierBench Browserbase OM2W | Accuracy: 78 | 8 |
| Agreement with outcome human labels | CUAVerifierBench Browserbase OM2W (n=106) | Accuracy: 88 | 8 |
| Agreement with process human labels | CUAVerifierBench Internal Dataset (n=140) | Accuracy: 81 | 8 |
| Agreement with outcome human labels | CUAVerifierBench Internal Dataset | Accuracy: 81 | 8 |