
BrowseComp+

Benchmarks

Task Name                                    Dataset Name                         Metric             SOTA Result  Trend
Web Search                                   BrowseComp-Plus                      Pass@3             53.02        60
Deep research agents / Multi-step reasoning  BrowseComp-Plus OOD                  Success Rate (SR)  54.6         24
Complex Tasks                                BrowseComp+ Complex Tasks 2nd Pass   Accuracy           89           16
Long-context reasoning                       BrowseComp+ 1K documents             Accuracy           94.6         16
Web Browsing and Tool Use                    BrowseComp+ original (test)          Performance (%)    38.72        15
Complex Task Solving                         BrowseComp+ Naive Stream             Accuracy (1st-Q)   55           8
Complex Task Solving                         BrowseComp+ Compositional Stream     Accuracy (1st-Q)   90           8
Subtasks                                     BrowseComp+ Subtasks 1st Pass        Accuracy           97.7         8
Web Browsing Reasoning                       BrowseComp+                          Avg@8 Accuracy     11           7
Scaling Model Validation                     BrowseComp-Plus Out-of-sample (val)  MAE                0.071        1