Share your thoughts, 1 month free Claude Pro on usSee more
WorkDL logo mark

DeepResearch

Benchmarks

Task NameDataset NameSOTA ResultTrend
Deep ResearchDeepResearch Bench
RACE Overall54.02
31
Judge Agreement AccuracyDeepResearch 1319 queries (test)
Agreement Accuracy74.5
19
Long-form deep researchDeepResearch Bench (test)
Overall Score48.24
13
Question AnsweringDeepResearch
HotpotQA Score44.7
12
Deep ResearchDeepResearch benchmark
Average Score53.4
8
Multimodal Report GenerationDeepResearch Bench
DLB3.72
7
Multi-hop Question AnsweringDeepResearch-9K 29 (test)
P-hat0.45
6
Showing 7 of 7 rows