Long-form research on HealthBench
Top Overall Score: 59.5, GPT-5 + Search (May 11, 2026)
Updated 22d ago
Evaluation Results
| Method | Attributes | Date | Overall Score |
| --- | --- | --- | --- |
| GPT-5 + Search | Category=Closed Deep R... | 2026.05 | 59.5 |
| OpenAI Deep Research | Category=Closed Deep R... | 2026.05 | 53.8 |
| DR Tulu-8B (RL, 1900 steps) | Category=Open Deep Res... | 2026.05 | 50.2 |
| RubricEM-8B (RL, 1400 steps) | Backbone=8B, Training=... | 2026.05 | 49.3 |
| Gemini 3.1 Pro + Search | Category=Closed Deep R... | 2026.05 | 47.5 |
| Tongyi DeepResearch-30B-A3B | Category=Open Deep Res... | 2026.05 | 46.2 |
| WebThinker-32B-DPO | Category=Fixed Pipelin... | 2026.05 | 39.4 |
| RubricEM-8B (SFT) | Backbone=8B, Training=SFT | 2026.05 | 39 |
| DR Tulu-8B (SFT) | Category=Open Deep Res... | 2026.05 | 38.1 |
| WebThinker QwQ-32B | Category=Fixed Pipelin... | 2026.05 | 36.5 |
| WebExplorer-8B | Category=Open Deep Res... | 2026.05 | 33.7 |
| Ai2 ScholarQA – Claude Sonnet | Category=Fixed Pipelin... | 2026.05 | 32 |
| Qwen3-8B + Our Search | Backbone=Qwen3-8B, Sea... | 2026.05 | 24.5 |
| Search-R1-7B | Category=Open Deep Res... | 2026.05 | -0.1 |