Long-context reasoning on BrowseComp+ 1K documents
[Chart: Accuracy by date. Highlighted point: SRLM (no sub-calls), 94.6, Mar 7, 2026]
Evaluation Results
| Method | Backbone | Date | Accuracy |
| --- | --- | --- | --- |
| SRLM (no sub-calls) | GPT-5 | 2026.03 | 94.6 |
| SRLM | GPT-5 | 2026.03 | 92.4 |
| RLM (no sub-calls) | GPT-5 | 2026.03 | 89.7 |
| RLM | GPT-5 | 2026.03 | 86 |
| Summary agent | GPT-5 | 2026.03 | 70.5 |
| SRLM | Qwen3-Coder-480B | 2026.03 | 59.7 |
| CodeAct (+ BM25) | GPT-5 | 2026.03 | 51 |
| SRLM (no sub-calls) | Qwen3-Coder-480B | 2026.03 | 50.1 |
| Summary agent | Qwen3-Coder-480B | 2026.03 | 38 |
| RLM | Qwen3-Coder-480B | 2026.03 | 37.1 |
| RLM (no sub-calls) | Qwen3-Coder-480B | 2026.03 | 36.3 |
| CodeAct (+ BM25) | Qwen3-Coder-480B | 2026.03 | 12.7 |
| Base Model | Qwen3-Coder-480B | 2026.03 | 0 |
| CodeAct (+ sub-calls) | Qwen3-Coder-480B | 2026.03 | 0 |
| Base Model | GPT-5 | 2026.03 | 0 |
| CodeAct (+ sub-calls) | GPT-5 | 2026.03 | 0 |