General AI Assistant Tasks on GAIA Out-of-Distribution
[Chart: Accuracy (other metrics: Relevance, Full Coverage). Top entry: Claude Sonnet 4, Accuracy 47, Jul 22, 2025.]
Evaluation Results
Method                              Model Category  Date     Accuracy  Relevance  Full Coverage
Claude Sonnet 4                     Closed-...      2025.07  47        74         26
GPT-4.1                             Closed-...      2025.07  37        46         54
Deliberative Searcher-72B           70B Mod...      2025.07  35        78         6
GPT-4o                              Closed-...      2025.07  26        77         23
Deliberative Searcher-DeepSeek-70B  70B Mod...      2025.07  24        78         11
R1-Searcher-7B                      7B Mode...      2025.07  20        35         65
DeepSeek-R1-Distill-70B             70B Mod...      2025.07  18        16         84
ReSearch-7B                         7B Models       2025.07  16        22         78
Deliberative Searcher-7B            7B Mode...      2025.07  16        90         1
Deliberative Searcher-7B            7B Mode...      2025.07  15        89         1
Qwen2.5-VL-72B                      70B Mod...      2025.07  14        39         61
Qwen2.5-VL-7B                       7B Mode...      2025.07  13        43         57
InternVL3-78B                       70B Mod...      2025.07  12        48         52
Search-R1-7B                        7B Mode...      2025.07  10        44         56