In-distribution Tool Use on DIVE-Eval
Top result: Gemini-3-Pro, Success Rate 45.3
[Chart: Success Rate by date; data point at Mar 10, 2026]
Updated 1mo ago
Evaluation Results
Method             Tags                       Date     Success Rate
Gemini-3-Pro       Category=Frontier (≫8B...  2026.03  45.3
Claude-4-Sonnet    Category=Frontier (≫8B...  2026.03  44.8
DIVE-8B (RL)       Category=Ours, Tempera...  2026.03  42.5
GPT-OSS-120B       Category=Frontier (≫8B...  2026.03  40.5
DeepSeek-V3.2-Exp  Category=Frontier (≫8B...  2026.03  40.4
DIVE-8B (SFT)      Category=Ours, Tempera...  2026.03  35.4
Kimi-K2-0905       Category=Frontier (≫8B...  2026.03  32.9
Gemini-2.5-Pro     Category=Frontier (≫8B...  2026.03  29.1
WebExplorer-8B     Category=8B Baselines,...  2026.03  19.1
EnvScaler-8B       Category=8B Baselines,...  2026.03  15.4
SWE-Dev-8B         Category=8B Baselines,...  2026.03  13.8
Qwen3-8B (base)    Category=Ours, Tempera...  2026.03  13