Multi-hop Reasoning and Fact-checking on FRAMES
[Chart: Average @3 over time. Best result: Tongyi-DeepResearch-30B, 90.6, Nov 14, 2025]
Evaluation Results
| Method | Attribute | Date | Average @3 |
|---|---|---|---|
| Tongyi-DeepResearch-30B | Type=Research Agents | 2025.11 | 90.6 |
| MiroThinker-v1.0-72B | Parameters=72B | 2025.11 | 87.1 |
| MiroThinker-v1.0-30B | Parameters=30B | 2025.11 | 85.4 |
| Claude-4.5-Sonnet | Type=Foundation Models... | 2025.11 | 85 |
| OpenAI-o3 | Type=Foundation Models... | 2025.11 | 84 |
| DeepSeek-V3.1 | Type=Foundation Models... | 2025.11 | 83.7 |
| SFR-DeepResearch-20B | Type=Research Agents | 2025.11 | 82.8 |
| Claude-4-Sonnet | Type=Foundation Models... | 2025.11 | 80.7 |
| MiroThinker-v1.0-8B | Parameters=8B | 2025.11 | 80.6 |
| DeepSeek-V3.2 | Type=Foundation Models... | 2025.11 | 80.2 |
| Kimi-Researcher | Type=Research Agents | 2025.11 | 78.8 |
| WebExplorer-8B-RL | Type=Research Agents | 2025.11 | 75.7 |
| Kimi-K2-0905 | Type=Foundation Models... | 2025.11 | 58.1 |