Multi-step Reasoning and Factuality on FRAMES (Pass@1)
[Chart: Pass@1 by date. Highlighted point: Tongyi DeepResearch, 90.6 Pass@1, Oct 28, 2025.]
Updated 15d ago
Evaluation Results
Method               Agent Type       Date     Pass@1
Tongyi DeepResearch  DeepResearc...   2025.10  90.6
OpenAI o3            LLM-based R...   2025.10  84
DeepSeek-V3.1        LLM-based R...   2025.10  83.7
Claude-4-Sonnet      LLM-based R...   2025.10  80.7
GLM 4.5              LLM-based R...   2025.10  78.9
Kimi Researcher      DeepResearc...   2025.10  78.8
Kimi K2              LLM-based R...   2025.10  72