Interactive Agent Task on TRIP-Bench (Hard FIT)
Top Loose Success Rate: 18 (GPT-5.2)
[Chart: Loose Success Rate and Strict Success Rate; data point dated Feb 2, 2026]
Updated 3mo ago
Evaluation Results
| Method | Thinking mode | Date | Loose Success Rate | Strict Success Rate |
|---|---|---|---|---|
| GPT-5.2 | w/ thinking | 2026.02 | 18 | 0 |
| DeepSeek-V3.2 | w/ thinking | 2026.02 | 14 | 0 |
| Claude-Sonnet-4.5 | w/ thinking | 2026.02 | 10 | 0 |
| DeepSeek-V3.2 | w/o thin... | 2026.02 | 8 | 0 |
| GPT-5.2 | w/o thin... | 2026.02 | 6 | 0 |
| Claude-Sonnet-4.5 | w/o thin... | 2026.02 | 6 | 0 |
| Gemini-3-Flash | w/o thin... | 2026.02 | 6 | 0 |
| Gemini-3-Pro | w/o thin... | 2026.02 | 4 | 0 |
| Qwen3-235B-A22B-Instruct-2507 | w/o thin... | 2026.02 | 2 | 0 |
| Kimi-K2-0905-Preview | w/o thin... | 2026.02 | 0 | 0 |
| GLM-4.7 | w/o thin... | 2026.02 | 0 | 0 |
| Qwen3-235B-A22B-Thinking-2507 | w/ thinking | 2026.02 | 0 | 0 |
| Kimi-K2-Thinking | w/ thinking | 2026.02 | 0 | 0 |
| Gemini-3-Pro | w/ thinking | 2026.02 | 0 | 0 |
| GLM-4.7 | w/ thinking | 2026.02 | 0 | 0 |
| Gemini-3-Flash | w/ thinking | 2026.02 | 0 | 0 |