Interactive Agent Task on TRIP-Bench Mid
[Chart: Loose Success Rate and Strict Success Rate by date. Top entry: GPT-5.2, Loose Success Rate 55, Feb 2, 2026.]
Evaluation Results
| Method | Thinking mode | Date | Loose Success Rate | Strict Success Rate |
|---|---|---|---|---|
| GPT-5.2 | w/ thinking | 2026.02 | 55 | 13 |
| DeepSeek-V3.2 | w/ thinking | 2026.02 | 41 | 9 |
| Claude-Sonnet-4.5 | w/ thinking | 2026.02 | 31 | 6 |
| GLM-4.7 | w/ thinking | 2026.02 | 29 | 0 |
| Gemini-3-Flash | w/ thinking | 2026.02 | 25 | 0 |
| GLM-4.7 | w/o thin... | 2026.02 | 20 | 0 |
| DeepSeek-V3.2 | w/o thin... | 2026.02 | 20 | 3 |
| Claude-Sonnet-4.5 | w/o thin... | 2026.02 | 18 | 0 |
| Gemini-3-Pro | w/ thinking | 2026.02 | 16 | 0 |
| GPT-5.2 | w/o thin... | 2026.02 | 14 | 0 |
| Gemini-3-Flash | w/o thin... | 2026.02 | 11 | 0 |
| Gemini-3-Pro | w/o thin... | 2026.02 | 9 | 0 |
| Kimi-K2-Thinking | w/ thinking | 2026.02 | 8 | 4 |
| Qwen3-235B-A22B-Instruct-2507 | w/o thin... | 2026.02 | 5 | 0 |
| Kimi-K2-0905-Preview | w/o thin... | 2026.02 | 0 | 0 |
| Qwen3-235B-A22B-Thinking-2507 | w/ thinking | 2026.02 | 0 | 0 |