Interactive Agent Task on TRIP-Bench Overall
[Chart: Loose Success Score and Strict Success Score over time. Highlighted point: GPT-5.2, Loose Success Score 45, Feb 2, 2026.]
Updated 3mo ago
Evaluation Results
| Method | Thinking mode | Date | Loose Success Score | Strict Success Score |
| --- | --- | --- | --- | --- |
| GPT-5.2 | w/ thinking | 2026.02 | 45 | 18.5 |
| DeepSeek-V3.2 | w/ thinking | 2026.02 | 40 | 10.5 |
| Claude-Sonnet-4.5 | w/ thinking | 2026.02 | 32 | 8.5 |
| Gemini-3-Flash | w/ thinking | 2026.02 | 23.3 | 6.3 |
| GLM-4.7 | w/ thinking | 2026.02 | 20.3 | 4 |
| Gemini-3-Pro | w/ thinking | 2026.02 | 20 | 2.8 |
| DeepSeek-V3.2 | w/o thin... | 2026.02 | 18.5 | 2.3 |
| Gemini-3-Pro | w/o thin... | 2026.02 | 18 | 3 |
| Claude-Sonnet-4.5 | w/o thin... | 2026.02 | 17.3 | 1.8 |
| Gemini-3-Flash | w/o thin... | 2026.02 | 17.3 | 5.5 |
| GLM-4.7 | w/o thin... | 2026.02 | 14.8 | 0 |
| GPT-5.2 | w/o thin... | 2026.02 | 13.3 | 0.5 |
| Kimi-K2-Thinking | w/ thinking | 2026.02 | 10.8 | 2.3 |
| Qwen3-235B-A22B-Instruct-2507 | w/o thin... | 2026.02 | 5.8 | 0.5 |
| Kimi-K2-0905-Preview | w/o thin... | 2026.02 | 3.3 | 0 |
| Qwen3-235B-A22B-Thinking-2507 | w/ thinking | 2026.02 | 0 | 0 |