Interactive Agent Task on TRIP-Bench Easy
[Chart: Loose Success Count and Strict Success Count over time. Highlighted point: DeepSeek-V3.2, Loose Success Count 71, Feb 2, 2026.]
Evaluation Results
| Method | Thinking mode | Date | Loose Success Count | Strict Success Count |
|---|---|---|---|---|
| DeepSeek-V3.2 | w/ thinking | 2026.02 | 71 | 31 |
| GPT-5.2 | w/ thinking | 2026.02 | 66 | 49 |
| Claude-Sonnet-4.5 | w/ thinking | 2026.02 | 58 | 27 |
| Gemini-3-Pro | w/o thin... | 2026.02 | 44 | 12 |
| Gemini-3-Flash | w/ thinking | 2026.02 | 44 | 25 |
| Gemini-3-Pro | w/ thinking | 2026.02 | 42 | 11 |
| DeepSeek-V3.2 | w/o thin... | 2026.02 | 39 | 5 |
| GLM-4.7 | w/ thinking | 2026.02 | 38 | 16 |
| Claude-Sonnet-4.5 | w/o thin... | 2026.02 | 36 | 7 |
| Gemini-3-Flash | w/o thin... | 2026.02 | 36 | 22 |
| Kimi-K2-Thinking | w/ thinking | 2026.02 | 35 | 5 |
| GLM-4.7 | w/o thin... | 2026.02 | 34 | 0 |
| GPT-5.2 | w/o thin... | 2026.02 | 24 | 2 |
| Qwen3-235B-A22B-Instruct-2507 | w/o thin... | 2026.02 | 16 | 2 |
| Kimi-K2-0905-Preview | w/o thin... | 2026.02 | 13 | 0 |
| Qwen3-235B-A22B-Thinking-2507 | w/ thinking | 2026.02 | 0 | 0 |