Interactive Agent Task on TRIP-Bench Easy

Loose Success Count: 71

DeepSeek-V3.2

Updated 5mo ago

Evaluation Results

Method	Links	Loose Success Count	(unlabeled)
DeepSeek-V3.2 2026.02		71	31
GPT-5.2 2026.02		66	49
Claude-Sonnet-4.5 2026.02		58	27
Gemini-3-Pro 2026.02		44	12
Gemini-3-Flash 2026.02		44	25
Gemini-3-Pro 2026.02		42	11
DeepSeek-V3.2 2026.02		39	5
GLM-4.7 2026.02		38	16
Claude-Sonnet-4.5 2026.02		36	7
Gemini-3-Flash 2026.02		36	22
Kimi-K2-Thinking 2026.02		35	5
GLM-4.7 2026.02		34	0
GPT-5.2 2026.02		24	2
Qwen3-235B-A22B-Instruct-2507 2026.02		16	2
Kimi-K2-0905-Preview 2026.02		13	0
Qwen3-235B-A22B-Thinking-2507 2026.02		0	0