Interactive Agent Task on TRIP-Bench (Hard FIT)

Loose Success Rate: 18

GPT-5.2

Updated 4mo ago

Evaluation Results

Method		Loose Success Rate	Links
GPT-5.2 2026.02		18	0
DeepSeek-V3.2 2026.02		14	0
Claude-Sonnet-4.5 2026.02		10	0
DeepSeek-V3.2 2026.02		8	0
GPT-5.2 2026.02		6	0
Claude-Sonnet-4.5 2026.02		6	0
Gemini-3-Flash 2026.02		6	0
Gemini-3-Pro 2026.02		4	0
Qwen3-235B-A22B-Instruct-2507 2026.02		2	0
Kimi-K2-0905-Preview 2026.02		0	0
GLM-4.7 2026.02		0	0
Qwen3-235B-A22B-Thinking-2507 2026.02		0	0
Kimi-K2-Thinking 2026.02		0	0
Gemini-3-Pro 2026.02		0	0
GLM-4.7 2026.02		0	0
Gemini-3-Flash 2026.02		0	0