
Interactive Reasoning on Timely-Eval

Top Zork1 Score: 34.9 (Gemini2.5-pro)


Evaluation Results

Method	Date	Zork1 Score	(unlabeled)	(unlabeled)	(unlabeled)
Gemini2.5-pro	2026.01	34.9	50.7	34.1	71.9
GPT-5.1(medium)	2026.01	34.1	57.6	24.4	105
TimelyLM-8B	2026.01	27.5	48.5	29.5	88.1
DeepSeek-V3.2	2026.01	24.9	48.7	15.9	63.2
Qwen3-32B	2026.01	14.4	38.2	11.7	70
Qwen3-14B	2026.01	9.8	34.9	9.5	50.5
Qwen3-8B	2026.01	2.3	36	5.2	54.1