Long Context Understanding on HELMET
[Chart: Accuracy over time, Dec 30, 2025 to Mar 31, 2026. Top result: Synthetic Reasoning, 68.5 Accuracy.]
Updated 13d ago
Evaluation Results

| Method | Details | Date | Accuracy |
|---|---|---|---|
| Synthetic Reasoning | Model family=Qwen3 VL | 2026.03 | 68.5 |
| Qwen3 VL 235B A22B Instruct | Model family=Qwen3 VL | 2026.03 | 67.6 |
| No-think | Model family=Qwen3 VL | 2026.03 | 65.9 |
| Plain Distillation | Model family=Qwen3 VL | 2026.03 | 65.7 |
| LongCat-Flash Exp-Chat | Evaluation Mode=Chat | 2025.12 | 64.7 |
| GLM 4.6 | Evaluation Mode=Chat | 2025.12 | 64.6 |
| Qwen Thinking Traces | Model family=Mistral | 2026.03 | 64.1 |
| Qwen3 VL 32B Instruct | Model family=Qwen3 VL | 2026.03 | 63 |
| LongPO | Model family=Qwen3 VL | 2026.03 | 62.9 |
| Synthetic Reasoning | Model family=Mistral | 2026.03 | 62.6 |
| DeepSeek V3.2 | Evaluation Mode=Chat | 2025.12 | 59.5 |
| LongCat-Flash Chat | Evaluation Mode=Chat | 2025.12 | 59.1 |
| No-think | Model family=Mistral | 2026.03 | 55.8 |
| Plain Distillation | Model family=Mistral | 2026.03 | 53.1 |
| Mistral 3.1 Small 24B | Model family=Mistral | 2026.03 | 37 |