Long-context reasoning on OfficeQA
[Chart: Accuracy by date. Top result: GEMINI 3.1 FLASH-LITE, 57.14 Accuracy, Apr 6, 2026.]
Updated 11d ago
Evaluation Results

| Method | Method details | Date | Accuracy |
|---|---|---|---|
| GEMINI 3.1 FLASH-LITE | | 2026.04 | 57.14 |
| QWEN3.5-35B-A3B-FP8 | Precision=FP8 | 2026.04 | 55.74 |
| GEMINI-2.5-PRO | | 2026.04 | 53.37 |
| GPT-OSS-20B | Reasoning Effort=High,... | 2026.04 | 46.58 |
| GPT-OSS-20B | Reasoning Effort=High,... | 2026.04 | 37.84 |
| GPT-OSS-120B | | 2026.04 | 33.88 |
| GPT-OSS-20B | Reasoning Effort=Low,... | 2026.04 | 26.53 |
| GPT-OSS-20B | Reasoning Effort=Low,... | 2026.04 | 21.63 |
| QWEN3-4B-INSTRUCT-2507 | Training Status=base | 2026.04 | 14.88 |
| QWEN3-4B-INSTRUCT-2507 | Training Status=SFT (π... | 2026.04 | 13.58 |