Chain-of-reasoning on Loong Set 1: 10K–50K Tokens
[Chart: LLM Score and Exact Match. Highlighted: Llama-3.3-70B, LLM Score 70.31, Jan 7, 2026]
Updated 4d ago
Evaluation Results
| Method | Setting | Date | LLM Score | Exact Match |
|---|---|---|---|---|
| Llama-3.3-70B | Condition=Full Context | 2026.01 | 70.31 | 37 |
| Disco-RAG | Base Model=Llama-3.3-70B | 2026.01 | 68.3 | 38 |
| Disco-RAG | Base Model=Llama-3.1-8B | 2026.01 | 68 | 34 |
| StructRAG | Condition=SOTA Results | 2026.01 | 67.84 | 34 |
| Disco-RAG | Base Model=Qwen2.5-72B | 2026.01 | 67.73 | 35 |
| Qwen2.5-72B | Condition=Full Context | 2026.01 | 66.51 | 36 |
| Llama-3.3-70B | Condition=Standard RAG | 2026.01 | 66.48 | 36 |
| Llama-3.1-8B | Condition=Full Context | 2026.01 | 65.66 | 37 |
| Qwen2.5-72B | Condition=Standard RAG | 2026.01 | 64.67 | 34 |
| RQ-RAG | Condition=SOTA Results | 2026.01 | 58.96 | 25 |
| Llama-3.1-8B | Condition=Standard RAG | 2026.01 | 58.76 | 32 |
| GraphRAG | Condition=SOTA Results | 2026.01 | 54.29 | 43 |