Multi-document Question Answering on Loong
Top Average Score: 68 (Baseline)
[Chart: Average Score over time, Mar 9, 2026 to Apr 6, 2026. Available metrics: Average Score, Average Cost ($), Score per Dollar.]
Updated 13d ago
Evaluation Results
| Method | Date | Average Score | Average Cost ($) | Score per Dollar |
|---|---|---|---|---|
| Baseline (Context=Full Context) | 2026.03 | 68 | 0.273 | 249.1 |
| SPD-RAG | 2026.03 | 58.1 | 0.103 | 564.1 |
| ProxyCoT-RL (Model=Qwen3-4B-Instruct) | 2026.04 | 42.51 | 40.83 | - |
| Normal RAG | 2026.03 | 33 | 0.08 | 412.5 |
| Agentic RAG | 2026.03 | 32.8 | 0.098 | 334.7 |
| Qwen3-4B-Instruct (Training=Zero-shot) | 2026.04 | 24.91 | 37.76 | - |
| ProxyCoT-RL (Model=Gemma3-4B-IT) | 2026.04 | 24.32 | 32.05 | - |
| Gemma3-4B-IT (Training=Zero-shot) | 2026.04 | 3.55 | 25.85 | - |
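
The Score per Dollar values listed above appear consistent with Average Score divided by Average Cost ($), rounded to one decimal place. The page does not state this definition, so it is an assumption. A minimal sketch that reproduces the four entries reporting a Score per Dollar (the function name `score_per_dollar` is hypothetical; entries showing "-" are left out):

```python
# (Average Score, Average Cost in $) for entries that report a Score per Dollar,
# copied from the results above.
rows = {
    "Baseline (Full Context)": (68.0, 0.273),
    "SPD-RAG": (58.1, 0.103),
    "Normal RAG": (33.0, 0.08),
    "Agentic RAG": (32.8, 0.098),
}


def score_per_dollar(score: float, cost: float) -> float:
    """Assumed definition: average score / average cost, rounded to 1 decimal."""
    return round(score / cost, 1)


for name, (score, cost) in rows.items():
    print(f"{name}: {score_per_dollar(score, cost)}")
```

Under this reading, SPD-RAG has the highest Score per Dollar even though the full-context Baseline has the highest Average Score.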