| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Long-context reasoning | OOLONG | Accuracy | 68.4 | 37 |
| Long-context reasoning | OOLONG trec_coarse | Score | 86.6 | 28 |
| Long-context reasoning | OOLONG | Latency (s) | 7.1 | 27 |
| Long-Context Reasoning | Oolong-Synth | Accuracy | 78.41 | 11 |
| Long-document Question Answering | Oolong | Accuracy | 68 | 10 |
| Reasoning | Oolong real 2025 (test) | Score | 15.1 | 9 |
| Reasoning | Oolong real | Score | 0.151 | 9 |
| Long-context Question Answering | Oolong Real | Score | 37.46 | 9 |
| Long-context Question Answering | Oolong Synthetic | Score | 71.75 | 8 |
| Long-context Classification | OOLONG (test) | TREC-Q-coarse Accuracy | 58.1 | 6 |
| Long-context reasoning | OOLONG-REAL 650 samples (175K bucket) | Average Reward | 0.249 | 2 |