| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Long-context evaluation (Financial) | Loong Fin | Fin Judge Score | 58.8 | 13 |
| Overall | Loong Set 4: 200K–250K Tokens | LLM Score | 54.62 | 12 |
| Chain-of-reasoning | Loong Set 4: 200K–250K Tokens | LLM Score | 36.17 | 12 |
| Clustering | Loong Set 4: 200K–250K Tokens | LLM Score | 57.53 | 12 |
| Comparison | Loong Set 4: 200K–250K Tokens | LLM Score | 55.8 | 12 |
| Spotting | Loong Set 4: 200K–250K Tokens | LLM Score | 57.74 | 12 |
| Overall | Loong Set 3: 100K–200K Tokens | LLM Score | 58.86 | 12 |
| Chain-of-reasoning | Loong Set 3: 100K–200K Tokens | LLM Score | 0.5217 | 12 |
| Clustering | Loong Set 3: 100K–200K Tokens | LLM Score | 58.85 | 12 |
| Comparison | Loong Set 3: 100K–200K Tokens | LLM Score | 57.84 | 12 |
| Spotting | Loong Set 3: 100K–200K Tokens | LLM Score | 0.6862 | 12 |
| Overall | Loong Set 2: 50K–100K Tokens | LLM Score | 0.6361 | 12 |
| Chain-of-reasoning | Loong Set 2: 50K–100K Tokens | LLM Score | 58.23 | 12 |
| Clustering | Loong Set 2: 50K–100K Tokens | LLM Score | 61.67 | 12 |
| Comparison | Loong Set 2: 50K–100K Tokens | LLM Score | 64.34 | 12 |
| Spotting | Loong Set 2: 50K–100K Tokens | LLM Score | 69.92 | 12 |
| Overall | Loong Set 1: 10K–50K Tokens | LLM Score | 71 | 12 |
| Chain-of-reasoning | Loong Set 1: 10K–50K Tokens | LLM Score | 70.31 | 12 |
| Clustering | Loong Set 1: 10K–50K Tokens | LLM Score | 0.6536 | 12 |
| Comparison | Loong Set 1: 10K–50K Tokens | LLM Score | 75.65 | 12 |
| Spotting | Loong Set 1: 10K–50K Tokens | LLM Score | 0.766 | 12 |
| Long-Context Reasoning | LOONG | Accuracy | 65.43 | 11 |
| Structured Information Extraction | Loong Finance (test) | Spotlight Locating (AS) | 83.97 | 10 |
| Structured output generation for long-document QA | Loong Finance | Spotlight Locating AS | 84.42 | 9 |
| Structured Data Extraction and Reasoning | Loong | Spotlight Locating Accuracy (AS) | 73.95 | 8 |