| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Long-context evaluation (Financial) | Loong Fin | Fin Judge Score: 58.8 | 13 |
| Overall | Loong Set 4: 200K–250K Tokens | LLM Score: 54.62 | 12 |
| Chain-of-reasoning | Loong Set 4: 200K–250K Tokens | LLM Score: 36.17 | 12 |
| Clustering | Loong Set 4: 200K–250K Tokens | LLM Score: 57.53 | 12 |
| Comparison | Loong Set 4: 200K–250K Tokens | LLM Score: 55.8 | 12 |
| Spotting | Loong Set 4: 200K–250K Tokens | LLM Score: 57.74 | 12 |
| Overall | Loong Set 3: 100K–200K Tokens | LLM Score: 58.86 | 12 |
| Chain-of-reasoning | Loong Set 3: 100K–200K Tokens | LLM Score: 0.5217 | 12 |
| Clustering | Loong Set 3: 100K–200K Tokens | LLM Score: 58.85 | 12 |
| Comparison | Loong Set 3: 100K–200K Tokens | LLM Score: 57.84 | 12 |
| Spotting | Loong Set 3: 100K–200K Tokens | LLM Score: 0.6862 | 12 |
| Overall | Loong Set 2: 50K–100K Tokens | LLM Score: 0.6361 | 12 |
| Chain-of-reasoning | Loong Set 2: 50K–100K Tokens | LLM Score: 58.23 | 12 |
| Clustering | Loong Set 2: 50K–100K Tokens | LLM Score: 61.67 | 12 |
| Comparison | Loong Set 2: 50K–100K Tokens | LLM Score: 64.34 | 12 |
| Spotting | Loong Set 2: 50K–100K Tokens | LLM Score: 69.92 | 12 |
| Overall | Loong Set 1: 10K–50K Tokens | LLM Score: 71 | 12 |
| Chain-of-reasoning | Loong Set 1: 10K–50K Tokens | LLM Score: 70.31 | 12 |
| Clustering | Loong Set 1: 10K–50K Tokens | LLM Score: 0.6536 | 12 |
| Comparison | Loong Set 1: 10K–50K Tokens | LLM Score: 75.65 | 12 |
| Spotting | Loong Set 1: 10K–50K Tokens | LLM Score: 0.766 | 12 |