| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| Loong Set 4: 200K–250K Tokens | Disco-RAG | LLM Score | 54.62 | 12 | 4d ago |
| Loong Set 3: 100K–200K Tokens | Disco-RAG | LLM Score | 58.86 | 12 | 4d ago |
| Loong Set 2: 50K–100K Tokens | Disco-RAG | LLM Score | 0.6361 | 12 | 4d ago |
| Loong Set 1: 10K–50K Tokens | Disco-RAG | LLM Score | 71 | 12 | 4d ago |
| Principle-based evaluation dataset | | Average | 8.41 | 12 | 4d ago |
| puzzle 4x6 | TRL | Success Rate | 5,100 | 10 | 4d ago |
| puzzle 4x5 | TRL | Success Rate | 9,700 | 10 | 4d ago |
| humanoidmaze giant | | Success Rate | 79 | 10 | 4d ago |
| PDDLLM v1 (test) | | Planning Success Rate | 95.7 | 6 | 4d ago |
| Standard Evaluation Suite | MiniCPM-SALA | Average Score | 0.7653 | 6 | 4d ago |
| AffordPose refined (test) | CRM-PPO | Success Rate | 95 | 4 | 4d ago |
| SemEval 2017 Task 10 ScienceIE (test) | SCIIE | Precision | 48.1 | 2 | 4d ago |