| Dataset Name | SOTA Method | Metric | Value | Trend | |
|---|---|---|---|---|---|
| Evaluation Dataset (Unseen Average) | Mistral-7B | Score | 42.86 | 18 | 3mo ago |
| Evaluation Dataset Seen Average | Mistral-7B | Score | 62.34 | 18 | 3mo ago |
| Evaluation Dataset Unseen (Fold 3) | Qwen2.5-7B | Score | 0.4022 | 18 | 3mo ago |
| Evaluation Dataset (Fold 3 Seen) | LLaMA-3-8B + COGLM | Score | 66.69 | 18 | 3mo ago |
| Evaluation Dataset Unseen (Fold 2) | Mistral-7B | Score | 50 | 18 | 3mo ago |
| Evaluation Dataset (Fold 2 Seen) | Gemma-7B + COGLM | Score | 63.63 | 18 | 3mo ago |
| Evaluation Dataset Unseen (Fold 1) | DeepSeek V3 | Score | 0.4818 | 18 | 3mo ago |
| Evaluation Dataset (Fold 1 Seen) | Mistral-7B | Score | 0.6191 | 18 | 3mo ago |
| Evaluation Dataset (Full) | Gemma-7B + COGLM | Score | 0.6379 | 18 | 3mo ago |
| COGS | GRPO-Binary | Exact Match Accuracy | 83.9 | 6 | 27d ago |
| SCAN | GRPO-Composite | Length | 23.44 | 6 | 27d ago |
| Integration | PRECEPT | P1 Score | 49 | 6 | 2mo ago |
| Logistics | PRECEPT | P1 | 100 | 6 | 2mo ago |
| ColorMNIST 40 OOD compositions (Generalization) | PoE + LoRA | FID | 84.7 | 5 | 23d ago |
| ColorMNIST Faithfulness 40 OOD compositions | PoE + LoRA | FID | 62.6 | 5 | 23d ago |