| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Decoding Stability | Causal and Downstream Robustness Ablation Suite (averaged over 4 models) | Decoding Δ%: 0.8 | 14 |
| Span Extraction | Causal and Downstream Robustness Ablation Suite | Span F1: 81 | 14 |
| Tool Use | Causal and Downstream Robustness Ablation Suite (averaged over 4 models) | Tool Hit@1 Δ: 4.1 | 14 |
| Fact-checking | Causal and Downstream Robustness Ablation Suite (averaged over 4 models) | Fact EM Δ: 3.7 | 14 |
| Causal Attribution | Causal and Downstream Robustness Ablation Suite (averaged over LLaMA-3.1 70B, Phi-3 14B, GPT-J 6B, Qwen2.5 3B) | Causal Pass@5: 86 | 14 |