
Causal and Downstream Robustness Ablation Suite

Benchmarks

Task Name | Dataset Name | SOTA Result | Trend
Decoding Stability | Causal and Downstream Robustness Ablation Suite (averaged over 4 models) | Decoding Δ%: 0.8 | 14
Span Extraction | Causal and Downstream Robustness Ablation Suite | Span F1: 81 | 14
Tool Use | Causal and Downstream Robustness Ablation Suite (averaged over 4 models) | Tool Hit@1 Δ: 4.1 | 14
Fact-checking | Causal and Downstream Robustness Ablation Suite (averaged over 4 models) | Fact EM Δ: 3.7 | 14
Causal Attribution | Causal and Downstream Robustness Ablation Suite (averaged over LLaMA-3.1 70B, Phi-3 14B, GPT-J 6B, Qwen2.5 3B) | Causal Pass@5: 86 | 14