
Can Post-Training Transform LLMs into Causal Reasoners?

About

Causal inference is essential for decision-making but remains challenging for non-experts. While large language models (LLMs) show promise in this domain, their precise causal estimation capabilities are still limited, and the impact of post-training on these abilities is insufficiently explored. This paper examines the extent to which post-training can enhance LLMs' capacity for causal inference. We introduce CauGym, a comprehensive dataset comprising seven core causal tasks for training and five diverse test sets. Using this dataset, we systematically evaluate five post-training approaches: SFT, DPO, KTO, PPO, and GRPO. Across five in-domain and four existing benchmarks, our experiments demonstrate that appropriate post-training enables smaller LLMs to perform causal inference competitively, often surpassing much larger models. Our 14B-parameter model achieves 93.5% accuracy on the CaLM benchmark, compared with 55.4% for OpenAI o3. Furthermore, the post-trained LLMs exhibit strong generalization and robustness under real-world conditions such as distribution shifts and noisy data. Collectively, these findings provide the first systematic evidence that targeted post-training can produce reliable and robust LLM-based causal reasoners. Our data and GRPO model are available at https://github.com/OpenCausaLab/CauGym.

Junqi Chen, Sirui Chen, Chaochao Lu • 2026

Related benchmarks

Task              Dataset            Metric                       Result  Rank
Causal Inference  CounterBench       Accuracy                     64.6    54
Causal Reasoning  Corr2Cause         Accuracy                     39.4    22
Causal Reasoning  CLadder            Accuracy                     76.7    20
Causal Reasoning  ExecCF             Accuracy                     69.6    14
Causal Reasoning  Com2               Accuracy                     77.2    14
Causal Reasoning  CaLM               Accuracy                     67.1    14
Causal Reasoning  BBEH               Accuracy (Causal Reasoning)  50      14
Causal Inference  CaLM (test)        ATE                          0.99    12
Causal Reasoning  CaLM Mathematical  Accuracy                     93.5    3
Causal Reasoning  CausalProbe-E      Accuracy                     80.5    3
Showing 10 of 11 rows
