Can Post-Training Transform LLMs into Causal Reasoners?

About

Causal inference is essential for decision-making but remains challenging for non-experts. While large language models (LLMs) show promise in this domain, their precise causal estimation capabilities are still limited, and the impact of post-training on these abilities is insufficiently explored. This paper examines the extent to which post-training can enhance LLMs' capacity for causal inference. We introduce CauGym, a comprehensive dataset comprising seven core causal tasks for training and five diverse test sets. Using this dataset, we systematically evaluate five post-training approaches: SFT, DPO, KTO, PPO, and GRPO. Across five in-domain and four existing benchmarks, our experiments demonstrate that appropriate post-training enables smaller LLMs to perform causal inference competitively, often surpassing much larger models. Our 14B-parameter model achieves 93.5% accuracy on the CaLM benchmark, compared with 55.4% for OpenAI o3. Furthermore, the post-trained LLMs exhibit strong generalization and robustness under real-world conditions such as distribution shifts and noisy data. Collectively, these findings provide the first systematic evidence that targeted post-training can produce reliable and robust LLM-based causal reasoners. Our data and GRPO model are available at https://github.com/OpenCausaLab/CauGym.

Junqi Chen, Sirui Chen, Chaochao Lu • 2026

Related benchmarks

Task	Dataset	Metric	Result
Causal Inference	CounterBench	Accuracy	64.6	54
Causal Reasoning	Corr2Cause	Accuracy	39.4	22
Causal Reasoning	CLadder	Accuracy	76.7	20
Causal Reasoning	ExecCF	Accuracy	69.6	14
Causal Reasoning	Com2	Accuracy	77.2	14
Causal Reasoning	CaLM	Accuracy	67.1	14
Causal Reasoning	BBEH	Accuracy (Causal Reasoning)	50	14
Causal Inference	CaLM (test)	ATE	0.99	12
Causal Reasoning	CaLM Mathematical	Accuracy	93.5	3
Causal Reasoning	CausalProbe-E	Accuracy	80.5	3

