Macro: Enhancing Multilingual Counterfactual Explanations through Alignment-as-Preference Optimization

About

Self-generated counterfactual explanations (SCEs) are minimally modified inputs (minimality) generated by large language models (LLMs) that flip their own predictions (validity), offering a causally grounded approach to unraveling black-box LLM behavior. Yet extending them beyond English remains challenging: existing methods struggle to produce valid SCEs in non-dominant languages, and a persistent trade-off between validity and minimality undermines explanation quality. We introduce Macro, a preference alignment framework that applies Direct Preference Optimization (DPO) to multilingual SCE generation, using a composite scoring function to construct preference pairs that effectively translate the trade-off into measurable preference signals. Experiments across four LLMs and seven typologically diverse languages show that Macro improves validity by 12.55% on average over the chain-of-thought baseline without degrading minimality, while avoiding the severe minimality violations of the translation-based baseline. Compared to supervised fine-tuning, Macro achieves superior performance on both metrics, confirming that explicit preference optimization is essential for balancing this trade-off. Further analyses reveal that Macro increases cross-lingual perturbation alignment and mitigates common generation errors. Our results highlight preference optimization as a promising direction for enhancing multilingual model explanations.
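
As a rough illustration of how a composite score might turn the validity/minimality trade-off into preference pairs, here is a minimal Python sketch. It is not the authors' implementation: the scoring formula (a validity flag plus a weighted token-level closeness term), the weight value, and the names Candidate, composite_score, and build_preference_pair are all assumptions made for this example. The abstract only states that a composite scoring function is used to build pairs for DPO.

```python
from __future__ import annotations

from dataclasses import dataclass


def token_edit_distance(a: list[str], b: list[str]) -> int:
    """Levenshtein distance over tokens (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


@dataclass
class Candidate:
    text: str
    flipped: bool  # True if the model's own prediction changed on this text


def composite_score(original: str, cand: Candidate, weight: float = 0.5) -> float:
    """Hypothetical score: validity (0 or 1) plus weighted closeness to the original."""
    src, tgt = original.split(), cand.text.split()
    dist = token_edit_distance(src, tgt)
    closeness = 1.0 - dist / max(len(src), len(tgt), 1)
    return float(cand.flipped) + weight * closeness


def build_preference_pair(
    original: str, candidates: list[Candidate], weight: float = 0.5
) -> tuple[Candidate, Candidate] | None:
    """Return (chosen, rejected): the highest- and lowest-scoring candidates.

    Returns None when there are fewer than two candidates or when all scores
    tie, since such a pair carries no preference signal.
    """
    if len(candidates) < 2:
        return None
    ranked = sorted(candidates, key=lambda c: composite_score(original, c, weight))
    best, worst = ranked[-1], ranked[0]
    if composite_score(original, best, weight) == composite_score(original, worst, weight):
        return None
    return best, worst


original = "the team won the match"
candidates = [
    Candidate("the team lost the match", flipped=True),   # valid and minimal
    Candidate("a completely different sentence about politics", flipped=True),  # valid, not minimal
    Candidate("the team won the game", flipped=False),    # minimal, not valid
]
pair = build_preference_pair(original, candidates)
```

Under these assumed weights, the valid and minimal edit becomes the chosen response and the invalid edit becomes the rejected one. The resulting (prompt, chosen, rejected) triples are the kind of input a DPO trainer consumes.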

Yilong Wang, Qianli Wang, Bohao Chu, Yihong Liu, Jing Yang, Simon Ostermann• 2026

Related benchmarks

Task	Dataset	Result
Counterfactual Generation	SIB200	SLFR 86.7	85
Counterfactual Generation	Taxi1500	SLFR 93.7	67
Multilingual Evaluation	SIB200	HLFR 88.9	56
Multilingual Evaluation	Taxi1500	HLFR 75.7	56
Evaluation	SIB200 (test)	SLFR 74	24
Evaluation	TAXI1500 (test)	SLFR 64.9	12

