Enhancing Multilingual Counterfactual Generation through Alignment-as-Preference Optimization

About

Self-generated counterfactual explanations (SCEs) are minimally modified inputs (minimality) generated by large language models (LLMs) that flip their own predictions (validity), offering a causally grounded approach to unraveling black-box LLM behavior. Yet extending them beyond English remains challenging: existing methods struggle to produce valid SCEs in non-dominant languages, and a persistent trade-off between validity and minimality undermines explanation quality. We introduce Macro, a preference alignment framework that applies Direct Preference Optimization (DPO) to multilingual SCE generation, using a composite scoring function to construct preference pairs that effectively translate the trade-off into measurable preference signals. Experiments across four LLMs and seven typologically diverse languages show that Macro improves validity by 12.55% on average over the chain-of-thought baseline without degrading minimality, while avoiding the severe minimality violations of the translation-based baseline. Compared to supervised fine-tuning, Macro achieves superior performance on both metrics, confirming that explicit preference optimization is essential for balancing this trade-off. Further analyses reveal that Macro increases cross-lingual perturbation alignment and mitigates common generation errors. Our results highlight preference optimization as a promising direction for enhancing multilingual model explanations.
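
The idea of turning the validity/minimality trade-off into preference pairs can be illustrated with a minimal sketch. It is not the authors' implementation: the `Candidate` fields, the weighted-sum form of `composite_score`, the `alpha` weight, and the `margin` threshold are all assumptions made for illustration, since the abstract does not specify the scoring function. The `prompt`/`chosen`/`rejected` record layout is a common convention for DPO training data.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Candidate:
    text: str          # candidate counterfactual produced by the model
    flipped: bool      # validity: did the model's own prediction flip?
    edit_ratio: float  # minimality proxy in [0, 1]: fraction of the input changed


def composite_score(c: Candidate, alpha: float = 0.5) -> float:
    """Hypothetical composite score: weighted sum of validity and minimality."""
    validity = 1.0 if c.flipped else 0.0
    minimality = 1.0 - c.edit_ratio  # fewer edits gives a higher score
    return alpha * validity + (1.0 - alpha) * minimality


def build_preference_pairs(prompt, candidates, margin=0.1, alpha=0.5):
    """Pair up candidates whose scores differ by at least `margin` (assumed threshold)."""
    pairs = []
    for a, b in combinations(candidates, 2):
        sa, sb = composite_score(a, alpha), composite_score(b, alpha)
        if abs(sa - sb) < margin:
            continue  # too close to give a clear preference signal
        chosen, rejected = (a, b) if sa > sb else (b, a)
        pairs.append(
            {"prompt": prompt, "chosen": chosen.text, "rejected": rejected.text}
        )
    return pairs
```

Under these assumptions, a valid candidate with few edits is preferred over both an invalid one and a valid one that rewrites most of the input, which is the kind of signal a DPO trainer could then consume.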

Yilong Wang, Qianli Wang, Bohao Chu, Yihong Liu, Jing Yang, Simon Ostermann • 2026

Related benchmarks

Task                       Dataset          Result     Rank
Counterfactual Generation  SIB200           SLFR 86.7  85
Counterfactual Generation  Taxi1500         SLFR 93.7  67
Multilingual Evaluation    SIB200           HLFR 88.9  56
Multilingual Evaluation    Taxi1500         HLFR 75.7  56
Evaluation                 SIB200 (test)    SLFR 74    24
Evaluation                 TAXI1500 (test)  SLFR 64.9  12