
Translate Policy to Language: Flow Matching Generated Rewards for LLM Explanations

About

As humans increasingly share environments with diverse agents powered by RL, LLMs, and beyond, the ability to explain agent policies in natural language is vital for reliable coexistence. We introduce a general-purpose framework that trains explanation-generating LLMs via reinforcement learning from AI feedback, with distributional rewards generated by generative continuous normalizing flows (CNFs). CNFs capture the pluralistic and probabilistic nature of human judgments about explanations. Moreover, under mild assumptions, CNFs provably bound deviations from true human reward distributions when trained on noisy proxy rewards from LLMs. We design a specialized CNF architecture that selectively attends to linguistic cues in the decision context and explanations when generating rewards. Human and LLM evaluators find that our method delivers explanations that enable more accurate predictions of true agent decisions, exhibit greater logical soundness and actionability, and impose lower cognitive load than explanations trained with proxy LLM rewards or state-of-the-art RLHF and RLAIF baselines.
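As a rough illustration of the reward-generation idea only (not the paper's architecture), the sketch below fits a toy conditional flow-matching model to samples from a hypothetical bimodal "human reward" distribution for a single (context, explanation) pair, then Euler-integrates the learned velocity field from noise to generate new reward samples. The reward distribution, the per-time-bin affine regressor, and all constants are illustrative stand-ins for the paper's neural CNF, which conditions on linguistic cues in the context and explanation.

```python
# Toy sketch of flow matching for distributional rewards (illustrative only:
# the paper's CNF is a neural velocity field conditioned on linguistic cues;
# a per-time-bin affine regressor and a made-up reward distribution stand in
# for both here).
import numpy as np

rng = np.random.default_rng(0)

def sample_rewards(n):
    # Hypothetical bimodal "human reward" distribution for one
    # (context, explanation) pair, mimicking pluralistic judgments.
    hi = rng.random(n) < 0.6
    return np.where(hi, rng.normal(0.8, 0.05, n), rng.normal(0.3, 0.1, n))

# Flow-matching training pairs: linear path x_t = (1-t) x0 + t x1,
# conditional target velocity v* = x1 - x0.
n = 20000
x0 = rng.normal(0.0, 1.0, n)   # noise samples
x1 = sample_rewards(n)         # reward samples
t = rng.random(n)
xt = (1.0 - t) * x0 + t * x1
v_target = x1 - x0

# "Train" the velocity field: an affine model a_k + b_k * x per time bin,
# fit by least squares (standing in for the neural CNF).
K = 50
params = np.zeros((K, 2))
for k in range(K):
    m = (t >= k / K) & (t < (k + 1) / K)
    X = np.stack([np.ones(m.sum()), xt[m]], axis=1)
    params[k], *_ = np.linalg.lstsq(X, v_target[m], rcond=None)

def generate(n_samples):
    # Sample rewards by Euler-integrating dx/dt = v(x, t) from noise.
    x = rng.normal(0.0, 1.0, n_samples)
    for a, b in params:
        x = x + (a + b * x) / K
    return x

samples = generate(5000)
# The generated mean should land near the data mean 0.6*0.8 + 0.4*0.3 = 0.6.
print(samples.mean(), samples.std())
```

Because the toy velocity field is affine in x, the generated distribution matches the target's mean and spread but not its bimodality; capturing multimodal (pluralistic) reward distributions is precisely what the paper's neural CNF is for.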

Xinyi Yang, Liang Zeng, Heng Dong, Chao Yu, Xiaoran Wu, Huazhong Yang, Yu Wang, Milind Tambe, Tonghan Wang • 2025

Related benchmarks

Task                    Dataset   Result             Rank
Mathematical Reasoning  MathQA    Accuracy 80.4      95
Decision Inference      SMAC      Accuracy 76.4      11
Decision Inference      MMLU      Accuracy 0.772     11
Decision Inference      AI2-THOR  Success Rate 70.2  7
Human Evaluation        MathQA    Accuracy 89.2      3
