
Translate Policy to Language: Flow Matching Generated Rewards for LLM Explanations

About

As humans increasingly share environments with diverse agents powered by RL, LLMs, and beyond, the ability to explain agent policies in natural language is vital for reliable coexistence. We introduce a general-purpose framework that trains explanation-generating LLMs via reinforcement learning from AI feedback, with distributional rewards generated by generative continuous normalizing flows (CNFs). CNFs capture the pluralistic and probabilistic nature of human judgments about explanations. Moreover, under mild assumptions, CNFs provably bound deviations from true human reward distributions when trained on noisy proxy rewards from LLMs. We design a specialized CNF architecture that selectively attends to linguistic cues in the decision context and explanations when generating rewards. Human and LLM evaluators find that our method delivers explanations that enable more accurate predictions of true agent decisions, exhibit greater logical soundness and actionability, and impose lower cognitive load than explanations trained with proxy LLM rewards or state-of-the-art RLHF and RLAIF baselines.
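As a rough illustration of the reward-generation idea only (not the paper's architecture), the sketch below fits a toy conditional flow-matching model to samples from a hypothetical bimodal "human reward" distribution for a single (context, explanation) pair, then Euler-integrates the learned velocity field from noise to generate new reward samples. The reward distribution, the per-time-bin affine regressor, and all constants are illustrative stand-ins for the paper's neural CNF, which conditions on linguistic cues in the context and explanation.

```python
# Toy sketch of flow matching for distributional rewards (illustrative only:
# the paper's CNF is a neural velocity field conditioned on linguistic cues;
# a per-time-bin affine regressor and a made-up reward distribution stand in
# for both here).
import numpy as np

rng = np.random.default_rng(0)

def sample_rewards(n):
    # Hypothetical bimodal "human reward" distribution for one
    # (context, explanation) pair, mimicking pluralistic judgments.
    hi = rng.random(n) < 0.6
    return np.where(hi, rng.normal(0.8, 0.05, n), rng.normal(0.3, 0.1, n))

# Flow-matching training pairs: linear path x_t = (1-t) x0 + t x1,
# conditional target velocity v* = x1 - x0.
n = 20000
x0 = rng.normal(0.0, 1.0, n)   # noise samples
x1 = sample_rewards(n)         # reward samples
t = rng.random(n)
xt = (1.0 - t) * x0 + t * x1
v_target = x1 - x0

# "Train" the velocity field: an affine model a_k + b_k * x per time bin,
# fit by least squares (standing in for the neural CNF).
K = 50
params = np.zeros((K, 2))
for k in range(K):
    m = (t >= k / K) & (t < (k + 1) / K)
    X = np.stack([np.ones(m.sum()), xt[m]], axis=1)
    params[k], *_ = np.linalg.lstsq(X, v_target[m], rcond=None)

def generate(n_samples):
    # Sample rewards by Euler-integrating dx/dt = v(x, t) from noise.
    x = rng.normal(0.0, 1.0, n_samples)
    for a, b in params:
        x = x + (a + b * x) / K
    return x

samples = generate(5000)
# The generated mean should land near the data mean 0.6*0.8 + 0.4*0.3 = 0.6.
print(samples.mean(), samples.std())
```

Because the toy velocity field is affine in x, the generated distribution matches the target's mean and spread but not its bimodality; capturing multimodal (pluralistic) reward distributions is precisely what the paper's neural CNF is for.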

Xinyi Yang, Liang Zeng, Heng Dong, Chao Yu, Xiaoran Wu, Huazhong Yang, Yu Wang, Milind Tambe, Tonghan Wang • 2025

Related benchmarks

Task                    Dataset   Result             Rank
Mathematical Reasoning  MathQA    Accuracy 80.4      95
Decision Inference      SMAC      Accuracy 76.4      11
Decision Inference      MMLU      Accuracy 0.772     11
Decision Inference      AI2-THOR  Success Rate 70.2  7
Human Evaluation        MathQA    Accuracy 89.2      3
