
Conformal Feedback Alignment: Quantifying Answer-Level Reliability for Robust LLM Alignment

About

Preference-based alignment methods such as Reinforcement Learning from Human Feedback (RLHF) learn from pairwise preferences, yet the labels are often noisy and inconsistent. Existing uncertainty-aware approaches weight preferences but ignore a more fundamental factor: the reliability of the *answers* being compared. To address this problem, we propose Conformal Feedback Alignment (CFA), a framework that grounds preference weighting in the statistical guarantees of Conformal Prediction (CP). CFA quantifies answer-level reliability by constructing conformal prediction sets with controllable coverage and aggregates these reliabilities into principled weights for both DPO- and PPO-style training. Experiments across multiple datasets show that CFA improves alignment robustness and data efficiency, highlighting that modeling *answer-side* uncertainty complements preference-level weighting. Code is provided here.
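To make the pipeline concrete, below is a minimal Python sketch of how answer-level reliability could be derived from split conformal prediction sets and turned into pairwise preference weights. All function names and the aggregation rules are hypothetical illustrations, not the authors' released code: the sketch uses a standard classification-style calibration setup, while CFA's exact nonconformity score, set construction for free-form answers, and weighting rule may differ.

```python
import numpy as np

def conformal_quantile(cal_scores: np.ndarray, alpha: float) -> float:
    """Finite-sample-corrected (1 - alpha) quantile of calibration
    nonconformity scores, as in standard split conformal prediction."""
    n = len(cal_scores)
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(cal_scores, q_level, method="higher")

def prediction_set(probs: np.ndarray, qhat: float) -> np.ndarray:
    """Indices whose nonconformity score 1 - p(label) is at most qhat.
    Coverage is controlled by alpha; smaller sets signal a less
    ambiguous, hence more reliable, answer."""
    return np.where(1.0 - probs <= qhat)[0]

def answer_reliability(probs: np.ndarray, qhat: float) -> float:
    """Map conformal set size to a reliability in (0, 1]: a singleton
    set gives 1, larger sets give smaller values. (Hypothetical rule;
    the paper's aggregation may differ.)"""
    size = len(prediction_set(probs, qhat))
    return 1.0 / max(size, 1)

def preference_weight(rel_chosen: float, rel_rejected: float) -> float:
    """Combine the two answers' reliabilities into one weight for a
    DPO- or PPO-style pairwise loss (simple product; hypothetical)."""
    return rel_chosen * rel_rejected
```

Under these assumptions, a DPO-style objective would scale each pair's loss by `preference_weight(rel_chosen, rel_rejected)`, down-weighting comparisons in which either answer falls in a large (ambiguous) conformal set.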

Tiejin Chen, Xiaoou Liu, Vishnu Nandam, Kuan-Ru Liou, Hua Wei • 2026

Related benchmarks

Task                             Dataset            Result                Rank
Preference Alignment Evaluation  Pairwise           Average Score: 92.12  18
Question Answering               WebGPT             Average Score: 76.42  18
Text Summarization               Summarize          Average Score: 67.39  18
Summarization                    Summarize dataset  Win Rate: 0.6433      3
