Ask Again, Then Fail: Large Language Models' Vacillations in Judgment
About
We observe that current conversational language models often waver in their judgments when faced with follow-up questions, even if the original judgment was correct. This wavering presents a significant challenge for generating reliable responses and building user trust. To comprehensively assess this issue, we introduce a **Follow-up Questioning Mechanism** along with two metrics to quantify this inconsistency, confirming its widespread presence in current language models. To mitigate this issue, we explore various prompting strategies for closed-source models; moreover, we develop a training-based framework, **Unwavering-FQ**, that teaches language models to maintain their originally correct judgments through synthesized high-quality preference data. Our experimental results confirm the effectiveness of our framework and its ability to enhance the general capabilities of models.
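The mechanism described above can be sketched as a simple evaluation loop: ask a question, and if the model's initial judgment is correct, issue a challenging follow-up and check whether the judgment flips. The sketch below is a minimal illustration, not the paper's implementation; the chat interface, the challenge prompt, and the metric names (`modification`, `modification_rate`) are assumptions introduced for clarity.

```python
from typing import Callable

def judgment_consistency(
    model: Callable[[list[dict]], str],  # hypothetical chat-style interface
    questions: list[str],
    gold_answers: list[str],
    challenge: str = "Are you sure? Please think again.",  # illustrative follow-up
) -> dict[str, float]:
    """Probe a model with a follow-up challenge after each initially
    correct answer and measure how often the judgment flips.

    Returns two inconsistency measures (the names and normalizations
    here are assumptions, not the paper's exact definitions):
      - modification: flipped judgments over ALL questions
      - modification_rate: flipped judgments over initially correct ones
    """
    correct_before = 0
    flipped = 0
    for question, gold in zip(questions, gold_answers):
        history = [{"role": "user", "content": question}]
        first = model(history)
        if first.strip() != gold:
            continue  # only initially correct judgments are probed
        correct_before += 1
        history += [
            {"role": "assistant", "content": first},
            {"role": "user", "content": challenge},
        ]
        second = model(history)
        if second.strip() != gold:
            flipped += 1
    n = len(questions)
    return {
        "modification": flipped / n if n else 0.0,
        "modification_rate": flipped / correct_before if correct_before else 0.0,
    }
```

A model that never changes a correct answer scores 0.0 on both measures; a model that abandons every correct judgment under questioning scores a modification rate of 1.0.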
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Follow-up Questioning Consistency | MultiArith (unseen) | Average Success Count | 18.33 | 12 |
| Follow-up Questioning Consistency | StrategyQA (unseen) | Average Success Count (M.) | 13.25 | 12 |
| Judgment Consistency | CoinFlip (unseen) | Baseline Score | 51.8 | 9 |
| Follow-up Questioning Consistency | CoinFlip (unseen) | Baseline Consistency Score | 52.2 | 3 |