Ask Again, Then Fail: Large Language Models' Vacillations in Judgment
About
We observe that current conversational language models often waver in their judgments when faced with follow-up questions, even if the original judgment was correct. This wavering presents a significant challenge for generating reliable responses and building user trust. To comprehensively assess this issue, we introduce a **Follow-up Questioning Mechanism** along with two metrics to quantify this inconsistency, confirming its widespread presence in current language models. To mitigate this issue, we explore various prompting strategies for closed-source models; moreover, we develop a training-based framework, **Unwavering-FQ**, that teaches language models to maintain their originally correct judgments through synthesized high-quality preference data. Our experimental results confirm the effectiveness of our framework and its ability to enhance the general capabilities of models.
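The mechanism described above can be sketched as a simple evaluation loop: ask a question, and if the model's initial judgment is correct, issue a challenging follow-up and check whether the judgment flips. The sketch below is a minimal illustration, not the paper's implementation; the chat interface, the challenge prompt, and the metric names (`modification`, `modification_rate`) are assumptions introduced for clarity.

```python
from typing import Callable

def judgment_consistency(
    model: Callable[[list[dict]], str],  # hypothetical chat-style interface
    questions: list[str],
    gold_answers: list[str],
    challenge: str = "Are you sure? Please think again.",  # illustrative follow-up
) -> dict[str, float]:
    """Probe a model with a follow-up challenge after each initially
    correct answer and measure how often the judgment flips.

    Returns two inconsistency measures (the names and normalizations
    here are assumptions, not the paper's exact definitions):
      - modification: flipped judgments over ALL questions
      - modification_rate: flipped judgments over initially correct ones
    """
    correct_before = 0
    flipped = 0
    for question, gold in zip(questions, gold_answers):
        history = [{"role": "user", "content": question}]
        first = model(history)
        if first.strip() != gold:
            continue  # only initially correct judgments are probed
        correct_before += 1
        history += [
            {"role": "assistant", "content": first},
            {"role": "user", "content": challenge},
        ]
        second = model(history)
        if second.strip() != gold:
            flipped += 1
    n = len(questions)
    return {
        "modification": flipped / n if n else 0.0,
        "modification_rate": flipped / correct_before if correct_before else 0.0,
    }
```

A model that never changes a correct answer scores 0.0 on both measures; a model that abandons every correct judgment under questioning scores a modification rate of 1.0.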
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Follow-up Questioning Consistency | MultiArith (unseen) | Average Success Count | 18.33 | 12 |
| Follow-up Questioning Consistency | StrategyQA (unseen) | Average Success Count (M.) | 13.25 | 12 |
| Judgment Consistency | CoinFlip (unseen) | Baseline Score | 51.8 | 9 |
| Follow-up Questioning Consistency | CoinFlip (unseen) | Baseline Consistency Score | 52.2 | 3 |