No-Worse Context-Aware Decoding: Preventing Neutral Regression in Context-Conditioned Generation
About
Large language models (LLMs) can answer questions and summarize documents when conditioned on external contexts (e.g., retrieved evidence), yet context use remains unreliable: models may overwrite an already-correct output (neutral regression) even when the context is non-informative. We formalize neutral regression as a do-no-harm requirement and quantify it by measuring accuracy drops on baseline-correct items under answer-consistent contexts. We propose No-Worse Context-Aware Decoding (NWCAD), a decode-time adapter built on a two-stream setup with a two-stage gate: it backs off to no-context decoding when the context is non-informative, and otherwise uses context-conditioned decoding with a CAD-style fallback under uncertainty. We evaluate NWCAD on benchmarks that separate do-no-harm reliability from context utilization (accuracy gains on genuinely helpful contexts). NWCAD prevents neutral regression on baseline-correct items while preserving strong context-driven accuracy on helpful contexts.
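The abstract does not give the gate's exact criteria, so the following is only a minimal sketch of how a two-stage gate over two logit streams and a neutral-regression metric might look. Everything specific here is an assumption for illustration, not the authors' method: the function names, the use of Jensen-Shannon divergence between the two streams as the "non-informative context" test, entropy of the context-conditioned distribution as the uncertainty test, the threshold values, and the standard CAD contrast `(1 + alpha) * with_ctx - alpha * no_ctx` as the fallback.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def js_divergence(p, q):
    # Jensen-Shannon divergence in nats (assumed informativeness signal).
    mid = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(x, y):
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0.0)
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)

def gated_logits(no_ctx, with_ctx, info_threshold=0.05,
                 entropy_threshold=1.0, alpha=0.5):
    """Hypothetical two-stage gate for one decoding step.

    Stage 1: if the context barely changes the next-token distribution,
             treat it as non-informative and back off to no-context logits.
    Stage 2: otherwise use context-conditioned logits, unless that
             distribution is uncertain, in which case fall back to a
             CAD-style contrast of the two streams.
    All thresholds are illustrative assumptions.
    """
    p_no, p_ctx = softmax(no_ctx), softmax(with_ctx)
    if js_divergence(p_no, p_ctx) < info_threshold:
        return "no_context", list(no_ctx)
    if entropy(p_ctx) > entropy_threshold:
        cad = [(1 + alpha) * c - alpha * n for c, n in zip(with_ctx, no_ctx)]
        return "cad_fallback", cad
    return "context", list(with_ctx)

def neutral_regression_rate(baseline_correct, context_correct):
    # Fraction of baseline-correct items that become wrong once the
    # (answer-consistent) context is added; 0.0 means no regression.
    kept = [c for b, c in zip(baseline_correct, context_correct) if b]
    if not kept:
        return 0.0
    return 1.0 - sum(kept) / len(kept)
```

In this sketch, `gated_logits` would be called once per generated token with the logits from the no-context and context-conditioned forward passes, and `neutral_regression_rate` would be computed over per-item correctness flags on the baseline-correct subset.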
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Question Answering | PopQA | Accuracy | 87.12 | 103 |
| Table Question Answering | TabMWP | Accuracy | 63.6 | 97 |
| Question Answering | NQ-Open (val) | Accuracy | 49.62 | 46 |
| Question Answering | NQ-Swap | Accuracy | 73 | 38 |
| Dialogue Summarization | TofuEval | ToFuEval Score | 83.12 | 18 |
| Long-form Question Answering | ExpertQA | ROUGE-L | 23.34 | 18 |
| Question Answering | Restate hard | Accuracy | 94.4 | 18 |
| Question Answering | Distractor hard | Accuracy (Distractor hard) | 62.2 | 18 |
| Question Answering | HELPFUL | Accuracy | 90.21 | 18 |
| Question Answering | NQ SYNTH | Accuracy | 79 | 18 |