Enhancing Noise Robustness of Retrieval-Augmented Language Models with Adaptive Adversarial Training
About
Large Language Models (LLMs) exhibit substantial capabilities yet face challenges including hallucination, outdated knowledge, and untraceable reasoning processes. Retrieval-augmented generation (RAG) has emerged as a promising solution, integrating knowledge from external databases to mitigate these challenges. However, inappropriate retrieved passages can hinder the LLMs' capacity to generate comprehensive, high-quality responses. Prior RAG studies on robustness to retrieval noise often confine themselves to a limited set of noise types, deviating from real-world retrieval environments and limiting practical applicability. In this study, we first investigate retrieval noise and categorize it into three distinct types that reflect real-world environments, then analyze the impact of each noise type on the robustness of LLMs. Subsequently, we propose a novel RAG approach called Retrieval-augmented Adaptive Adversarial Training (RAAT). RAAT leverages adaptive adversarial training to dynamically adjust the model's training process in response to retrieval noise, and concurrently employs multi-task learning to ensure the model can internally recognize noisy contexts. Extensive experiments demonstrate that a LLaMA-2 7B model trained with RAAT achieves significant improvements in F1 and EM scores under diverse noise conditions. For reproducibility, we release our code and data at: https://github.com/calubkk/RAAT.
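The core idea of adaptive adversarial training described above can be sketched as follows. This is a minimal, hypothetical simplification (the function names, the toy loss values, and the `alpha` weighting are assumptions, not the paper's actual implementation): at each step the model is trained on whichever retrieval context currently hurts it most, while a multi-task noise-recognition loss is added so the model learns to identify noisy contexts.

```python
# Minimal sketch of adaptive adversarial training for RAG.
# Hypothetical simplification: in the paper, losses come from a
# LLaMA-2 7B model over (question, context) pairs; here they are
# plain floats so the selection logic is easy to follow.

def select_adversarial_context(gen_losses):
    """Return the index of the retrieval context with the highest
    generation loss, i.e. the currently most adversarial one."""
    return max(range(len(gen_losses)), key=lambda i: gen_losses[i])

def raat_step(gen_losses, noise_cls_loss, alpha=0.5):
    """Combine the adversarially selected generation loss with a
    multi-task noise-recognition loss.

    gen_losses: one generation loss per candidate context
                (e.g. [golden, irrelevant-noise, counterfactual-noise]).
    noise_cls_loss: auxiliary loss for classifying whether a context
                is noisy (the multi-task component).
    alpha: mixing weight for the auxiliary loss (an assumption).
    """
    worst = select_adversarial_context(gen_losses)
    total = gen_losses[worst] + alpha * noise_cls_loss
    return total, worst

# Toy usage: the second (noisy) context has the highest loss,
# so this step trains against it.
total, worst = raat_step([1.2, 2.7, 0.9], noise_cls_loss=0.4)
```

In a real training loop, `gen_losses` would be recomputed each step, so the "worst" context adapts as the model becomes robust to one noise type and another becomes dominant.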
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multi-hop Question Answering | HotpotQA | F1 Score | 33.3 | 221 |
| Question Answering | PubMedQA (test) | Accuracy | 46.8 | 81 |
| Multi-hop Question Answering | HotpotQA | SubEM | 33.58 | 40 |
| Open-domain Question Answering | NaturalQuestions (NQ) | SubEM | 50.12 | 40 |
| Open-domain Question Answering | TriviaQA | SubEM | 68.54 | 40 |
| Question Answering | NQ, TriviaQA, and WebQ (test) | Accuracy | 46.2 | 21 |
| Retrieval-Augmented Generation | RAG-Bench | F1 (Golden Only) | 87.15 | 11 |
| Retrieval-Augmented Generation | PubMedQA | Accuracy | 46.6 | 8 |
| Question Answering | ConFiQA-QA counterfactual contexts | Accuracy | 43.5 | 7 |
| Retrieval-Augmented Generation | BioASQ | Accuracy | 64.9 | 5 |