RARR: Researching and Revising What Language Models Say, Using Language Models
About
Language models (LMs) now excel at many tasks such as few-shot learning, question answering, reasoning, and dialog. However, they sometimes generate unsupported or misleading content. A user cannot easily determine whether their outputs are trustworthy, because most LMs have no built-in mechanism for attribution to external evidence. To enable attribution while still preserving all the powerful advantages of recent generation models, we propose RARR (Retrofit Attribution using Research and Revision), a system that 1) automatically finds attribution for the output of any text generation model and 2) post-edits the output to fix unsupported content while preserving the original output as much as possible. When applied to the output of several state-of-the-art LMs on a diverse set of generation tasks, we find that RARR significantly improves attribution while otherwise preserving the original text to a much greater degree than previously explored edit models. Furthermore, the implementation of RARR requires only a handful of training examples, a large language model, and standard web search.
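The two stages described above (research: find evidence for each claim; revision: minimally edit unsupported content) can be sketched as a simple control loop. This is a hypothetical illustration, not the paper's implementation: the stub functions `generate_queries`, and the `search_fn`/`agrees_fn`/`edit_fn` callbacks stand in for the LM prompting and web search the real system uses.

```python
# Hypothetical sketch of a RARR-style research-and-revise loop.
# In the actual system, query generation, agreement checking, and
# editing are each performed by prompting a large language model,
# and evidence comes from web search.

def generate_queries(passage):
    # Research stage, step 1: produce questions covering the claims
    # in the passage. Stubbed here as one query per sentence.
    return [s.strip() for s in passage.split(".") if s.strip()]

def rarr(passage, search_fn, agrees_fn, edit_fn):
    """Return (revised_passage, attribution_report).

    search_fn(query) -> evidence text
    agrees_fn(passage, evidence) -> True if the passage is supported
    edit_fn(passage, evidence) -> minimally edited passage
    """
    attribution = []  # (query, evidence) pairs backing the final text
    for query in generate_queries(passage):
        evidence = search_fn(query)  # research stage, step 2
        if not agrees_fn(passage, evidence):
            # Revision stage: fix only the unsupported content,
            # preserving the rest of the original output.
            passage = edit_fn(passage, evidence)
        attribution.append((query, evidence))
    return passage, attribution
```

A toy run with string-matching stubs shows the flow: a passage containing an unsupported claim is revised against the retrieved evidence, and each query-evidence pair is kept as the attribution report.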
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multi-document summarization | MDS | Length | 843.6 | 14 |
| Long-form Question Answering | ALCE LFQA | ROUGE-L | 35.2 | 7 |
| Scientific Fact-Checking | BIONLI 300 | Balanced Accuracy | 66.4 | 7 |
| Scientific Fact-Checking | CLIMATE-FEVER 2-way (Supported/Refuted subsets) | Balanced Accuracy | 70.4 | 7 |
| Cell-level attribution | FetaQA (gold set) | Precision | 0.2005 | 6 |
| Cell-level attribution | AITQA | Precision | 31.96 | 6 |
| Column-level attribution | ToTTo | Precision | 90.51 | 6 |
| Row-level attribution | AITQA | Precision | 66.82 | 6 |
| Scientific Fact-Checking | PubMedFact1k 3-way | Macro F1 | 72.3 | 6 |
| Cell-level attribution | ToTTo (gold set) | Precision | 20.51 | 6 |