UPRISE: Universal Prompt Retrieval for Improving Zero-Shot Evaluation
About
Large Language Models (LLMs) are popular for their impressive abilities, but the need for model-specific fine-tuning or task-specific prompt engineering can hinder their generalization. We propose UPRISE (Universal Prompt Retrieval for Improving zero-Shot Evaluation), which tunes a lightweight and versatile retriever that automatically retrieves prompts for a given zero-shot task input. Specifically, we demonstrate universality in a cross-task and cross-model scenario: the retriever is tuned on a diverse set of tasks, but tested on unseen task types; we use a small frozen LLM, GPT-Neo-2.7B, for tuning the retriever, but test the retriever on different LLMs of much larger scales, such as BLOOM-7.1B, OPT-66B and GPT3-175B. Additionally, we show that UPRISE mitigates the hallucination problem in our experiments with ChatGPT, suggesting its potential to improve even the strongest LLMs. Our model and code are available at https://github.com/microsoft/LMOps.
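The core mechanism can be sketched in a few lines: embed the zero-shot task input, score a pool of candidate prompts by similarity, and prepend the best match before querying a frozen LLM. The sketch below is illustrative only; the `embed` stub is a hypothetical bag-of-words stand-in for UPRISE's tuned dense retriever, and the prompt pool is made up.

```python
# Minimal sketch of UPRISE-style prompt retrieval. The embed() below is a toy
# bag-of-words stand-in; UPRISE tunes a dense encoder for this role.
from collections import Counter
import math

def embed(text):
    # Hypothetical stand-in for the tuned retriever's encoder.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_prompts(task_input, prompt_pool, k=1):
    # Rank candidate prompts by similarity to the task input.
    q = embed(task_input)
    ranked = sorted(prompt_pool, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]

# Illustrative prompt pool (not from the paper's training data).
prompt_pool = [
    "Premise: A man is sleeping. Hypothesis: A man is awake. Answer: contradiction",
    "Question: What is the capital of France? Answer: Paris",
]
task_input = "Premise: The dog runs. Hypothesis: The dog is still. Answer:"
best = retrieve_prompts(task_input, prompt_pool, k=1)[0]
# Prepend the retrieved prompt; the concatenation is fed to a frozen LLM.
llm_input = best + "\n\n" + task_input
```

In UPRISE the retriever itself is tuned using signal from a small frozen LLM (GPT-Neo-2.7B), and the same retriever then serves much larger LLMs at test time; only the retrieval-and-prepend inference step is shown here.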
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag | Accuracy | 54.3 | 1460 |
| Natural Language Inference | RTE | Accuracy | 55.2 | 367 |
| Question Answering | OBQA | Accuracy | 49.8 | 276 |
| Question Answering | ARC-E | Accuracy | 64.1 | 242 |
| Natural Language Inference | SNLI | Accuracy | 75.5 | 174 |
| Question Answering | ARC-C | Accuracy | 32.9 | 166 |
| Commonsense Reasoning | COPA | Accuracy | 72.0 | 138 |
| Sentiment Analysis | SST-5 | Accuracy | 52.6 | 47 |
| Natural Language Inference | QNLI | Accuracy | 72.5 | 42 |
| Summarization | Gigaword | ROUGE-L | 25.8 | 38 |