UPRISE: Universal Prompt Retrieval for Improving Zero-Shot Evaluation
About
Large Language Models (LLMs) are popular for their impressive abilities, but the need for model-specific fine-tuning or task-specific prompt engineering can hinder their generalization. We propose UPRISE (Universal Prompt Retrieval for Improving zero-Shot Evaluation), which tunes a lightweight and versatile retriever that automatically retrieves prompts for a given zero-shot task input. Specifically, we demonstrate universality in a cross-task and cross-model scenario: the retriever is tuned on a diverse set of tasks, but tested on unseen task types; we use a small frozen LLM, GPT-Neo-2.7B, for tuning the retriever, but test the retriever on different LLMs of much larger scales, such as BLOOM-7.1B, OPT-66B and GPT3-175B. Additionally, we show that UPRISE mitigates the hallucination problem in our experiments with ChatGPT, suggesting its potential to improve even the strongest LLMs. Our model and code are available at https://github.com/microsoft/LMOps.
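The core mechanism can be sketched in a few lines: embed the zero-shot task input, score a pool of candidate prompts by similarity, and prepend the best match before querying a frozen LLM. The sketch below is illustrative only; the `embed` stub is a hypothetical bag-of-words stand-in for UPRISE's tuned dense retriever, and the prompt pool is made up.

```python
# Minimal sketch of UPRISE-style prompt retrieval. The embed() below is a toy
# bag-of-words stand-in; UPRISE tunes a dense encoder for this role.
from collections import Counter
import math

def embed(text):
    # Hypothetical stand-in for the tuned retriever's encoder.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_prompts(task_input, prompt_pool, k=1):
    # Rank candidate prompts by similarity to the task input.
    q = embed(task_input)
    ranked = sorted(prompt_pool, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]

# Illustrative prompt pool (not from the paper's training data).
prompt_pool = [
    "Premise: A man is sleeping. Hypothesis: A man is awake. Answer: contradiction",
    "Question: What is the capital of France? Answer: Paris",
]
task_input = "Premise: The dog runs. Hypothesis: The dog is still. Answer:"
best = retrieve_prompts(task_input, prompt_pool, k=1)[0]
# Prepend the retrieved prompt; the concatenation is fed to a frozen LLM.
llm_input = best + "\n\n" + task_input
```

In UPRISE the retriever itself is tuned using signal from a small frozen LLM (GPT-Neo-2.7B), and the same retriever then serves much larger LLMs at test time; only the retrieval-and-prepend inference step is shown here.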
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag | Accuracy | 54.3 | 1460 |
| Natural Language Inference | RTE | Accuracy | 55.2 | 367 |
| Question Answering | OBQA | Accuracy | 49.8 | 276 |
| Question Answering | ARC-E | Accuracy | 64.1 | 242 |
| Natural Language Inference | SNLI | Accuracy | 75.5 | 174 |
| Question Answering | ARC-C | Accuracy | 32.9 | 166 |
| Commonsense Reasoning | COPA | Accuracy | 72.0 | 138 |
| Sentiment Analysis | SST-5 | Accuracy | 52.6 | 47 |
| Natural Language Inference | QNLI | Accuracy | 72.5 | 42 |
| Summarization | Gigaword | ROUGE-L | 25.8 | 38 |