Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents
About
Large Language Models (LLMs) have demonstrated remarkable zero-shot generalization across various language-related tasks, including search. However, existing work utilizes the generative ability of LLMs for Information Retrieval (IR) rather than direct passage ranking. The discrepancy between the pre-training objectives of LLMs and the ranking objective poses another challenge. In this paper, we first investigate generative LLMs such as ChatGPT and GPT-4 for relevance ranking in IR. Surprisingly, our experiments reveal that properly instructed LLMs can deliver results competitive with, and even superior to, state-of-the-art supervised methods on popular IR benchmarks. Furthermore, to address concerns about data contamination of LLMs, we collect a new test set called NovelEval, based on the latest knowledge and aiming to verify the model's ability to rank unknown knowledge. Finally, to improve efficiency in real-world applications, we delve into the potential for distilling the ranking capabilities of ChatGPT into small specialized models using a permutation distillation scheme. Our evaluation shows that a distilled 440M model outperforms a 3B supervised model on the BEIR benchmark. The code to reproduce our results is available at www.github.com/sunnweiwei/RankGPT.
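To make "direct passage ranking" concrete, here is a minimal sketch of how a ranking emitted by an instructed LLM could be applied to a candidate list. The output format (`[2] > [3] > [1]`, with 1-based passage identifiers) and the helper name `apply_permutation` are assumptions made for illustration, not details taken from the abstract; no model is called here.

```python
import re


def apply_permutation(passages, permutation_text):
    """Reorder passages by a ranking string such as "[2] > [3] > [1]".

    Assumed format: 1-based identifiers in square brackets, most relevant first.
    """
    # Pull out the identifiers in the order they were listed.
    listed = [int(m) - 1 for m in re.findall(r"\[(\d+)\]", permutation_text)]

    order = []
    for idx in listed:
        # Skip out-of-range or repeated identifiers, which a model may emit.
        if 0 <= idx < len(passages) and idx not in order:
            order.append(idx)

    # Passages that were not mentioned keep their original relative order.
    order += [i for i in range(len(passages)) if i not in order]
    return [passages[i] for i in order]


candidates = ["passage A", "passage B", "passage C"]
reranked = apply_permutation(candidates, "[2] > [3] > [1]")
```

Falling back to the original order for omitted or invalid identifiers is one possible design choice; it guarantees the result is always a full permutation of the input.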
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Question Answering | Scientific QA Base setting | F1 Score | 51.95 | 38 |
| Ranking | BEIR selected subset v1.0.0 (test) | TREC-COVID | 82.34 | 38 |
| Information Retrieval | Scientific QA Base setting | HitRate@1 | 52 | 38 |
| Reranking | BEIR | NQ NDCG@5 | 0.4563 | 35 |
| Reranking | TREC | NDCG@5 (DL19) | 68.58 | 35 |
| Abstract generation | LongLaMP | R1 | 42.5 | 32 |
| Passage Ranking | NQ | MRR | 45.05 | 29 |
| Recommendation | Goodreads (test) | HR@5 | 57.63 | 29 |
| Passage Ranking | TREC DL 2019 | R@10 | 90 | 28 |
| Passage retrieval | Natural Questions (NQ) | Top-10 Accuracy | 58.33 | 28 |
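
Several rows above report rank-based metrics such as NDCG@5 and MRR. As a rough illustration of what these measure, the sketch below computes both for a single query from a list of graded relevance labels given in ranked order. It assumes the exponential-gain form of DCG and, as a simplification, computes the ideal ordering from the same list rather than from all judged documents; official evaluation tools may differ on both points.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items (exponential gain)."""
    return sum(
        (2 ** rel - 1) / math.log2(position + 2)
        for position, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the best possible ordering of the same list."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def reciprocal_rank(relevances):
    """1 / rank of the first relevant item, or 0.0 if none is relevant."""
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


# Relevance labels of the returned passages, in ranked order (hypothetical data).
ranked_labels = [0, 2, 1, 0, 3]
ndcg5 = ndcg_at_k(ranked_labels, 5)
rr = reciprocal_rank(ranked_labels)
```

MRR is the mean of `reciprocal_rank` over all queries, and a benchmark-level NDCG@5 is likewise the mean of the per-query values.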