
MedCPT: Contrastive Pre-trained Transformers with Large-scale PubMed Search Logs for Zero-shot Biomedical Information Retrieval

About

Information retrieval (IR) is essential in biomedical knowledge acquisition and clinical decision support. While recent progress has shown that language model encoders perform better semantic retrieval, training such models requires abundant query-article annotations that are difficult to obtain in biomedicine. As a result, most biomedical IR systems only conduct lexical matching. In response, we introduce MedCPT, a first-of-its-kind Contrastively Pre-trained Transformer model for zero-shot semantic IR in biomedicine. To train MedCPT, we collected 255 million user click logs from PubMed, an unprecedented scale for this setting. With this data, we use contrastive learning to train a closely integrated retriever and re-ranker pair. Experimental results show that MedCPT sets new state-of-the-art performance on six biomedical IR tasks, outperforming various baselines including much larger models such as the GPT-3-sized cpt-text-XL. MedCPT also generates better biomedical article and sentence representations for semantic evaluations. As such, MedCPT can be readily applied to various real-world biomedical IR tasks.
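To make the contrastive-learning step in the abstract concrete, here is a minimal sketch of an in-batch-negative (InfoNCE-style) loss over query and article embeddings. It is a generic illustration, not the authors' training code: the function name `in_batch_contrastive_loss`, the dot-product scoring, the `temperature` parameter, and the toy vectors are all assumptions made for this example. Each query's clicked article is treated as the positive, and the other articles in the batch act as negatives.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_contrastive_loss(query_vecs, doc_vecs, temperature=1.0):
    """Mean InfoNCE-style loss with in-batch negatives (illustrative sketch).

    doc_vecs[i] is assumed to be the clicked article for query_vecs[i];
    every other article in the batch serves as a negative for that query.
    """
    if len(query_vecs) != len(doc_vecs):
        raise ValueError("expected one positive article per query")
    total = 0.0
    for i, q in enumerate(query_vecs):
        scores = [dot(q, d) / temperature for d in doc_vecs]
        # Numerically stable log-sum-exp over all articles in the batch.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        # Negative log-probability of the clicked (positive) article.
        total += log_z - scores[i]
    return total / len(query_vecs)


# Toy embeddings (hypothetical): two queries and their clicked articles.
queries = [[1.0, 0.0], [0.0, 1.0]]
clicked = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = in_batch_contrastive_loss(queries, clicked)
misaligned_loss = in_batch_contrastive_loss(queries, swapped)
```

In this toy batch the loss is lower when each query is paired with the article it matches than when the pairs are swapped, which is the signal that pulls query and clicked-article embeddings together during training.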

Qiao Jin, Won Kim, Qingyu Chen, Donald C. Comeau, Lana Yeganova, W. John Wilbur, Zhiyong Lu • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Medical Question Answering | MedMCQA | Accuracy | 73.85 | 521 |
| Medical Question Answering | PubMedQA | Accuracy | 50.2 | 117 |
| Medical Question Answering | MMLU Med | Accuracy | 89.81 | 86 |
| Medical Question Answering | BioASQ | Accuracy | 81.39 | 63 |
| Information Retrieval | TREC-COVID | nDCG@10 | 69.7 | 59 |
| Semantic Textual Similarity | BIOSSES | Spearman Correlation | 83.7 | 55 |
| Information Retrieval | SciFact | nDCG@10 | 0.724 | 51 |
| Information Retrieval | COVID | nDCG@10 | 54.66 | 50 |
| Medical Question Answering | MedQA US | Accuracy | 80.68 | 43 |
| Information Retrieval | NFCorpus | nDCG@10 | 28.43 | 33 |
Showing 10 of 21 rows
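
Several rows above report nDCG@10. As a reading aid, the following is a minimal sketch of how that metric is commonly computed for a single query. It assumes the linear-gain variant (some evaluations use a gain of 2^rel - 1 instead), and the function names and the `judged_relevances` argument are hypothetical choices for this example, not taken from the paper or the listing. The function returns a value between 0 and 1; most scores in the listing appear to be on a 0 to 100 scale, while the SciFact entry appears to be on a 0 to 1 scale.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k relevance grades (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, judged_relevances, k=10):
    """nDCG@k for one query.

    ranked_relevances: relevance grades of the retrieved articles, in ranked order.
    judged_relevances: relevance grades of all judged articles for the query,
        used to build the ideal ranking.
    """
    ideal = dcg_at_k(sorted(judged_relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant articles judged for this query
    return dcg_at_k(ranked_relevances, k) / ideal


# Hypothetical example: one relevant article, retrieved at rank 2 instead of rank 1.
score = ndcg_at_k([0, 1], [1, 0])
```

A benchmark score is then typically the mean of this per-query value over all queries in the test set.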
