MedCPT: Contrastive Pre-trained Transformers with Large-scale PubMed Search Logs for Zero-shot Biomedical Information Retrieval

About

Information retrieval (IR) is essential in biomedical knowledge acquisition and clinical decision support. While recent progress has shown that language model encoders perform better semantic retrieval, training such models requires abundant query-article annotations that are difficult to obtain in biomedicine. As a result, most biomedical IR systems only conduct lexical matching. In response, we introduce MedCPT, a first-of-its-kind Contrastively Pre-trained Transformer model for zero-shot semantic IR in biomedicine. To train MedCPT, we collected an unprecedented scale of 255 million user click logs from PubMed. With such data, we use contrastive learning to train a pair of closely-integrated retriever and re-ranker. Experimental results show that MedCPT sets new state-of-the-art performance on six biomedical IR tasks, outperforming various baselines including much larger models such as GPT-3-sized cpt-text-XL. In addition, MedCPT also generates better biomedical article and sentence representations for semantic evaluations. As such, MedCPT can be readily applied to various real-world biomedical IR tasks.
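The abstract says the retriever is trained with contrastive learning on query-article click pairs but does not spell out the objective. As a rough illustration only, the sketch below shows one common form of contrastive loss (InfoNCE with in-batch negatives). This is an assumption about the general technique, not MedCPT's confirmed training code. The function names, the dot-product scoring, and the `temperature` parameter are all hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors given as lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_contrastive_loss(query_vecs, article_vecs, temperature=1.0):
    """Mean InfoNCE loss with in-batch negatives (illustrative assumption).

    article_vecs[i] is treated as the clicked (positive) article for
    query_vecs[i]; every other article in the batch acts as a negative.
    """
    if len(query_vecs) != len(article_vecs):
        raise ValueError("need exactly one positive article per query")
    losses = []
    for i, q in enumerate(query_vecs):
        # Similarity of this query to every article in the batch.
        logits = [dot(q, d) / temperature for d in article_vecs]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability assigned to the positive article.
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings: each query matches exactly one article direction.
queries = [[1.0, 0.0], [0.0, 1.0]]
articles_aligned = [[1.0, 0.0], [0.0, 1.0]]   # positives in the right slots
articles_swapped = [[0.0, 1.0], [1.0, 0.0]]   # positives in the wrong slots

loss_aligned = in_batch_contrastive_loss(queries, articles_aligned)
loss_swapped = in_batch_contrastive_loss(queries, articles_swapped)
```

On this toy batch the loss is lower when each query embedding points at its own article than when the pairs are swapped, which is the behavior a contrastive objective rewards.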

Qiao Jin, Won Kim, Qingyu Chen, Donald C. Comeau, Lana Yeganova, W. John Wilbur, Zhiyong Lu • 2023

Related benchmarks

Task	Dataset	Result
Medical Question Answering	MedMCQA	Accuracy 73.85	591
Medical Question Answering	MedQA	Accuracy 74.3	145
Medical Question Answering	PubMedQA	Accuracy 50.2	122
Medical Question Answering	MMLU Med	Accuracy 89.81	111
Medical Question Answering	BioASQ	Accuracy 81.39	63
Information Retrieval	TREC-COVID	NDCG@10 69.7	59
Semantic Textual Similarity	BIOSSES	Spearman Correlation 83.7	55
Information Retrieval	SciFact	nDCG@10 0.724	51
Information Retrieval	COVID	nDCG@10 54.66	50
Medical Question Answering	MedQA US	Accuracy 80.68	43

Showing 10 of 22 rows
