
Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings

About

Learning scientific document representations can be substantially improved through contrastive learning objectives, where the challenge lies in creating positive and negative training samples that encode the desired similarity semantics. Prior work relies on discrete citation relations to generate contrast samples. However, discrete citations enforce a hard cut-off to similarity. This is counter-intuitive to similarity-based learning, and ignores that scientific papers can be very similar despite lacking a direct citation - a core problem of finding related research. Instead, we use controlled nearest neighbor sampling over citation graph embeddings for contrastive learning. This control allows us to learn continuous similarity, to sample hard-to-learn negatives and positives, and also to avoid collisions between negative and positive samples by controlling the sampling margin between them. The resulting method SciNCL outperforms the state-of-the-art on the SciDocs benchmark. Furthermore, we demonstrate that it can train (or tune) models sample-efficiently, and that it can be combined with recent training-efficient methods. Perhaps surprisingly, even training a general-domain language model this way outperforms baselines pretrained in-domain.
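The rank-controlled sampling described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `sample_neighbors`, the parameters `pos_ranks` and `neg_ranks`, and the toy two-dimensional embeddings are all hypothetical, and a brute-force cosine ranking stands in for whatever nearest-neighbor index is actually used. The idea shown is only that positives and hard negatives come from two separate rank bands around a query paper, with the unsampled ranks between the bands acting as the margin that keeps the two sets from colliding.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sample_neighbors(query_id, graph_embeddings, pos_ranks, neg_ranks):
    """Pick positives and hard negatives from two rank bands (hypothetical helper).

    pos_ranks and neg_ranks are half-open (start, end) slices over the
    neighbor list sorted by decreasing similarity to the query. Ranks that
    fall between the two bands are never sampled; they form the margin.
    """
    pos_start, pos_end = pos_ranks
    neg_start, neg_end = neg_ranks
    if pos_end > neg_start:
        raise ValueError("positive and negative rank bands overlap")

    query = graph_embeddings[query_id]
    ranked = sorted(
        (pid for pid in graph_embeddings if pid != query_id),
        key=lambda pid: cosine(query, graph_embeddings[pid]),
        reverse=True,
    )
    return ranked[pos_start:pos_end], ranked[neg_start:neg_end]


# Toy citation-graph embeddings (made up for illustration only).
toy_embeddings = {
    "query": [1.0, 0.0],
    "a": [0.9, 0.1],
    "b": [0.8, 0.3],
    "c": [0.5, 0.5],
    "d": [0.2, 0.9],
    "e": [-1.0, 0.0],
}

# Ranks 0-1 become positives, rank 2 is skipped as margin, rank 3 is a hard negative.
positives, hard_negatives = sample_neighbors("query", toy_embeddings, (0, 2), (3, 4))
# positives == ["a", "b"], hard_negatives == ["d"]
```

The sampled ids could then feed a triplet-style contrastive objective over the text encoder. Treat that pairing as an assumption here, since the abstract does not name the loss.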

Malte Ostendorff, Nils Rethmeier, Isabelle Augenstein, Bela Gipp, Georg Rehm • 2022

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Citation Recommendation | Scientific Paper Domains Natural science (test) | P@3: 0.575 | 20 |
| Citation Recommendation | Scientific Paper Domains Social science (test) | P@3: 0.634 | 20 |
| Citation Recommendation | Scientific Paper Domains Overall (test) | Precision@3: 57.7 | 20 |
| Node Retrieval | ICLR 2025 (500 papers) | Recall @ 9: 0.157 | 16 |
| Novelty Estimation | AI-Researcher (test) | Pearson R: 0.169 | 15 |
| Reviewer Assignment | LR-Bench | Loss (LR-PC): 0.2354 | 14 |
| Scientific Document Representation Evaluation | SCIDOCS (test) | MAG F1: 81.4 | 13 |
| Transition Retrieval | Transition Retrieval | Recall @ 99 | 7 |
| Citation Recommendation | ACL-200 (test) | Recall@5: 0.1517 | 5 |
| Citation Recommendation | ArSyTa (test) | Recall@5: 16.12 | 5 |
Showing 10 of 13 rows
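For readers unfamiliar with the ranking metrics in the table, here is a minimal sketch of how Precision@k and Recall@k are commonly computed for a single query. The function names and the toy ids are illustrative assumptions, and individual benchmarks may define or average these metrics differently. Note also that some results in the table look like fractions and others like percentages, so scores from different rows should not be compared without checking the scale.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k ranked items that are relevant."""
    top_k = ranked_ids[:k]
    hits = sum(1 for item in top_k if item in relevant_ids)
    return hits / k


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant items that appear in the top-k ranked items."""
    if not relevant_ids:
        return 0.0
    top_k = ranked_ids[:k]
    hits = sum(1 for item in top_k if item in relevant_ids)
    return hits / len(relevant_ids)


# Toy ranking for one query: two of the three relevant papers are retrieved.
ranked = ["p1", "p2", "p3", "p4", "p5"]
relevant = {"p1", "p3", "p9"}

p_at_3 = precision_at_k(ranked, relevant, 3)  # 2 hits in the top 3
r_at_5 = recall_at_k(ranked, relevant, 5)     # 2 of 3 relevant papers found
```

A benchmark score is typically the mean of such per-query values over all test queries, optionally multiplied by 100 when reported as a percentage.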
