
Code Representation Learning At Scale

About

Recent studies have shown that code language models at scale demonstrate significant performance gains on downstream tasks, e.g., code generation. However, most existing work on code representation learning trains models at the hundred-million-parameter scale using very limited pretraining corpora. In this work, we fuel code representation learning with a vast amount of code data via a two-stage pretraining scheme. We first train the encoders via a mix that leverages both randomness in masked language modeling and the structural aspects of programming languages. We then enhance the representations via contrastive learning with hard negatives and hard positives constructed in an unsupervised manner. We establish an off-the-shelf encoder model that consistently outperforms existing models on a wide variety of downstream tasks by large margins. To understand the factors contributing to successful code representation learning, we conduct detailed ablations and share our findings on (i) a customized and effective token-level denoising scheme for source code; (ii) the importance of hard negatives and hard positives; (iii) how the proposed bimodal contrastive learning boosts cross-lingual semantic search performance; and (iv) how the pretraining schemes determine how downstream task performance scales with model size.
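To make the second pretraining stage concrete, the bimodal contrastive objective with hard negatives can be sketched as an InfoNCE-style loss over (text, code) embedding pairs. This is a minimal NumPy illustration, not the paper's implementation: the function name, the temperature value, and the assumption of one pre-mined hard negative per anchor are ours.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Unit-normalize embeddings so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def info_nce_with_hard_negatives(anchors, positives, hard_negatives, temperature=0.05):
    """InfoNCE-style bimodal contrastive loss (illustrative sketch).

    anchors:        (B, D) e.g. natural-language (docstring) embeddings
    positives:      (B, D) matching code embeddings (hard positives)
    hard_negatives: (B, D) one mined hard negative per anchor
    The other in-batch positives also act as (easy) negatives.
    """
    a = l2_normalize(anchors)
    p = l2_normalize(positives)
    n = l2_normalize(hard_negatives)

    # (B, B) in-batch similarities plus a (B, 1) hard-negative column -> (B, B+1).
    logits = np.concatenate([a @ p.T, np.sum(a * n, axis=1, keepdims=True)], axis=1)
    logits /= temperature

    # Numerically stable log-softmax; the true pair sits on the diagonal.
    m = logits.max(axis=1, keepdims=True)
    log_probs = logits - (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True)))

    batch = np.arange(anchors.shape[0])
    return -np.mean(log_probs[batch, batch])

# Toy usage with random embeddings standing in for encoder outputs.
rng = np.random.default_rng(0)
a = rng.normal(size=(4, 16))
p = rng.normal(size=(4, 16))
n = rng.normal(size=(4, 16))
loss = info_nce_with_hard_negatives(a, p, n)
```

Minimizing this loss pulls each text embedding toward its paired code embedding while pushing it away from both the in-batch negatives and the explicitly mined hard negative, which is what makes the hard-negative column matter.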

Dejiao Zhang, Wasi Ahmad, Ming Tan, Hantian Ding, Ramesh Nallapati, Dan Roth, Xiaofei Ma, Bing Xiang · 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| NL2Code Search | CSN (CodeSearchNet) (test) | Recall (Python) | 70.77 | 18 |
| File-level Code Localization | SWE-Bench Lite | Acc@1 | 47.81 | 16 |
| File-level Localization | SWE-Bench-Lite latest (test) | NDCG@1 | 47.81 | 16 |
| Module-level Code Localization | SWE-Bench Lite | Acc@5 | 60.58 | 16 |
| Function-level Code Localization | SWE-Bench Lite | Acc@5 | 33.94 | 16 |
| Function-level Localization | SWE-Bench-Lite latest (test) | NDCG@5 | 27.03 | 16 |
| Module-level Localization | SWE-Bench-Lite latest (test) | NDCG@5 | 49.38 | 16 |
| Code2Code Search | Code2Code Search (test) | Python | 46.7 | 7 |
| NL2Code Search | Adv (test) | MRR | 52.67 | 7 |
| NL2Code Search | CoSQA (test) | MRR | 47.53 | 7 |

(Showing 10 of 11 rows.)
