UniXcoder: Unified Cross-Modal Pre-training for Code Representation

About

Pre-trained models for programming languages have recently demonstrated great success on code intelligence. To support both code-related understanding and generation tasks, recent works attempt to pre-train unified encoder-decoder models. However, such an encoder-decoder framework is sub-optimal for auto-regressive tasks, especially code completion, which requires a decoder-only mode for efficient inference. In this paper, we present UniXcoder, a unified cross-modal pre-trained model for programming languages. The model uses mask attention matrices with prefix adapters to control its behavior, and leverages cross-modal content such as abstract syntax trees (ASTs) and code comments to enhance code representation. To encode an AST, which is a tree, in parallel, we propose a one-to-one mapping method that transforms the AST into a sequence retaining all structural information from the tree. Furthermore, we propose to use multi-modal content to learn representations of code fragments with contrastive learning, and then align representations among programming languages using a cross-modal generation task. We evaluate UniXcoder on five code-related tasks over nine datasets. To further evaluate the quality of code fragment representations, we also construct a dataset for a new task, zero-shot code-to-code search. Results show that our model achieves state-of-the-art performance on most tasks, and analysis reveals that comments and ASTs both enhance UniXcoder.
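
The idea of switching one Transformer between modes through its attention mask can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `build_attention_mask`, the mode strings, and the `source_len` parameter are hypothetical, and the sketch only assumes the common convention that a mask entry of 1 means "row token may attend to column token".

```python
from typing import List


def build_attention_mask(mode: str, seq_len: int, source_len: int = 0) -> List[List[int]]:
    """Return a seq_len x seq_len matrix where mask[i][j] == 1 means
    token i may attend to token j. Names and modes are illustrative only."""
    if mode == "encoder-only":
        # Bidirectional: every token sees every other token.
        return [[1] * seq_len for _ in range(seq_len)]
    if mode == "decoder-only":
        # Causal: token i sees only itself and earlier tokens.
        return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]
    if mode == "encoder-decoder":
        if not 0 <= source_len <= seq_len:
            raise ValueError("source_len must lie between 0 and seq_len")
        mask = []
        for i in range(seq_len):
            row = []
            for j in range(seq_len):
                if j < source_len:
                    # Source tokens are visible to all tokens.
                    row.append(1)
                elif i >= source_len and j <= i:
                    # Target tokens see earlier target tokens and themselves.
                    row.append(1)
                else:
                    row.append(0)
            mask.append(row)
        return mask
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    for row in build_attention_mask("encoder-decoder", seq_len=4, source_len=2):
        print(row)
```

With `seq_len=4` and `source_len=2`, the first two rows allow attention only within the source span, while the last two rows add causal attention over the target span. Under this scheme the same weights serve understanding, completion, and generation tasks, and only the mask changes.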

Daya Guo, Shuai Lu, Nan Duan, Yanlin Wang, Ming Zhou, Jian Yin • 2022

Related benchmarks

Task	Dataset	Metric	Result	
Code Authorship Attribution	CoDET-M4	Accuracy	82.7	43
Code Authorship Attribution	LLMAuthorBench	Accuracy	94.02	43
Vulnerability Detection	BigVul	Precision	92.17	42
Code Summarization	CodeXGLUE	Java Score	20.31	38
Vulnerability Detection	PrimeVul (test)	F1 Score	16.48	38
Vulnerability Detection	PrimeVul	F1 Score	61.55	24
Vulnerability Detection	PrimeVul Paired (test)	Pair-Correct Count	6	22
NL2Code Search	CSN (CodeSearchNet) (test)	Recall (Python)	42.17	18
Vulnerability Detection	PrimeVul Paired (full)	PC Score	1.6	13
Information Retrieval	CoIR (test)	Apps Score	1.4	13

Showing 10 of 37 rows
