UniXcoder: Unified Cross-Modal Pre-training for Code Representation
About
Pre-trained models for programming languages have recently demonstrated great success on code intelligence. To support both code-related understanding and generation tasks, recent works attempt to pre-train unified encoder-decoder models. However, such an encoder-decoder framework is sub-optimal for auto-regressive tasks, especially code completion, which requires a decoder-only manner for efficient inference. In this paper, we present UniXcoder, a unified cross-modal pre-trained model for programming languages. The model utilizes mask attention matrices with prefix adapters to control the behavior of the model and leverages cross-modal contents like ASTs and code comments to enhance code representation. To encode an AST, which is represented as a tree, in parallel, we propose a one-to-one mapping method to transform the AST into a sequence structure that retains all structural information from the tree. Furthermore, we propose to utilize multi-modal contents to learn representations of code fragments with contrastive learning, and then align representations among programming languages using a cross-modal generation task. We evaluate UniXcoder on five code-related tasks over nine datasets. To further evaluate the performance of code fragment representation, we also construct a dataset for a new task, called zero-shot code-to-code search. Results show that our model achieves state-of-the-art performance on most tasks, and analysis reveals that comments and ASTs can both enhance UniXcoder.
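The idea of switching one Transformer between encoder-only, decoder-only, and encoder-decoder behavior through its attention mask can be illustrated with a small sketch. This is not the authors' implementation: the function name `build_attention_mask`, the mode strings, and the 0/1 list-of-lists layout are assumptions made for illustration, and the prefix tokens that select the mode in the real model are omitted.

```python
def build_attention_mask(mode, src_len, tgt_len=0):
    """Return an n x n matrix of 0/1 values (hypothetical helper).

    mask[i][j] == 1 means position i may attend to position j.
    The first src_len positions are the source; the rest are the target.
    """
    n = src_len + tgt_len
    if mode == "encoder-only":
        # Bidirectional: every position sees every other position.
        return [[1] * n for _ in range(n)]
    if mode == "decoder-only":
        # Causal: each position sees itself and earlier positions only.
        return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]
    if mode == "encoder-decoder":
        # Source positions see the whole source but none of the target;
        # target positions see the whole source and the target so far.
        return [
            [1 if (j < src_len or j <= i) else 0 for j in range(n)]
            for i in range(n)
        ]
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    for row in build_attention_mask("encoder-decoder", src_len=2, tgt_len=2):
        print(row)
```

With two source and two target positions, the encoder-decoder mask keeps the source block fully visible while the target block stays causal, which is why a single set of weights can serve understanding, completion, and sequence-to-sequence generation under this scheme.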
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Code Summarization | CodeXGLUE | Java Score | 20.31 | 38 |
| Vulnerability Detection | PrimeVul (test) | F1 Score | 16.48 | 38 |
| Vulnerability Detection | PrimeVul | F1 Score | 61.55 | 24 |
| NL2Code Search | CSN (CodeSearchNet) (test) | Recall (Python) | 42.17 | 18 |
| Vulnerability Detection | PrimeVul Paired (full) | PC Score | 1.6 | 13 |
| Information Retrieval | CoIR (test) | Apps Score | 1.4 | 13 |
| Vulnerability Detection | SVEN (test) | Accuracy | 51.9 | 12 |
| Text-to-Code Retrieval | CodeSearchNet CodeXGLUE | Ruby Score | 74 | 9 |
| Text-to-Code Retrieval | CodeXGLUE AdvTest | MRR | 41.3 | 9 |
| Text-to-Code Retrieval | CosQA CodeXGLUE | MRR | 70.1 | 8 |