VulBERTa: Simplified Source Code Pre-Training for Vulnerability Detection
About
This paper presents VulBERTa, a deep learning approach to detect security vulnerabilities in source code. Our approach pre-trains a RoBERTa model with a custom tokenisation pipeline on real-world code from open-source C/C++ projects. The model learns a deep knowledge representation of the code syntax and semantics, which we leverage to train vulnerability detection classifiers. We evaluate our approach on binary and multi-class vulnerability detection tasks across several datasets (VulDeePecker, Draper, ReVeal and μVulDeePecker) and benchmarks (CodeXGLUE and D2A). The evaluation results show that VulBERTa achieves state-of-the-art performance and outperforms existing approaches across different datasets, despite its conceptual simplicity and its limited cost in terms of training-data size and number of model parameters.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Vulnerability Detection | BigVul | Precision | 81.49 | 42 |
| Vulnerability Detection | CASTLE 250-sample benchmark 1.0 | CASTLE Score | -12 | 29 |
| Vulnerability Detection | SVEN | F1 Score | 0.44 | 14 |
| Vulnerability Detection | VulDeePecker | F1 Score | 95.26 | 12 |
| Vulnerability Detection | Reveal | Accuracy | 84.5 | 12 |
| Vulnerability Detection | Draper | F1 Score | 57.9 | 7 |
| Vulnerability Detection | SVEN (CWE-125, CWE-190, CWE-416, CWE-476) (test) | Accuracy | 50 | 7 |
| Vulnerability Detection | Devign | True Positives (TP) | 638 | 4 |
| Vulnerability Detection | DiverseVul | True Positives (TP) | 176 | 4 |
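
The table reports several metrics (precision, F1 score, accuracy, true positives) that all derive from the same confusion counts of a binary vulnerability classifier. The sketch below shows the standard definitions. The `ConfusionCounts` type, the function names, and the example counts are hypothetical and are not taken from the paper or the benchmarks. The functions return fractions in [0, 1], whereas some rows above appear to be reported as percentages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Confusion counts for a binary classifier (vulnerable = positive)."""
    tp: int  # vulnerable samples correctly flagged
    fp: int  # non-vulnerable samples wrongly flagged
    tn: int  # non-vulnerable samples correctly passed
    fn: int  # vulnerable samples missed


def precision(c: ConfusionCounts) -> float:
    """Fraction of flagged samples that are truly vulnerable."""
    flagged = c.tp + c.fp
    return c.tp / flagged if flagged else 0.0


def recall(c: ConfusionCounts) -> float:
    """Fraction of truly vulnerable samples that were flagged."""
    vulnerable = c.tp + c.fn
    return c.tp / vulnerable if vulnerable else 0.0


def f1_score(c: ConfusionCounts) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def accuracy(c: ConfusionCounts) -> float:
    """Fraction of all samples classified correctly."""
    total = c.tp + c.fp + c.tn + c.fn
    return (c.tp + c.tn) / total if total else 0.0


# Hypothetical, imbalanced test set: 100 vulnerable samples out of 1000.
counts = ConfusionCounts(tp=60, fp=40, tn=860, fn=40)

if __name__ == "__main__":
    print(f"precision={precision(counts):.2f} recall={recall(counts):.2f}")
    print(f"f1={f1_score(counts):.2f} accuracy={accuracy(counts):.2f}")
```

With these hypothetical counts the accuracy is 0.92 while the F1 score is only 0.60, which illustrates why results on imbalanced vulnerability datasets are hard to compare when different rows report different metrics.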