DuCodeMark: Dual-Purpose Code Dataset Watermarking via Style-Aware Watermark-Poison Design

About

The proliferation of large language models for code (CodeLMs) and open-source contributions has heightened concerns over unauthorized use of source code datasets. While watermarking provides a viable protection mechanism by embedding ownership signals, existing methods rely on detectable trigger-target patterns and are limited to source-code tasks, overlooking other scenarios such as decompilation tasks. In this paper, we propose DuCodeMark, a stealthy and robust dual-purpose watermarking method for code datasets that generalizes across both source-code tasks and decompilation tasks. DuCodeMark parses each code sample into an abstract syntax tree (AST), applies language-specific style transformations to construct stealthy trigger-target pairs, and injects repressible poisoned features into a subset of return-typed samples to enhance robustness against watermark removal or evasion. These features remain inactive during normal training but are activated upon watermark removal, degrading model performance. For verification, DuCodeMark employs a black-box method based on the independent-samples $t$-test. We conduct a comprehensive evaluation of DuCodeMark across 72 settings spanning two code tasks, two programming languages, three CodeLMs, and six decoding temperatures. The results demonstrate that it consistently achieves strong verifiability ($p < 0.05$), high stealthiness (suspicious rate $\leq$ 0.36), robustness against both watermark and poisoning attacks (recall $\leq$ 0.57), and a substantial drop in model performance upon watermark removal (Pass@1 drops by 28.6%), underscoring its practicality and resilience.
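The verification step mentioned above relies on an independent-samples $t$-test. The following is a minimal sketch of such a test, not the authors' implementation. It assumes, hypothetically, that each model output has already been scored for how strongly it shows the target style, so that scores from triggered prompts can be compared against scores from untriggered prompts. The score lists, the function names, and the equal-variance (pooled) form of the test are all assumptions made for illustration; the abstract does not specify them.

```python
import math
from statistics import mean, variance


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t: float, df: int, steps: int = 20000) -> float:
    """Two-sided p-value via Simpson's rule on [0, |t|] (steps must be even)."""
    b = abs(t)
    if b == 0.0:
        return 1.0
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = 2 * total * h / 3  # P(-|t| < T < |t|)
    return max(0.0, 1.0 - central_mass)


def independent_t_test(a, b):
    """Pooled-variance independent-samples t-test; returns (t, p)."""
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / df
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    if std_err == 0.0:
        raise ValueError("both samples are constant; t is undefined")
    t = (mean(a) - mean(b)) / std_err
    return t, two_sided_p(t, df)


if __name__ == "__main__":
    # Hypothetical style scores in [0, 1]; not data from the paper.
    triggered = [0.90, 0.80, 0.85, 0.95, 0.90, 0.88]
    untriggered = [0.10, 0.20, 0.15, 0.05, 0.12, 0.10]
    t_stat, p_value = independent_t_test(triggered, untriggered)
    print(f"t = {t_stat:.2f}, p = {p_value:.4g}, watermark flagged: {p_value < 0.05}")
```

With only black-box access, the dataset owner needs nothing beyond the suspect model's outputs: a $p$-value below the 0.05 threshold quoted in the abstract would be read as evidence that the model was trained on the watermarked data. Where SciPy is available, `scipy.stats.ttest_ind` performs the same test.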

Yuchen Chen, Yuan Xiao, Chunrong Fang, Zhenyu Chen, Baowen Xu • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
Code Completion | C code dataset | FPR | 7 | 16
Code Completion | Java code dataset | False Positive Rate (FPR) | 7 | 16
Code Decompilation | C code dataset | FPR | 7 | 16
Code Decompilation | Java code dataset | FPR | 7 | 16
Stealthiness Evaluation (Human Inspection) | Code Review Dataset C | Suspicious Rate | 8 | 10
Stealthiness Evaluation (Human Inspection) | Code Review Dataset Java | Suspicious Rate | 8 | 10
Watermark Robustness Evaluation | DuCodeMark GPT-4o | Accuracy | 100 | 5
Watermark Robustness Evaluation | DuCodeMark CodeLlama-Instruct-7B | Accuracy | 94 | 5
Watermark Robustness Evaluation | DuCodeMark Clang-format | Accuracy | 0.00e+0 | 5
Watermark Robustness Evaluation | DuCodeMark CodeQL | Accuracy | 0.00e+0 | 5
Showing 10 of 13 rows
