Locking Pretrained Weights via Deep Low-Rank Residual Distillation

About

The quality of open-weight language models has improved dramatically in recent years. Sharing weights greatly facilitates model adoption by enabling their use across diverse hardware and software platforms. Open weights also allow for more open research and testing: users can use them as checkpoints, fine-tune them for their needs, and potentially redistribute them. In some cases, however, concerns that these weights could be modified for unauthorized uses may outweigh the benefits of granting users such freedom. Defending against such adaptation is non-trivial. Because an adaptive attacker can, by definition, observe all weights and the architecture, they can reverse simple structural defenses and use optimization to defeat the simplest locking mechanisms. In this work, we exploit the inference-training asymmetry of automatic differentiation as a novel defense axis. We propose DLR-Lock, a method in which the model's purveyor purposely replaces each pretrained MLP in the model with a deep low-rank residual network (DLR-Net) of comparable parameter count. This forces activation memory that grows linearly with depth during backpropagation. DLR-Nets are efficiently trained via module-wise distillation. We show that, beyond this memory overhead, DLR-Lock introduces architectural mismatches that complicate the optimization landscape of standard fine-tuning. It also produces a backward pass that incurs disproportionately more overhead than the forward pass. Our defense withstands adaptive attackers with full knowledge of the defense strategy while preserving the original model's capabilities. Experiments on LLMs validate these claims.

Keitaro Sakamoto, Pierre Ablin, Federico Danieli, Marco Cuturi • 2026
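The abstract does not spell out the exact DLR-Net block, so the following is only a minimal NumPy sketch built on several assumptions. Each block is assumed to have the form x + relu(x A^T) B^T with rank r. The teacher is assumed to be a two-matrix ReLU MLP with hidden width h. The distillation target is assumed to be the MLP sublayer output including its skip connection. Under these assumptions, choosing depth * r = h matches the MLP's parameter count (biases ignored). The sketch also shows why backpropagation needs activation memory that is linear in depth. All function names (`init_dlr`, `dlr_forward`, `dlr_backward`, `distill`, and so on) are hypothetical illustrations, not the authors' code.

```python
import numpy as np

# Hypothetical sketch: the block form, init, and training loop are assumptions,
# not the paper's exact DLR-Net or distillation recipe.


def mlp_param_count(d, h):
    # Two weight matrices (d -> h -> d); biases ignored.
    return 2 * d * h


def dlr_param_count(d, r, depth):
    # Each low-rank residual block has A (r x d) and B (d x r).
    return 2 * d * r * depth


def teacher_mlp(x, W1, W2):
    # Frozen "pretrained" MLP (assumed ReLU): relu(x W1^T) W2^T.
    return np.maximum(x @ W1.T, 0.0) @ W2.T


def init_dlr(d, r, depth, rng):
    # B starts at zero, so every block is initially the identity map.
    return [(rng.standard_normal((r, d)) / np.sqrt(d), np.zeros((d, r)))
            for _ in range(depth)]


def dlr_forward(x, blocks):
    # x_{l+1} = x_l + relu(x_l A_l^T) B_l^T for l = 0..depth-1.
    # cache keeps (x_l, z_l) for every block: the activations backprop needs,
    # which grow linearly with depth.
    cache = []
    for A, B in blocks:
        z = x @ A.T
        cache.append((x, z))
        x = x + np.maximum(z, 0.0) @ B.T
    return x, cache


def dlr_backward(g, blocks, cache):
    # Reverse-mode pass; g = dLoss/d(output). Returns [(dA_l, dB_l)] in block order.
    grads = []
    for (A, B), (x, z) in zip(reversed(blocks), reversed(cache)):
        dB = g.T @ np.maximum(z, 0.0)
        dz = (g @ B) * (z > 0)
        dA = dz.T @ x
        g = g + dz @ A  # skip path plus low-rank branch
        grads.append((dA, dB))
    return grads[::-1]


def distill(x, target, blocks, lr=0.02, steps=300):
    # Module-wise distillation: full-batch gradient descent on
    # 0.5 * mean over samples of ||DLR(x) - target||^2.
    n = x.shape[0]
    losses = []
    for _ in range(steps):
        out, cache = dlr_forward(x, blocks)
        err = out - target
        losses.append(0.5 * np.sum(err ** 2) / n)
        grads = dlr_backward(err / n, blocks, cache)
        blocks = [(A - lr * dA, B - lr * dB)
                  for (A, B), (dA, dB) in zip(blocks, grads)]
    return blocks, losses


rng = np.random.default_rng(0)
d, h, r, depth, n = 16, 64, 4, 16, 256
assert mlp_param_count(d, h) == dlr_param_count(d, r, depth)  # depth * r == h

W1 = rng.standard_normal((h, d)) / np.sqrt(d)
W2 = rng.standard_normal((d, h)) / np.sqrt(h)
x = rng.standard_normal((n, d))
# Assumption: distill the whole MLP sublayer, including its skip connection.
target = x + teacher_mlp(x, W1, W2)

blocks, losses = distill(x, target, init_dlr(d, r, depth, rng))
print(f"distillation loss: {losses[0]:.3f} -> {losses[-1]:.3f}")

_, cache = dlr_forward(x, blocks)
print(sum(xl.size + zl.size for xl, zl in cache))  # 256 * (16 + 4) * 16 = 81920
```

In this sketch, the `cache` list is what makes training costly. It keeps an (n, d + r) slice for every block, so doubling the depth at a fixed parameter budget doubles the stored activations. Plain inference can drop each block's input as soon as the next block has been computed.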

Related benchmarks

Task | Dataset | Metric | Result | Rank
Commonsense Reasoning | WinoGrande | Accuracy | 55.2 | 1442
Question Answering | ARC Challenge | Accuracy (ARC) | 29 | 598
Multi-task Language Understanding | MMLU | MMLU Accuracy | 36.8 | 442
Commonsense Reasoning | PIQA | Accuracy | 65.3 | 213
Question Answering | ARC Easy | Accuracy | 50.7 | 210
Question Answering | BoolQ | Accuracy | 63.8 | 201
Language Modeling | WikiText-103 | Perplexity | 23.4 | 17
Language Modeling | Nemotron | Perplexity | 14.9 | 3
