Locking Pretrained Weights via Deep Low-Rank Residual Distillation

About

The quality of open-weight language models has improved dramatically in recent years. Sharing weights greatly facilitates model adoption by enabling their use across diverse hardware and software platforms. It also allows for more open research and testing, since users can treat the weights as checkpoints, fine-tune them to their needs, and potentially redistribute them. In some cases, however, concerns about these weights being modified for unauthorized uses may outweigh the benefits of giving users such freedom. Defending against such adaptation is non-trivial: since an adaptive attacker can, by definition, observe all weights and architectures, they can reverse simple structural defenses and use optimization to defeat the simplest locking mechanisms. In this work, we exploit the inference-training asymmetry of automatic differentiation as a novel defense axis. We propose DLR-Lock, a method in which the provider of the model deliberately replaces each pretrained MLP in the model with a deep low-rank residual network (DLR-Net) of comparable parameter count, forcing activation memory that grows linearly with depth during backpropagation. DLR-Nets are efficiently trained via module-wise distillation. We show that, beyond this memory overhead, DLR-Lock results in architectural mismatches that complicate the optimization landscape of standard fine-tuning, and a backward pass that incurs disproportionately more overhead than the forward pass. Our defense withstands adaptive attackers with full knowledge of the defense strategy while preserving the original model's capabilities. Experiments on LLMs validate these claims.
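
To make the parameter-count and memory argument concrete, the following is a rough back-of-the-envelope sketch, not the authors' implementation. It assumes a hypothetical DLR-Net layer of the form x + U·act(V·x) with rank r, ignores biases, and assumes that backpropagation stores each layer's input and its rank-r pre-activation per token. The abstract does not specify the layer form, the sizes, or the stored tensors, so all function names and numbers here are illustrative only.

```python
# Illustrative arithmetic only: layer form, sizes, and stored tensors are assumptions.

def mlp_params(d_model: int, d_hidden: int) -> int:
    """Weights of a two-matrix MLP (d_model -> d_hidden -> d_model), biases ignored."""
    return 2 * d_model * d_hidden


def dlr_params(d_model: int, rank: int, depth: int) -> int:
    """Weights of `depth` low-rank residual layers, each with a (rank x d_model)
    down-projection and a (d_model x rank) up-projection (assumed layer form)."""
    return depth * 2 * d_model * rank


def matched_depth(d_model: int, d_hidden: int, rank: int) -> int:
    """Depth at which the low-rank stack reaches the MLP's parameter count."""
    return mlp_params(d_model, d_hidden) // (2 * d_model * rank)


def mlp_saved_activations(d_model: int, d_hidden: int) -> int:
    """Per-token values kept for the backward pass: MLP input plus hidden pre-activation."""
    return d_model + d_hidden


def dlr_saved_activations(d_model: int, rank: int, depth: int) -> int:
    """Per-token values kept for the backward pass: every layer's input plus its
    rank-sized pre-activation, so the total grows linearly with depth."""
    return depth * (d_model + rank)


# Hypothetical sizes chosen only for illustration.
d_model, d_hidden, rank = 1024, 4096, 64
depth = matched_depth(d_model, d_hidden, rank)

print("matched depth:", depth)
print("MLP params:", mlp_params(d_model, d_hidden))
print("DLR params:", dlr_params(d_model, rank, depth))
print("MLP saved activations per token:", mlp_saved_activations(d_model, d_hidden))
print("DLR saved activations per token:", dlr_saved_activations(d_model, rank, depth))
```

Under these assumptions the two modules hold the same number of weights, while the deep stack must retain many more per-token activations for the backward pass. Inference needs none of these stored tensors, which is the asymmetry the abstract refers to.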

Keitaro Sakamoto, Pierre Ablin, Federico Danieli, Marco Cuturi • 2026

Related benchmarks

Task	Dataset	Metric	Value	
Commonsense Reasoning	WinoGrande	Accuracy	55.2	1442
Question Answering	ARC Challenge	Accuracy (ARC)	29	598
Multi-task Language Understanding	MMLU	MMLU Accuracy	36.8	442
Commonsense Reasoning	PIQA	Accuracy	65.3	213
Question Answering	ARC Easy	Accuracy	50.7	210
Question Answering	BoolQ	Accuracy	63.8	201
Language Modeling	WikiText-103	Perplexity	23.4	17
Language Modeling	Nemotron	Perplexity	14.9	3

