
GradSentry: Gradient Spectral Entropy for Backdoor Sample Filtering in Large Language Model Fine-Tuning

About

Fine-tuning Large Language Models on untrusted data exposes them to backdoor attacks, in which poisoned samples cause targeted misbehavior. Existing sample-filtering defenses rely on clustering, which requires sufficient data and can fail at extreme poison ratios. We propose GradSentry (Gradient Sentry), a backdoor sample filtering method based on the spectral entropy of per-sample gradients. Our key finding is that poisoned samples produce gradients with higher spectral entropy than clean samples. GradSentry captures output-altering backdoor signatures from per-sample gradient spectra, avoiding pairwise sample comparisons and clustering during feature construction. Importantly, our method is training-agnostic: it works with both parameter-efficient fine-tuning methods such as LoRA and full-parameter tuning, because the gradient analysis is independent of which parameters are updated during training. GradSentry requires no clustering, operates effectively across poison ratios from 1% to 90%, and introduces minimal computational overhead (20-50 ms per sample for a 7B model). Evaluation on four QA datasets and four attack types demonstrates the effectiveness of spectral entropy for backdoor detection. Code is available at https://github.com/dongdongzhaoUP/GradSentry.
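
To make the idea of spectral entropy concrete, the following is a minimal sketch and not the authors' implementation. It assumes the "spectrum" is the set of singular values of a single 2-D gradient matrix, normalized to sum to one, and that samples are flagged by a fixed cutoff. The abstract specifies neither the spectrum definition, the layer used, nor the decision rule, so the function names `spectral_entropy` and `flag_high_entropy` and the `threshold` parameter are all hypothetical.

```python
import numpy as np


def spectral_entropy(grad: np.ndarray) -> float:
    """Shannon entropy of the normalized singular-value spectrum of a 2-D gradient.

    Assumption: the spectrum is the singular values of one gradient matrix.
    The paper may define the spectrum differently.
    """
    # Singular values only; the singular vectors are not needed.
    s = np.linalg.svd(grad, compute_uv=False)
    total = s.sum()
    if total == 0.0:
        # An all-zero gradient has no spectrum to measure.
        return 0.0
    p = s / total
    # Drop exact zeros so that log(0) is never evaluated.
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())


def flag_high_entropy(entropies, threshold: float) -> np.ndarray:
    """Boolean mask of samples whose entropy exceeds a cutoff.

    Assumption: a fixed threshold stands in for whatever rule the paper uses.
    """
    return np.asarray(entropies) > threshold


if __name__ == "__main__":
    # A rank-1 gradient concentrates its spectrum in one singular value.
    low = spectral_entropy(np.outer([1.0, 2.0, 3.0], [1.0, -1.0, 0.5]))
    # An identity matrix spreads its spectrum evenly across all singular values.
    high = spectral_entropy(np.eye(4))
    print(low, high, flag_high_entropy([low, high], threshold=0.5))
```

Under this definition, entropy is near zero when the gradient is dominated by one direction and reaches its maximum, the logarithm of the number of singular values, when the spectrum is flat. Each score depends only on one sample's gradient, which is consistent with the abstract's point that no pairwise comparison or clustering is needed.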

Haodong Zhao, Tianyi Xu, Tianhang Zhao, Zhuosheng Zhang, Gongshen Liu • 2026

Related benchmarks

Task                             Dataset      Metric     Result    Rank
Question Answering               CoQA         CACC       74.9      64
Question Answering               FreebaseQA   ASR        0.00e+0   64
Question Answering               WebQA        ASR        0.00e+0   64
Poisoned sample identification   WebQA        Recall     100       12
Poisoned sample identification   FreebaseQA   Recall     100       12
Poisoned sample identification   CoQA         Recall     100       12
Poisoned sample identification   NQ           Recall     100       12
Clean sample identification      WebQA        Accuracy   89.36     3
Clean sample identification      FreebaseQA   Accuracy   99.94     3
Clean sample identification      CoQA         Accuracy   99.94     3
Showing 10 of 11 rows
