
Models Know Their Shortcuts: Deployment-Time Shortcut Mitigation

About

Pretrained language models often rely on superficial features that appear predictive during training yet fail to generalize at test time, a phenomenon known as shortcut learning. Existing mitigation methods generally operate at training time and require heavy supervision, such as access to the original training data or prior knowledge of the shortcut type. We propose Shortcut Guardrail, a deployment-time framework that mitigates token-level shortcuts without access to the original training data or shortcut annotations. Our key insight is that gradient-based attribution on a biased model highlights shortcut tokens. Building on this finding, we train a lightweight LoRA-based debiasing module with a Masked Contrastive Learning (MaskCL) objective that encourages representations to remain consistent whether or not individual tokens are present. Across sentiment classification, toxicity detection, and natural language inference, under both naturally occurring and controlled shortcuts, Shortcut Guardrail improves overall accuracy and worst-group accuracy over the unmitigated model under distribution shifts while preserving in-distribution performance.
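The core loop described above — use gradient-based attribution to flag a candidate shortcut token, then penalize representation drift when that token is removed — can be sketched in a toy form. The following is a minimal illustration, not the authors' implementation: it assumes a linear scorer over mean-pooled token embeddings (so the gradient-based saliency has a closed form) and uses one minus cosine similarity as a stand-in for the MaskCL consistency term. All function names here are hypothetical.

```python
import numpy as np

def token_saliency(emb, w):
    # For a linear scorer s = sum_i emb[i] @ w, the gradient of s w.r.t.
    # token i's embedding is w, so a gradient-times-input saliency
    # reduces to |emb[i] @ w| per token.
    return np.abs(emb @ w)

def mean_pool(emb, keep_idx):
    # Sentence representation = mean of the kept token embeddings.
    return emb[keep_idx].mean(axis=0)

def maskcl_loss(emb, w):
    """Toy MaskCL-style consistency loss for one example.

    emb: (num_tokens, dim) token embeddings
    w:   (dim,) weight vector of the (biased) linear scorer
    Returns (loss, index of the masked candidate shortcut token).
    """
    # 1) Attribution: the most salient token is the shortcut candidate.
    shortcut = int(np.argmax(token_saliency(emb, w)))
    # 2) Representations with and without that token.
    all_idx = np.arange(len(emb))
    z_full = mean_pool(emb, all_idx)
    z_mask = mean_pool(emb, np.delete(all_idx, shortcut))
    # 3) Consistency penalty: 1 - cosine similarity, so the loss is 0
    #    when masking the token does not move the representation.
    cos = z_full @ z_mask / (
        np.linalg.norm(z_full) * np.linalg.norm(z_mask) + 1e-8
    )
    return 1.0 - cos, shortcut
```

In a real setup this loss would be backpropagated only through lightweight LoRA adapter parameters, with saliency computed by actual gradients through the frozen biased model rather than the closed form used here.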

Jiayi Li, Shijie Tang, Gün Kaynar, Shiyi Du, Carl Kingsford • 2026

Related benchmarks

Task | Dataset | Result | Rank
Sentiment Classification | SST2 (test) | Accuracy 91.9 | 233
Natural Language Inference | MultiNLI (test) | -- | 81
Toxicity Detection | CivilComments (test) | WGA 74.8 | 14
Emotion Classification | GoEmo-ST | Accuracy 62.7 | 5
Natural Language Inference | MultiNLI controlled shortcut injection | Accuracy 32.3 | 5
Sentiment Analysis | Yelp-ST | Accuracy 48.8 | 5
Sentiment Analysis | Yelp-Syn | Accuracy 53.0 | 5
Text Classification | CivilComments controlled shortcut injection | Accuracy 57.2 | 5
Natural Language Inference | MultiNLI reconstructed with controlled shortcut injection (test) | MSTPS 0.381 | 5
Emotion Classification | GoEmo-Syn | Accuracy 60.7 | 5
Showing 10 of 15 rows
