
When Benchmarks Leak: Inference-Time Decontamination for LLMs

About

Benchmark-based evaluation is the de facto standard for comparing large language models (LLMs). However, its reliability is increasingly threatened by test set contamination, where test samples or their close variants leak into training data and artificially inflate reported performance. To address this issue, prior work has explored two main lines of mitigation. One line attempts to identify and remove contaminated benchmark items before evaluation, but this inevitably alters the evaluation set itself and becomes unreliable when contamination is moderate or severe. The other line preserves the benchmark and instead suppresses contaminated behavior at evaluation time; however, such interventions often interfere with normal inference and lead to noticeable performance degradation on clean inputs. We propose DeconIEP, a decontamination framework that operates entirely during evaluation by applying small, bounded perturbations in the input embedding space. Guided by a relatively less-contaminated reference model, DeconIEP learns an instance-adaptive perturbation generator that steers the evaluated model away from memorization-driven shortcut pathways. Across multiple open-weight LLMs and benchmarks, extensive empirical results show that DeconIEP achieves strong decontamination effectiveness while incurring only minimal degradation in benign utility.
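The core mechanism the abstract describes — an instance-adaptive generator that adds small, bounded perturbations to the input embeddings at evaluation time — can be sketched minimally. The generator below is a toy linear map (the matrix `W`, the dimensions, and the epsilon value are all illustrative assumptions, not the paper's actual architecture); the essential property shown is that tanh-squashing keeps every component of the perturbation inside a fixed bound, so the intervention stays small on clean inputs.

```python
import numpy as np

def bounded_perturbation(embeddings, W, eps=0.05):
    """Instance-adaptive embedding perturbation (sketch).

    A hypothetical linear generator W maps each token embedding to a
    raw delta; tanh-squashing then bounds every component of the delta
    to the interval [-eps, eps] before it is added back.
    """
    raw = embeddings @ W          # per-token delta, depends on the instance
    delta = eps * np.tanh(raw)    # hard bound: |delta| <= eps everywhere
    return embeddings + delta

# Toy example: 4 tokens with 8-dimensional embeddings (shapes assumed).
rng = np.random.default_rng(0)
E = rng.normal(size=(4, 8))
W = rng.normal(scale=0.5, size=(8, 8))   # stand-in generator weights
E_pert = bounded_perturbation(E, W)
assert np.max(np.abs(E_pert - E)) <= 0.05   # perturbation stays bounded
```

In the full method the generator would be trained against a less-contaminated reference model (e.g. by matching its output distribution) rather than fixed at random; this sketch only illustrates the bounded, per-instance nature of the intervention.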

Jianzhe Chai, Yu Zhe, Jun Sakuma • 2026

Related benchmarks

Task                              | Dataset                               | Result         | Rank
Language Understanding            | MMLU (Exact split, o=1)               | Accuracy: 73   | 42
Multitask Language Understanding  | MMLU (Exact split, o=3)               | Accuracy: 78.5 | 42
Language Understanding            | MMLU (Semantic-level split, o=1)      | Accuracy: 72.5 | 21
Question Answering                | TruthfulQA (Domain-level split, o=1)  | Accuracy: 65.5 | 21
Question Answering                | TruthfulQA (Semantic-level split, o=3)| Accuracy: 63.2 | 21
Question Answering                | TruthfulQA (Domain-level split, o=3)  | Accuracy: 64.1 | 21
Question Answering                | TruthfulQA (Semantic-level split, o=1)| Accuracy: 66.4 | 21
Question Answering                | TruthfulQA (Exact split, o=1)         | Accuracy: 61.8 | 21
Multitask Language Understanding  | MMLU (Semantic-level split, o=3)      | Accuracy: 63.2 | 21
Question Answering                | TruthfulQA (Exact split, o=3)         | Accuracy: 62   | 21

(Showing 10 of 14 rows.)
