All Leaks Count, Some Count More: Interpretable Temporal Contamination Detection and Mitigation in LLM Backtesting

About

Backtesting LLMs on resolved events assumes that models reason only from pre-cutoff knowledge, yet pretrained models inevitably leak post-cutoff knowledge. We introduce a claim-level evaluation framework that decomposes prediction rationales into atomic claims and applies Shapley values to quantify each claim's decision impact, yielding Shapley-DCLR (Shapley-weighted Decision-Critical Leakage Rate), an interpretable metric that measures what fraction of decision-driving reasoning is contaminated. We further propose TimeSPEC (Time-Supervised Prediction with Extracted Claims), an inference-time architecture that interleaves temporally filtered retrieval with claim-level supervision, producing predictions grounded entirely in pre-cutoff evidence. Across three LLMs, ablation experiments confirm that retrieval and supervision are jointly necessary, and a three-task probe further illustrates that the performance cost of temporal enforcement scales with each task's reliance on post-cutoff information.

Zeyu Zhang, Ryan Chen, Bradly C. Stadie • 2026
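
To make the idea behind Shapley-DCLR concrete, here is a minimal Python sketch. The abstract does not give the exact formula, so everything below is an assumption for illustration: the value function `toy_value` (a stand-in for "how strongly a subset of claims supports the decision"), the use of absolute Shapley values as weights, and the function names `shapley_values` and `shapley_dclr`. It should not be read as the authors' implementation.

```python
from itertools import combinations
from math import factorial


def shapley_values(n, value):
    """Exact Shapley values for n claims.

    `value` maps a frozenset of claim indices to a float. Exact
    enumeration is exponential in n, so this only suits small n.
    """
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):  # k = size of the coalition without claim i
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of claim i to coalition s
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def shapley_dclr(phi, leaked):
    """Share of total |Shapley| mass carried by claims flagged as leaked.

    Assumption: weights are absolute Shapley values, and a rationale
    with zero total mass scores 0.0.
    """
    total = sum(abs(p) for p in phi)
    if total == 0:
        return 0.0
    return sum(abs(p) for p, is_leaked in zip(phi, leaked) if is_leaked) / total


# Hypothetical additive value function: each claim adds a fixed amount
# of support for the decision (illustrative numbers, not from the paper).
CONTRIBUTIONS = [0.5, 0.3, 0.2]


def toy_value(claims):
    return sum(CONTRIBUTIONS[i] for i in claims)


phi = shapley_values(len(CONTRIBUTIONS), toy_value)
# Suppose only the first claim relies on post-cutoff information.
score = shapley_dclr(phi, [True, False, False])
print(score)
```

With an additive value function, each claim's Shapley value equals its own contribution, so the score here is approximately 0.5: half of the decision-driving weight sits on the single leaked claim. A real setup would replace `toy_value` with queries to the model on subsets of claims and would likely need a sampling approximation once the number of claims grows.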

Related benchmarks

Task	Dataset	Result
Stock Ranking	Stock Ranking	OLR 1.7	13
Legal Outcome	Legal Outcome	OLR 0.8	10
Salary Prediction	Salary Prediction	OLR 2.6	10
Legal Prediction	Legal	BS 0.228	3
Salary Prediction	Salary Prediction	OLR 5.3	3
Legal Prediction	Legal Prediction	OLR 0.008	3
Salary Prediction	Salary	Relative Error 37.9	3
Stock Ranking	Stock	Spearman's ρ 0.167	3

