
All Leaks Count, Some Count More: Interpretable Temporal Contamination Detection and Mitigation in LLM Backtesting

About

Backtesting LLMs on resolved events assumes that models reason only from pre-cutoff knowledge, yet pretrained models inevitably leak post-cutoff knowledge. We introduce a claim-level evaluation framework that decomposes prediction rationales into atomic claims and applies Shapley values to quantify each claim's decision impact, yielding Shapley-DCLR (Shapley-weighted Decision-Critical Leakage Rate), an interpretable metric that measures what fraction of decision-driving reasoning is contaminated. We further propose TimeSPEC (Time-Supervised Prediction with Extracted Claims), an inference-time architecture that interleaves temporally filtered retrieval with claim-level supervision, producing predictions grounded entirely in pre-cutoff evidence. Across three LLMs, ablation experiments confirm that retrieval and supervision are jointly necessary, and a three-task probe further illustrates that the performance cost of temporal enforcement scales with each task's reliance on post-cutoff information.

Zeyu Zhang, Ryan Chen, Bradly C. Stadie • 2026
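
The abstract's idea of weighting leaked claims by their decision impact can be illustrated with a minimal sketch. The exact definition of Shapley-DCLR is not given on this page, so everything below is an assumption made for illustration: the function names, the toy additive value function, and the choice to define the rate as the share of total absolute Shapley mass carried by leaked claims are hypothetical and may differ from the authors' formulation.

```python
from itertools import combinations
from math import factorial
from typing import Callable, FrozenSet, List


def shapley_values(n: int, value: Callable[[FrozenSet[int]], float]) -> List[float]:
    """Exact Shapley value of each of n claims under a coalition value function."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Standard Shapley weight for a coalition of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of claim i when added to coalition s.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def shapley_weighted_leakage(phi: List[float], leaked: List[bool]) -> float:
    """Assumed form: fraction of absolute Shapley mass held by leaked claims."""
    total = sum(abs(p) for p in phi)
    if total == 0.0:
        return 0.0  # No claim influences the decision, so nothing can leak into it.
    leaked_mass = sum(abs(p) for p, is_leaked in zip(phi, leaked) if is_leaked)
    return leaked_mass / total


# Hypothetical toy setup: three claims with additive effects on a decision score.
CONTRIB = [0.5, 0.3, 0.2]


def toy_value(claims: FrozenSet[int]) -> float:
    """Toy value function: the decision score is the sum of included claims' effects."""
    return sum(CONTRIB[i] for i in claims)


phi = shapley_values(len(CONTRIB), toy_value)
leaked = [True, False, False]  # Assume only the first claim uses post-cutoff knowledge.
rate = shapley_weighted_leakage(phi, leaked)
unweighted_rate = sum(leaked) / len(leaked)
```

In this toy case one of three claims is leaked, so an unweighted count gives a leakage rate of 1/3, while the Shapley-weighted rate is 0.5 because the leaked claim carries half of the decision impact. Exact enumeration is exponential in the number of claims, so a sketch like this only suits rationales with a handful of claims.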

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Stock Ranking | Stock Ranking | OLR 1.7 | 13 |
| Legal Outcome | Legal Outcome | OLR 0.8 | 10 |
| Salary Prediction | Salary Prediction | OLR 2.6 | 10 |
| Legal Prediction | Legal | BS 0.228 | 3 |
| Salary Prediction | Salary Prediction | OLR 5.3 | 3 |
| Legal Prediction | Legal Prediction | OLR 0.008 | 3 |
| Salary Prediction | Salary | Relative Error 37.9 | 3 |
| Stock Ranking | Stock | Spearman's ρ 0.167 | 3 |
