All Leaks Count, Some Count More: Interpretable Temporal Contamination Detection and Mitigation in LLM Backtesting
About
Backtesting LLMs on resolved events assumes that models reason only from pre-cutoff knowledge, yet pretrained models inevitably leak post-cutoff knowledge. We introduce a claim-level evaluation framework that decomposes prediction rationales into atomic claims and applies Shapley values to quantify each claim's decision impact, yielding Shapley-DCLR (Shapley-weighted Decision-Critical Leakage Rate), an interpretable metric that measures what fraction of decision-driving reasoning is contaminated. We further propose TimeSPEC (Time-Supervised Prediction with Extracted Claims), an inference-time architecture that interleaves temporally filtered retrieval with claim-level supervision, producing predictions grounded entirely in pre-cutoff evidence. Across three LLMs, ablation experiments confirm that retrieval and supervision are jointly necessary, and a three-task probe further illustrates that the performance cost of temporal enforcement scales with each task's reliance on post-cutoff information.
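To make the idea of a Shapley-weighted leakage rate concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each atomic claim carries a binary leaked flag, that a hypothetical `value` function scores the prediction supported by any subset of claims, and that the metric is the share of absolute Shapley mass held by leaked claims. The abstract does not give the exact formula, so the names `shapley_values`, `shapley_weighted_leakage`, and the toy additive value function are all illustrative assumptions.

```python
from itertools import combinations
from math import factorial


def shapley_values(n, value):
    """Exact Shapley values for n claims.

    `value` maps a frozenset of claim indices to a float, e.g. the
    model's confidence when only those claims are kept (hypothetical).
    """
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Standard Shapley weight for coalitions of size k.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def shapley_weighted_leakage(phi, leaked):
    """Share of absolute Shapley mass carried by leaked claims (assumed form)."""
    total = sum(abs(p) for p in phi)
    if total == 0:
        return 0.0
    return sum(abs(p) for p, is_leaked in zip(phi, leaked) if is_leaked) / total


# Toy example: three claims with additive contributions (an assumption
# chosen so the Shapley values equal the contributions themselves).
contributions = [0.5, 0.25, 0.25]
leaked = [True, False, False]  # only the first claim uses post-cutoff knowledge


def toy_value(subset):
    return sum(contributions[i] for i in subset)


phi = shapley_values(len(contributions), toy_value)
score = shapley_weighted_leakage(phi, leaked)
print(phi, score)
```

In this toy case one of three claims is leaked, so an unweighted leakage rate would be one third, while the weighted score is about one half because the leaked claim drives half of the decision. Exact enumeration is exponential in the number of claims, so longer rationales would need a sampling-based approximation.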
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Stock Ranking | Stock Ranking | OLR | 1.7 | 13 |
| Legal Outcome | Legal Outcome | OLR | 0.8 | 10 |
| Salary Prediction | Salary Prediction | OLR | 2.6 | 10 |
| Legal Prediction | Legal | BS | 0.228 | 3 |
| Salary Prediction | Salary Prediction | OLR | 5.3 | 3 |
| Legal Prediction | Legal Prediction | OLR | 0.008 | 3 |
| Salary Prediction | Salary | Relative Error | 37.9 | 3 |
| Stock Ranking | Stock | Spearman's ρ | 0.167 | 3 |