
Training-Free Test-Time Contrastive Learning for Large Language Models

About

Large language models (LLMs) demonstrate strong reasoning capabilities, but their performance often degrades under distribution shift. Existing test-time adaptation (TTA) methods rely on gradient-based updates that require white-box access and incur substantial overhead, while training-free alternatives are either static or depend on external guidance. In this paper, we propose Training-Free Test-Time Contrastive Learning (TF-TTCL), a training-free adaptation framework that enables a frozen LLM to improve online by distilling supervision from its own inference experiences. Specifically, TF-TTCL implements a dynamic "Explore-Reflect-Steer" loop through three core modules: 1) Semantic Query Augmentation first diversifies problem views via multi-agent role-playing to generate different reasoning trajectories; 2) Contrastive Experience Distillation then captures the semantic gap between superior and inferior trajectories and distills it into explicit textual rules; and 3) Contextual Rule Retrieval finally activates these stored rules during inference to dynamically steer the frozen LLM toward robust reasoning patterns while avoiding observed errors. Extensive experiments on closed-ended reasoning tasks and open-ended evaluation tasks demonstrate that TF-TTCL consistently outperforms strong zero-shot baselines and representative TTA methods under online evaluation. Code is available at https://github.com/KevinSCUTer/TF-TTCL.

Kaiwen Zheng, Kai Zhou, Jinwu Hu, Te Gu, Mingkai Peng, Fei Liu • 2026
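
The "Explore-Reflect-Steer" loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation (see the linked repository for that); it only mirrors the three modules at a high level. Everything named here is an assumption made for illustration: the `LLM` callable interface, the `ROLES` list, the prompt wording, the word-overlap retrieval in `RuleStore`, and the user-supplied `score` function that ranks trajectories. The abstract does not say how superior and inferior trajectories are identified or how rules are retrieved.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set, Tuple

# Hypothetical interface: a frozen LLM is any function from prompt to text.
LLM = Callable[[str], str]

# Illustrative personas for role-playing; not taken from the paper.
ROLES = ["a careful mathematician", "a skeptical reviewer", "a step-by-step tutor"]


def tokens(text: str) -> Set[str]:
    """Lowercased whitespace tokens, used only for the toy similarity below."""
    return set(text.lower().split())


@dataclass
class RuleStore:
    """Stores (query, rule) pairs. Retrieval uses Jaccard word overlap,
    which is an assumption standing in for the paper's retrieval module."""
    items: List[Tuple[str, str]] = field(default_factory=list)

    def add(self, query: str, rule: str) -> None:
        self.items.append((query, rule))

    def retrieve(self, query: str, k: int = 2) -> List[str]:
        q = tokens(query)

        def sim(item: Tuple[str, str]) -> float:
            t = tokens(item[0])
            union = q | t
            return len(q & t) / len(union) if union else 0.0

        ranked = sorted(self.items, key=sim, reverse=True)
        return [rule for _, rule in ranked[:k]]  # top-k, no similarity threshold


def explore(llm: LLM, query: str, rules: List[str]) -> List[str]:
    """One trajectory per role, each prompt prefixed with the retrieved rules."""
    hint = "".join(f"Rule: {r}\n" for r in rules)
    return [llm(f"{hint}You are {role}. Solve: {query}") for role in ROLES]


def reflect(llm: LLM, query: str, trajectories: List[str],
            score: Callable[[str], float]) -> str:
    """Ask the model to verbalize the gap between the best- and worst-scored
    trajectories. `score` is a hypothetical, user-supplied ranking function."""
    ranked = sorted(trajectories, key=score, reverse=True)
    best, worst = ranked[0], ranked[-1]
    return llm(
        f"Question: {query}\nBetter answer: {best}\nWorse answer: {worst}\n"
        "State one rule explaining the difference."
    )


def step(llm: LLM, store: RuleStore, query: str,
         score: Callable[[str], float]) -> str:
    """Process one online query: steer with rules learned from earlier
    queries, explore, then reflect and store a new rule for later queries."""
    rules = store.retrieve(query)
    trajectories = explore(llm, query, rules)
    store.add(query, reflect(llm, query, trajectories, score))
    return max(trajectories, key=score)
```

In this sketch no model weights are touched: the only state that changes between queries is the text held in `RuleStore`, which is the sense in which the adaptation is training-free. Calling `step` repeatedly over a stream of queries reproduces the online setting, with each call able to use rules distilled from the calls before it.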

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Reasoning | GSM8K | -- | 106 |
| Reasoning | MATH 500 | Accuracy (%): 54 | 90 |
| Mathematical Reasoning | Minerva | Accuracy (Acc): 24.63 | 62 |
| Reasoning | AIME 24 | Accuracy on AIME 24: 83.33 | 49 |
| Text Generation | DomainBench Finance | BERTScore: 0.7235 | 15 |
| Open-ended generation | Finance | ROUGE-Lsum: 29.19 | 8 |
| Closed-ended reasoning | AIME24 | Accuracy: 0.1333 | 7 |
| Open-ended evaluation | DomainBench (test) | Geography Score: 27.98 | 7 |
| Text Generation | DomainBench Geography | BERTScore: 0.7082 | 7 |
| Text Generation | DomainBench Medicine | BERTScore: 0.701 | 7 |
