
TRACER: Persistent Regularization for Robust Multimodal Finetuning

About

Mainstream strategies for finetuning pretrained multimodal models often degrade out-of-distribution (OOD) robustness, a phenomenon known as catastrophic forgetting. In this paper, we develop a theoretical framework for multimodal contrastive finetuning, yielding closed-form solutions and a geometric decomposition for each strategy. This framework shows that self-distillation is more effective than other regularization approaches to retain the knowledge of the pretrained model. Our analysis reveals a largely overlooked limitation: standard Exponential Moving Average (EMA) teachers, widely used in robust finetuning, suffer from collapse. To solve this, we prove that a Weighted Moving Average (WMA) teacher maintains a persistent regularizing force over finite horizons and yields bias-free convergence in the task subspace while preserving orthogonal knowledge. These insights motivate **TRACER** (**T**rajectory-**R**obust **A**nchoring for **C**ontrastive **E**ncoder **R**egularization), which combines contrastive learning with WMA-guided multi-perspective distillation. Extensive experiments on CLIP finetuning demonstrate consistent OOD accuracy and calibration gains across three backbone architectures, and comprehensive ablations confirm that TRACER is both principled and robust to hyperparameter choices. Code is available at [https://github.com/HesamAsad/TRACER](https://github.com/HesamAsad/TRACER).
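The contrast between EMA and WMA teachers can be illustrated with a small sketch. The abstract does not give the weighting TRACER uses, so the code below assumes the simplest possible weighted average (uniform weights over all checkpoints); the function names `ema_update`, `wma_update`, and `simulate` are hypothetical and are not taken from the TRACER repository. The sketch only tracks how much weight each teacher still places on the pretrained parameters after a number of updates.

```python
def ema_update(teacher, student, beta):
    """One EMA step: teacher <- beta * teacher + (1 - beta) * student."""
    return [beta * t + (1.0 - beta) * s for t, s in zip(teacher, student)]


def wma_update(teacher, weight_sum, student, weight):
    """One weighted-moving-average step.

    The teacher is the weighted mean of all checkpoints seen so far.
    Returns the new teacher and the new accumulated weight.
    """
    new_sum = weight_sum + weight
    new_teacher = [
        (weight_sum * t + weight * s) / new_sum
        for t, s in zip(teacher, student)
    ]
    return new_teacher, new_sum


def simulate(steps, beta=0.99):
    """Track the weight each teacher keeps on the pretrained parameters.

    Toy setup (an assumption for illustration): the pretrained parameter
    is 1.0 and the student sits at 0.0 for every step, so the teacher's
    value equals the weight it still assigns to the pretrained checkpoint.
    """
    pretrained = [1.0]
    student = [0.0]
    ema_teacher = list(pretrained)
    wma_teacher, weight_sum = list(pretrained), 1.0
    for _ in range(steps):
        ema_teacher = ema_update(ema_teacher, student, beta)
        # Uniform weights (weight=1.0) are an assumption, not the paper's choice.
        wma_teacher, weight_sum = wma_update(wma_teacher, weight_sum, student, 1.0)
    return ema_teacher[0], wma_teacher[0]


if __name__ == "__main__":
    ema_w, wma_w = simulate(steps=1000, beta=0.99)
    print(f"EMA weight on pretrained params: {ema_w:.2e}")
    print(f"WMA weight on pretrained params: {wma_w:.2e}")
```

Under these assumptions the EMA teacher's weight on the pretrained checkpoint decays geometrically (as `beta ** steps`), while the uniform average decays only as `1 / (steps + 1)`. This is one way to read the abstract's claim that a WMA teacher keeps a persistent regularizing force over finite horizons; the paper's actual analysis and weighting scheme may differ.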

Hesam Asadollahzadeh, Feng Liu, Christopher Leckie, Sarah M. Erfani • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Image Classification | ImageNet V2 | Top-1 Acc 78.54 | 749 |
| Image Classification | ImageNet A | Top-1 Acc 74.87 | 698 |
| Image Classification | ImageNet-Sketch | Top-1 Accuracy 53.69 | 473 |
| Image Classification | ObjectNet | Accuracy 69.76 | 251 |
| Image Classification | ImageNet Rendition | Top-1 Accuracy 79.33 | 113 |
| Image Classification | ImageNet-Sketch | Accuracy 63.71 | 89 |
| Image Classification | ImageNet-Sketch | -- | 63 |
| Image Classification | ImageNet (INet) | Accuracy 86.27 | 62 |
| Image Classification | ImageNet (val) | -- | 46 |
| Image Classification | ImageNet-Adversarial | Top-1 Acc 54.92 | 39 |
Showing 10 of 17 rows
