
LEVI: Generalizable Fine-tuning via Layer-wise Ensemble of Different Views

About

Fine-tuning is becoming widely used for leveraging the power of pre-trained foundation models in new downstream tasks. While fine-tuning has seen many successes across tasks, recent studies have observed challenges in the generalization of fine-tuned models to unseen distributions (i.e., out-of-distribution; OOD). To improve OOD generalization, some previous studies identify the limitations of fine-tuning data and regulate fine-tuning to preserve the general representation learned from pre-training data. However, potential limitations in the pre-training data and models are often ignored. In this paper, we contend that overly relying on the pre-trained representation may hinder fine-tuning from learning essential representations for downstream tasks and thus hurt its OOD generalization. This can be especially catastrophic when new tasks come from different (sub)domains than the pre-training data. To address the issues in both pre-training and fine-tuning data, we propose a novel generalizable fine-tuning method, LEVI (Layer-wise Ensemble of different VIews), in which the pre-trained model is adaptively ensembled layer-wise with a small task-specific model while preserving its efficiency. By combining two complementary models, LEVI effectively suppresses problematic features in both the fine-tuning data and the pre-trained model, and preserves features useful for new tasks. Extensive experiments with large language and vision models show that LEVI greatly improves fine-tuning generalization by emphasizing different views from the fine-tuning data and the pre-trained features.
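The page does not give LEVI's exact architecture, so the following is only a toy sketch of the general idea the abstract describes: a (frozen) pre-trained tower and a small task-specific tower whose hidden states are mixed at every layer. All dimensions, the projection matrices, and the fixed mixing weight `alpha` are illustrative assumptions (in practice the mixing would be learned per layer).

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(d_in, d_out):
    # Random weight matrix standing in for a trained linear layer.
    return rng.normal(scale=0.1, size=(d_in, d_out))

# Hypothetical sizes: a "large" pre-trained tower and a small task tower.
D_PRE, D_TASK, DEPTH = 8, 4, 3

pre_layers  = [linear(D_PRE, D_PRE)   for _ in range(DEPTH)]  # pre-trained view
task_layers = [linear(D_TASK, D_TASK) for _ in range(DEPTH)]  # small task-specific view
proj        = [linear(D_TASK, D_PRE)  for _ in range(DEPTH)]  # align widths for mixing
alpha = 0.5  # layer-wise mixing weight (fixed here; adaptive/learned in the paper's setting)

def forward(x_pre, x_task):
    h_pre, h_task = x_pre, x_task
    for W_pre, W_task, W_proj in zip(pre_layers, task_layers, proj):
        h_pre  = np.tanh(h_pre @ W_pre)
        h_task = np.tanh(h_task @ W_task)
        # Layer-wise ensemble: combine the two views at every layer,
        # rather than only averaging final outputs.
        h_pre = alpha * h_pre + (1 - alpha) * (h_task @ W_proj)
    return h_pre

out = forward(rng.normal(size=(2, D_PRE)), rng.normal(size=(2, D_TASK)))
print(out.shape)  # (2, 8)
```

Mixing at every layer (instead of ensembling only the final predictions) is what lets the small model's task-specific features correct intermediate pre-trained representations.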

Yuji Roh, Qingyun Liu, Huan Gui, Zhe Yuan, Yujin Tang, Steven Euijong Whang, Liang Liu, Shuchao Bi, Lichan Hong, Ed H. Chi, Zhe Zhao • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Object Hallucination Evaluation | POPE | -- | 935 |
| Image Classification | Flowers102 | Accuracy: 3.7 | 478 |
| Image Classification | Food101 | Accuracy: 20.8 | 309 |
| Multimodal Model Evaluation | MMBench | Accuracy: 64 | 180 |
| Image Classification | Caltech101 | Accuracy: 43.5 | 162 |
| Multimodal Model Evaluation | MME | Total Score: 1750 | 63 |
| Scientific Question Answering | ScienceQA (image) | Accuracy: 69.4 | 53 |
| Visual Perception | MMVP | Accuracy: 61.3 | 47 |
| Vision-centric Evaluation | CV-Bench | Accuracy: 0.474 | 21 |
| Visual Question Answering | TextVQA | Accuracy: 49.2 | 7 |
