
Adapter-state Sharing CLIP for Parameter-efficient Multimodal Sarcasm Detection

About

The growing prevalence of multimodal image-text sarcasm on social media poses challenges for opinion mining systems. Existing approaches rely on full fine-tuning of large models, making them ill-suited to resource-constrained settings. While recent parameter-efficient fine-tuning (PEFT) methods offer promise, their off-the-shelf use underperforms on complex tasks like sarcasm detection. We propose AdS-CLIP (Adapter-state Sharing in CLIP), a lightweight framework built on CLIP that inserts adapters only in the upper layers, preserving low-level unimodal representations in the lower layers, and introduces a novel adapter-state sharing mechanism in which textual adapters guide visual ones to promote efficient cross-modal learning in the upper layers. Experiments on two public benchmarks demonstrate that AdS-CLIP outperforms not only standard PEFT methods but also existing multimodal baselines, with significantly fewer trainable parameters.
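The adapter-state sharing idea can be illustrated with a minimal PyTorch sketch. This is not the paper's implementation: the bottleneck design, the mean-pooling of the textual state, and all names and dimensions here are illustrative assumptions. It only shows the general pattern of a textual adapter's bottleneck state guiding a visual adapter at the same (upper) layer.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual.
    Returns both the adapted features and the bottleneck state so the
    state can be shared with the other modality (hypothetical design)."""
    def __init__(self, dim: int, bottleneck: int):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x, shared_state=None):
        h = self.act(self.down(x))
        if shared_state is not None:
            # Fuse guidance from the other modality's adapter state.
            h = h + shared_state
        return x + self.up(h), h

# Toy dimensions, roughly matching CLIP ViT-B widths (assumed).
dim, bottleneck = 512, 64
text_adapter = Adapter(dim, bottleneck)
visual_adapter = Adapter(dim, bottleneck)

text_feats = torch.randn(2, 77, dim)    # (batch, text tokens, width)
visual_feats = torch.randn(2, 50, dim)  # (batch, image patches, width)

# Textual adapter runs first; its pooled bottleneck state guides
# the visual adapter at the same layer.
text_out, text_state = text_adapter(text_feats)
shared = text_state.mean(dim=1, keepdim=True)  # pool over tokens
visual_out, _ = visual_adapter(visual_feats, shared_state=shared)
```

In a full model, only these adapter parameters in the upper transformer layers would be trained while the CLIP backbone stays frozen, which is what keeps the trainable-parameter count small.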

Soumyadeep Jana, Sahil Danayak, Sanasam Ranbir Singh • 2025

Related benchmarks

Task: Multi-modal sarcasm detection
Dataset: MMSD 2.0
Result: Accuracy 85.6
Rank: 25
