CFMS: Towards Explainable and Fine-Grained Chinese Multimodal Sarcasm Detection Benchmark

About

Multimodal sarcasm detection has recently garnered significant attention. However, existing benchmarks suffer from coarse-grained annotations and limited cultural coverage, which hinder research into fine-grained semantic understanding. To address this, we construct CFMS, the first fine-grained multimodal sarcasm dataset tailored for Chinese social media. It comprises 2,796 high-quality image-text pairs and provides a triple-level annotation framework: sarcasm identification, target recognition, and explanation generation. We find that the fine-grained explanation annotations effectively guide AI in generating images with explicit sarcastic intent. Furthermore, we curate a high-consistency parallel Chinese-English metaphor subset (200 entries each), revealing significant limitations of current models in metaphoric reasoning. To overcome the constraints of traditional retrieval methods, we propose a Reinforcement Learning-augmented In-Context Learning strategy (PGDS) to dynamically optimize exemplar selection. Extensive experiments demonstrate that CFMS provides a solid foundation for building reliable multimodal sarcasm understanding systems, and the PGDS method significantly outperforms existing baselines on key tasks. Our data and code are available at https://anonymous.4open.science/r/CFMS-E8F9.
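As a rough illustration of the three annotation levels described above, one image-text pair could be represented as shown below. This is a minimal sketch, not the released schema: the class name `CFMSRecord`, every field name, the use of `None` for non-sarcastic samples, and the helper `identification_accuracy` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class CFMSRecord:
    """Hypothetical record layout; field names are illustrative only."""
    image_path: str             # image half of the image-text pair
    text: str                   # text half of the image-text pair
    is_sarcastic: bool          # level 1: sarcasm identification
    target: Optional[str]       # level 2: sarcasm target (assumed None if not sarcastic)
    explanation: Optional[str]  # level 3: free-text explanation (assumed None if not sarcastic)


def identification_accuracy(records: Sequence[CFMSRecord],
                            predictions: Sequence[bool]) -> float:
    """Percentage of records whose predicted sarcasm label matches the annotation."""
    if len(records) != len(predictions):
        raise ValueError("records and predictions must have the same length")
    if not records:
        return 0.0
    correct = sum(r.is_sarcastic == p for r, p in zip(records, predictions))
    return 100.0 * correct / len(records)
```

Under these assumptions, target recognition and explanation generation would be scored against the `target` and `explanation` fields in the same per-record fashion, with metrics appropriate to each task.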

Junzhao Zhang, Hsiu-Yuan Huang, Chenming Tang, Yutong Yang, Yunfang Wu • 2026

Related benchmarks

Task	Dataset	Result
Explanation Generation	CFMS	BLEU-4: 8.76	17
Identification	CFMS	Accuracy: 78.01	17
Target Recognition	CFMS	Target Accuracy: 50.89	17

