FBHM: Functional Benchmarking and Steering of VLMs for Hateful Meme Detection

About

Hateful meme detection remains a formidable challenge for vision-language models (VLMs). Existing benchmarks are structurally observational: they confound rhetorical hate mechanisms with target-community features, which prevents causal evaluation of model vulnerabilities. To address this, we introduce FBHM, a systematically curated benchmark of Functionality Based Hateful Memes constructed along two orthogonal axes: 25 distinct rhetorical functionalities and 10 target communities (5,000 memes total). Benchmarking state-of-the-art VLMs reveals a severe generalization gap: models that are highly accurate on standard datasets drop to near-random performance on FBHM, indicating that they exploit dataset-specific heuristics rather than robust multimodal reasoning. To close this gap efficiently, we propose LSV (learnable steering vectors), a strategy for the ultra-low-data regime that applies a causal intervention objective to as few as 500 steering samples (50 unique base memes). LSV improves FBHM performance by ~30 Macro-F1 points and outperforms in-context learning and PEFT without degrading source-domain performance.
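Since the abstract reports gains in Macro-F1 on a benchmark organized along two axes, the following is a minimal sketch of how Macro-F1 could be computed overall and broken down per axis value. It is an illustration only, not the authors' evaluation code: the record fields `functionality`, `label`, and `pred`, and the helper names, are hypothetical.

```python
from collections import defaultdict


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def macro_f1_by_group(records, key):
    """Macro-F1 per value of `key` (hypothetical field, e.g. 'functionality')."""
    groups = defaultdict(lambda: ([], []))
    for record in records:
        truths, preds = groups[record[key]]
        truths.append(record["label"])
        preds.append(record["pred"])
    return {group: macro_f1(t, p) for group, (t, p) in groups.items()}


# Toy data, invented purely for illustration (1 = hateful, 0 = not hateful).
records = [
    {"functionality": "A", "label": 1, "pred": 1},
    {"functionality": "A", "label": 0, "pred": 0},
    {"functionality": "B", "label": 1, "pred": 0},
    {"functionality": "B", "label": 0, "pred": 0},
]
per_functionality = macro_f1_by_group(records, "functionality")
```

Because Macro-F1 averages per-class F1 without weighting by class frequency, a model that predicts only the majority class scores poorly on it even when its accuracy looks acceptable, which is why the two metrics can diverge on the same test set.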

Paramananda Bhaskar, Naquee Rizwan, Daksh Jogchand, Saurabh Kumar Pandey, Animesh Mukherjee • 2026

Related benchmarks

Task	Dataset	Result
Harmful Meme Detection	FHM (test)	Accuracy 76.2	51
Harmful Meme Detection	MAMI (test)	Accuracy 81.9	51
Hateful Meme Detection	FBHM (test)	Accuracy 78.42	41

