On the Wings of Imagination: Conflicting Script-based Multi-role Framework for Humor Caption Generation
About
Humor is a common yet intricate form of human language in daily life. Humor generation, especially in multi-modal scenarios, is a challenging task for large language models (LLMs). It is typically framed as funny caption generation for images, which requires visual understanding, humor reasoning, and creative imagination, among other abilities. Existing LLM-based approaches rely on reasoning chains or self-improvement, which suffer from limited creativity and interpretability. To address these bottlenecks, we develop a novel LLM-based humor generation mechanism based on a fundamental humor theory, the General Theory of Verbal Humor (GTVH). To produce funny and script-opposite captions, we introduce a humor-theory-driven multi-role LLM collaboration framework augmented with humor retrieval (HOMER). The framework consists of three LLM-based roles: (1) a conflicting-script extractor that grounds humor in key script oppositions, forming the basis of caption generation; (2) a retrieval-augmented hierarchical imaginator that identifies key humor targets and expands their creative space through diverse associations structured as imagination trees; and (3) a caption generator that produces funny and diverse captions conditioned on the obtained knowledge. Extensive experiments on two New Yorker Cartoon benchmark datasets show that HOMER outperforms state-of-the-art baselines and powerful LLM reasoning strategies on multi-modal humor captioning.
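As an illustration only, the three roles described above might be wired together as in the sketch below. This is not the authors' implementation: the function names, prompts, `ImaginationNode` class, and `stub_llm` stand-in are all hypothetical, the humor target is passed in rather than identified by a model, and retrieval is reduced to a list of example captions supplied by the caller.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical type: any function mapping a prompt string to a reply string.
LLM = Callable[[str], str]


@dataclass
class ImaginationNode:
    """One node of an imagination tree: a concept and its associations."""
    concept: str
    children: List["ImaginationNode"] = field(default_factory=list)


def extract_conflicting_scripts(llm: LLM, scene: str) -> str:
    """Role 1 (assumed prompt): name the script opposition in the scene."""
    return llm(f"Name two opposing scripts in this scene: {scene}")


def build_imagination_tree(llm: LLM, target: str, retrieved: List[str],
                           depth: int = 2, breadth: int = 2) -> ImaginationNode:
    """Role 2 (assumed design): recursively expand a target into associations."""
    node = ImaginationNode(target)
    if depth == 0:
        return node
    hints = "; ".join(retrieved)  # retrieved humor examples used as context
    reply = llm(f"List {breadth} associations for '{target}', "
                f"comma-separated. Examples: {hints}")
    concepts = [c.strip() for c in reply.split(",") if c.strip()][:breadth]
    for concept in concepts:
        node.children.append(
            build_imagination_tree(llm, concept, retrieved, depth - 1, breadth))
    return node


def flatten(node: ImaginationNode) -> List[str]:
    """Collect every concept in the tree in pre-order."""
    out = [node.concept]
    for child in node.children:
        out.extend(flatten(child))
    return out


def generate_captions(llm: LLM, scripts: str, tree: ImaginationNode,
                      n: int = 3) -> List[str]:
    """Role 3 (assumed prompt): condition captions on scripts and tree ideas."""
    ideas = ", ".join(flatten(tree))
    return [llm(f"Write funny caption #{i + 1}. Scripts: {scripts}. Ideas: {ideas}")
            for i in range(n)]


def stub_llm(prompt: str) -> str:
    """Deterministic stand-in for a real model so the sketch runs offline."""
    if prompt.startswith("List"):
        return "meeting, coffee"
    return f"[reply to: {prompt[:30]}...]"
```

Swapping `stub_llm` for a real model client would leave the control flow unchanged, since each role only depends on the prompt-to-string interface.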
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Funny Caption Generation | Humor in AI (#Top10) | Top-1 Accuracy | 66.41 | 32 |
| Funny Caption Generation | Humor in AI (#200-209) | Recall@1 | 73.4 | 32 |
| Funny Caption Generation | Humor in AI (#1000-1009) | Top-1 Accuracy | 76.32 | 32 |
| Funny Caption Generation | Electric sheep High-Humor | Recall@1 | 75.53 | 32 |
| Funny Caption Generation | Electric sheep Low-Humor | Top-1 Accuracy | 79.45 | 32 |
| Humor Generation | Electronic Sheep (test) | Visual Understanding Avg Rank | 1.8 | 8 |
| Humorous Caption Generation | Humor in AI Dataset (test) | Visual Understanding Rank | 2.5 | 8 |
| Meme Caption Generation | ImgFlip | pass@1 | 83.33 | 8 |
| Meme Generation | Meme ImgFlip (test) | Pass@1 | 83.33 | 8 |
| Funny Caption Generation | Humor in AI (test) | p-value | 0.00e+0 | 7 |