Benchmarking and Improving LVLMs on Event Extraction from Multimedia Documents
About
The proliferation of multimedia content necessitates effective Multimedia Event Extraction (M2E2) systems. Although Large Vision-Language Models (LVLMs) have shown strong cross-modal capabilities, their utility for the M2E2 task remains underexplored. In this paper, we present the first systematic evaluation of representative LVLMs, including DeepSeek-VL2 and the Qwen-VL series, on the M2E2 dataset. Our evaluations cover text-only, image-only, and cross-media subtasks under both few-shot prompting and fine-tuning settings. Three key findings emerge: (1) few-shot LVLMs perform notably better on visual tasks but struggle significantly with textual tasks; (2) fine-tuning LVLMs with LoRA substantially improves performance; and (3) LVLMs exhibit strong synergy when combining modalities, achieving superior performance in cross-modal settings. We further provide a detailed error analysis revealing persistent challenges in semantic precision, localization, and cross-modal grounding, which remain critical obstacles to advancing M2E2 capabilities.
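The few-shot prompting setting evaluated above can be illustrated with a minimal prompt-construction sketch. The instruction wording, example sentences, and event-type labels below are hypothetical placeholders, not the paper's actual prompts.

```python
# Minimal sketch of few-shot prompt construction for event mention
# identification. Demonstrations and event types are illustrative only.

def build_few_shot_prompt(demos, query):
    """Assemble an instruction, k labeled demonstrations, and the query."""
    lines = ["Identify the event type triggered in each sentence.", ""]
    for sentence, event_type in demos:
        lines.append(f"Sentence: {sentence}")
        lines.append(f"Event type: {event_type}")
        lines.append("")
    lines.append(f"Sentence: {query}")
    lines.append("Event type:")  # the model completes this final slot
    return "\n".join(lines)

demos = [
    ("Protesters clashed with police downtown.", "Conflict.Attack"),
    ("The two leaders met in Geneva.", "Contact.Meet"),
]
prompt = build_few_shot_prompt(demos, "Troops moved into the border town.")
print(prompt)
```

In the image-only and cross-media subtasks, the same textual scaffold would accompany the image input passed to the LVLM.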
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Event Mention Identification | M2E2 multimedia | F1 Score (%) | 60 | 15 |
| Argument Role Extraction | M2E2 multimedia | F1 Score (%) | 21.2 | 15 |
| Event Mention Identification | M2E2 image-only | Precision (%) | 69.6 | 14 |
| Argument Role Extraction | M2E2 image-only | Precision (%) | 3.3 | 14 |
| Argument Role Extraction | M2E2 text-only | Precision (%) | 75 | 13 |
| Event Mention Identification | M2E2 text-only | Precision (%) | 13.3 | 13 |
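The precision and F1 figures above follow the standard set-matching definitions used in event extraction evaluation. A minimal sketch of how such scores are computed, with made-up gold and predicted mentions (the tuple layout `(doc_id, trigger_span, event_type)` is an assumption for illustration, not the M2E2 scorer's exact format):

```python
# Standard precision/recall/F1 over predicted vs. gold event mentions.
# Each mention is represented as a (doc_id, trigger_span, event_type) tuple;
# a prediction counts as correct only on an exact match.

def prf1(gold, pred):
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                      # exactly matched mentions
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Illustrative data, not drawn from the M2E2 dataset.
gold = {("d1", (5, 12), "Conflict.Attack"),
        ("d1", (40, 44), "Movement.Transport")}
pred = {("d1", (5, 12), "Conflict.Attack"),
        ("d1", (60, 66), "Contact.Meet")}
p, r, f = prf1(gold, pred)
print(f"P={p:.1%} R={r:.1%} F1={f:.1%}")  # prints "P=50.0% R=50.0% F1=50.0%"
```

Argument-role scoring works the same way, with tuples extended to include the argument span and its role label, which is one reason argument F1 sits far below event-mention F1 in the table.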