
Reconstruction as a Bridge for Event-Based Visual Question Answering

About

Integrating event cameras with Multimodal Large Language Models (MLLMs) promises general scene understanding in challenging visual conditions, yet requires navigating a trade-off between preserving the unique advantages of event data and ensuring compatibility with frame-based models. We address this challenge by using reconstruction as a bridge, proposing a straightforward Frame-based Reconstruction and Tokenization (FRT) method and designing an efficient Adaptive Reconstruction and Tokenization (ART) method that leverages event sparsity. For robust evaluation, we introduce EvQA, the first objective, real-world benchmark for event-based MLLMs, comprising 1,000 event-Q&A pairs from 22 public datasets. Our experiments demonstrate that our methods achieve state-of-the-art performance on EvQA, highlighting the significant potential of MLLMs in event-based vision.
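To make the "reconstruction as a bridge" idea concrete: event cameras emit asynchronous (x, y, polarity) events rather than frames, and the simplest way to feed them to a frame-based model is to accumulate events into an intensity-like image. The sketch below is a minimal, illustrative baseline of that accumulation step; `events_to_frame` is a hypothetical helper and does not represent the paper's FRT or ART methods.

```python
import numpy as np

def events_to_frame(events, height, width):
    """Accumulate (x, y, polarity) events into one grayscale frame.

    Illustrative baseline only: positive events brighten a pixel,
    negative events darken it, and the result is normalized to [0, 1]
    around a mid-gray background. Not the FRT/ART method from the paper.
    """
    frame = np.zeros((height, width), dtype=np.float32)
    for x, y, p in events:
        # Each event nudges its pixel up (p > 0) or down (p <= 0).
        frame[y, x] += 1.0 if p > 0 else -1.0
    peak = np.abs(frame).max()
    if peak > 0:
        # Scale so the strongest pixel reaches 0 or 1; background stays 0.5.
        frame = 0.5 + 0.5 * frame / peak
    else:
        frame += 0.5
    return frame

# Two positive events at (x=1, y=2) and one negative event at (x=3, y=0).
events = [(1, 2, 1), (1, 2, 1), (3, 0, -1)]
frame = events_to_frame(events, height=4, width=4)
```

A frame produced this way can be tokenized by any off-the-shelf frame-based vision encoder, which is the compatibility the abstract trades against preserving the raw event stream's sparsity and temporal resolution.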

Hanyue Lou, Jiayi Zhou, Yang Zhang, Boyu Li, Yi Wang, Guangnan Ye, Boxin Shi • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Event-based Video Question Answering | EvQA 1000 Questions (full) | Accuracy: 76.1 | 28 |
| Event-based Video Question Answering | EvQA-Sparse 200 Questions | Accuracy: 66.0 | 28 |
| Event-based Visual Question Answering | EvQA | Total Accuracy: 0.673 | 3 |
