
Reconstruction as a Bridge for Event-Based Visual Question Answering

About

Integrating event cameras with Multimodal Large Language Models (MLLMs) promises general scene understanding in challenging visual conditions, yet requires navigating a trade-off between preserving the unique advantages of event data and ensuring compatibility with frame-based models. We address this challenge by using reconstruction as a bridge, proposing a straightforward Frame-based Reconstruction and Tokenization (FRT) method and designing an efficient Adaptive Reconstruction and Tokenization (ART) method that leverages event sparsity. For robust evaluation, we introduce EvQA, the first objective, real-world benchmark for event-based MLLMs, comprising 1,000 event-Q&A pairs from 22 public datasets. Our experiments demonstrate that our methods achieve state-of-the-art performance on EvQA, highlighting the significant potential of MLLMs in event-based vision.
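To make the "reconstruction as a bridge" idea concrete: event cameras emit asynchronous (x, y, polarity) events rather than frames, and the simplest way to feed them to a frame-based model is to accumulate events into an intensity-like image. The sketch below is a minimal, illustrative baseline of that accumulation step; `events_to_frame` is a hypothetical helper and does not represent the paper's FRT or ART methods.

```python
import numpy as np

def events_to_frame(events, height, width):
    """Accumulate (x, y, polarity) events into one grayscale frame.

    Illustrative baseline only: positive events brighten a pixel,
    negative events darken it, and the result is normalized to [0, 1]
    around a mid-gray background. Not the FRT/ART method from the paper.
    """
    frame = np.zeros((height, width), dtype=np.float32)
    for x, y, p in events:
        # Each event nudges its pixel up (p > 0) or down (p <= 0).
        frame[y, x] += 1.0 if p > 0 else -1.0
    peak = np.abs(frame).max()
    if peak > 0:
        # Scale so the strongest pixel reaches 0 or 1; background stays 0.5.
        frame = 0.5 + 0.5 * frame / peak
    else:
        frame += 0.5
    return frame

# Two positive events at (x=1, y=2) and one negative event at (x=3, y=0).
events = [(1, 2, 1), (1, 2, 1), (3, 0, -1)]
frame = events_to_frame(events, height=4, width=4)
```

A frame produced this way can be tokenized by any off-the-shelf frame-based vision encoder, which is the compatibility the abstract trades against preserving the raw event stream's sparsity and temporal resolution.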

Hanyue Lou, Jiayi Zhou, Yang Zhang, Boyu Li, Yi Wang, Guangnan Ye, Boxin Shi • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Event-based Video Question Answering | EvQA 1000 Questions (full) | Accuracy: 76.1 | 28 |
| Event-based Video Question Answering | EvQA-Sparse 200 Questions | Accuracy: 66.0 | 28 |
| Event-based Visual Question Answering | EvQA | Total Accuracy: 0.673 | 3 |
