Text-Conditioned Resampler For Long Form Video Understanding
About
In this paper we present a text-conditioned video resampler (TCR) module that uses a pre-trained and frozen visual encoder and large language model (LLM) to process long video sequences for a task. TCR localises relevant visual features from the video given a text condition and provides them to an LLM to generate a text response. Due to its lightweight design and use of cross-attention, TCR can process more than 100 frames at a time with plain attention and without optimised implementations. We make the following contributions: (i) we design a transformer-based sampling architecture that can process long videos conditioned on a task, together with a training method that enables it to bridge pre-trained visual and language models; (ii) we identify tasks that could benefit from longer video perception; and (iii) we empirically validate TCR's efficacy on a wide variety of evaluation tasks including NExT-QA, EgoSchema, and the EGO4D-LTA challenge.
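The idea of resampling many frame features into a small, text-conditioned set of tokens via cross-attention can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the single-head attention, the self-attention mixing step between latent queries and text tokens, the random weights, and all names and sizes (`resample`, `num_latents`, `tokens_per_frame`, and so on) are assumptions chosen for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q_in, kv_in, w_q, w_k, w_v):
    """Plain single-head attention: rows of q_in attend to rows of kv_in."""
    q, k, v = q_in @ w_q, kv_in @ w_k, kv_in @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return weights @ v, weights

def resample(latents, text_tokens, visual_feats, params):
    """Hypothetical one-layer resampler (an assumption, not the authors' design)."""
    # Latent queries and text tokens share one sequence, so a self-attention
    # step lets the text condition the latents before they read the video.
    queries = np.concatenate([latents, text_tokens], axis=0)
    mixed, _ = attention(queries, queries, *params["self"])
    queries = queries + mixed  # residual connection
    # Cross-attention: the short query sequence attends to all frame features.
    out, attn = attention(queries, visual_feats, *params["cross"])
    return out[: len(latents)], attn

rng = np.random.default_rng(0)
dim = 64
num_frames, tokens_per_frame = 100, 16   # illustrative sizes only
num_latents, num_text_tokens = 32, 8

def proj():
    # Random projection standing in for a learned weight matrix.
    return rng.standard_normal((dim, dim)) / np.sqrt(dim)

params = {"self": (proj(), proj(), proj()), "cross": (proj(), proj(), proj())}

# Stand-ins for frozen visual encoder outputs and embedded text.
visual_feats = rng.standard_normal((num_frames * tokens_per_frame, dim))
text_tokens = rng.standard_normal((num_text_tokens, dim))
latents = rng.standard_normal((num_latents, dim))

resampled, attn = resample(latents, text_tokens, visual_feats, params)
print(resampled.shape, attn.shape)
```

In this sketch the cross-attention weight matrix has one row per query and one column per visual token, so its size grows linearly with the number of frames rather than quadratically, which is one plausible reading of why plain attention can suffice for more than 100 frames.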
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Question Answering | NExT-QA (val) | Overall Acc | 73.5 | 176 |
| Moment Query | Ego4D Moment Query (val) | R@1 (IoU=0.5) | 43.72 | 19 |
| Repetitive Action Counting | Countix (test) | MAE | 0.33 | 8 |
| Long-form Video Question Answering | EgoSchema EGO4D | QA Accuracy | 35.1 | 7 |
| Long-term Action Anticipation | EGO4D LTA challenge (val) | Verb ED | 0.6585 | 6 |