VIBE: Video-Input Brain Encoder for fMRI Response Modeling

About

We present VIBE, a two-stage Transformer that fuses multi-modal video, audio, and text features to predict fMRI activity. Representations from open-source models (Qwen2.5, BEATs, Whisper, SlowFast, V-JEPA) are merged by a modality-fusion transformer and temporally decoded by a prediction transformer with rotary embeddings. Trained on 65 hours of movie data from the CNeuroMod dataset and ensembled across 20 seeds, VIBE attains mean parcel-wise Pearson correlations of 0.3225 on in-distribution Friends S07 and 0.2125 on six out-of-distribution films. An earlier iteration of the same architecture obtained 0.3198 and 0.2096, respectively, winning Phase-1 and placing second overall in the Algonauts 2025 Challenge.
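
The reported metric, mean parcel-wise Pearson correlation, can be illustrated with a minimal pure-Python sketch. The abstract does not say how the 20 seeds are combined, so the element-wise averaging of predictions here is an assumption, and the function names (`pearson`, `ensemble_mean`, `mean_parcelwise_pearson`) are hypothetical rather than taken from the VIBE code. Predictions and targets are assumed to be `[time][parcel]` nested lists.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant series is scored as 0
    return cov / sqrt(var_x * var_y)


def ensemble_mean(seed_preds):
    """Average predictions element-wise across seeds (assumed ensembling rule).

    seed_preds: list over seeds of [time][parcel] matrices.
    """
    n_seeds = len(seed_preds)
    n_time = len(seed_preds[0])
    n_parcels = len(seed_preds[0][0])
    return [
        [sum(p[t][k] for p in seed_preds) / n_seeds for k in range(n_parcels)]
        for t in range(n_time)
    ]


def mean_parcelwise_pearson(pred, target):
    """Correlate each parcel's time series separately, then average over parcels."""
    n_parcels = len(target[0])
    scores = [
        pearson([row[k] for row in pred], [row[k] for row in target])
        for k in range(n_parcels)
    ]
    return sum(scores) / n_parcels


# Toy data: 3 time points, 2 parcels, 2 seeds (illustrative values only).
target = [[1, 2], [2, 1], [3, 0]]
seed_a = [[0, 4], [2, 2], [4, 0]]
seed_b = [[2, 0], [2, 0], [2, 0]]
ensemble = ensemble_mean([seed_a, seed_b])
score = mean_parcelwise_pearson(ensemble, target)
```

In this toy case the averaged prediction matches the target exactly, so the score is 1. Correlating per parcel before averaging means every parcel contributes equally to the score, regardless of its signal amplitude.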

Daniel Carlström Schad, Shrey Dixit, Janis Keck, Viktor Studenyak, Aleksandr Shpilevoi, Andrej Bicanski • 2025

Related benchmarks

Task	Dataset	Metric	Result	Rank
Brain response prediction	Friends In-distribution s07 (test)	Mean Pearson r	0.32	13
Brain response prediction	OOD benchmark	Mean Pearson r	0.21	13

