MARVIS: Modality Adaptive Reasoning over VISualizations

About

Predictive applications of machine learning often rely on small (sub-1B-parameter) specialized models tuned to particular domains or modalities. Such models often achieve excellent performance but lack flexibility. LLMs and VLMs offer versatility but typically underperform specialized predictors, especially on non-traditional modalities and long-tail domains. We propose MARVIS (Modality Adaptive Reasoning over VISualizations), a system that transforms latent embedding spaces into visual representations and then leverages the spatial and fine-grained reasoning skills of VLMs to interpret those visualizations and make predictions from them. MARVIS achieves competitive performance across vision, audio, biological, and tabular domains using a single 3B-parameter model, beating Gemini 2.0 by 16% on average. MARVIS drastically reduces the gap between LLM/VLM approaches and specialized domain-specific methods, without requiring any domain-specific training. Code and datasets are available at https://github.com/penfever/marvis.
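
The idea in the abstract (embed, visualize, then ask a VLM) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which projection or plot style MARVIS uses, so PCA via power iteration is an assumed stand-in, and `pca_2d`, `render_svg`, `build_query_plot`, and `build_prompt` are hypothetical helper names. The sketch projects labelled embeddings plus one query embedding to 2D, draws them as an SVG scatter plot, and builds a text prompt that could accompany the image. A real pipeline would likely need to rasterize the plot before passing it to a VLM.

```python
import random

# Hypothetical colour palette, one colour per class index.
PALETTE = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728"]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pca_2d(rows, iters=100, seed=0):
    """Project rows onto their top-2 principal directions (power iteration)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Unnormalised covariance; scaling does not change the directions.
    cov = [[sum(r[i] * r[j] for r in centered) for j in range(d)] for i in range(d)]
    rng = random.Random(seed)
    comps = []
    for _ in range(2):
        v = [rng.gauss(0, 1) for _ in range(d)]
        for _ in range(iters):
            w = [dot(cov[i], v) for i in range(d)]
            for c in comps:  # deflate: stay orthogonal to earlier components
                proj = dot(w, c)
                w = [a - proj * b for a, b in zip(w, c)]
            norm = dot(w, w) ** 0.5 or 1.0
            v = [a / norm for a in w]
        comps.append(v)
    return [[dot(r, c) for c in comps] for r in centered]

def render_svg(points, labels, query, size=300, pad=20):
    """Draw labelled points as coloured circles and the query as a black square."""
    xs = [p[0] for p in points] + [query[0]]
    ys = [p[1] for p in points] + [query[1]]

    def scale(v, lo, hi):
        span = (hi - lo) or 1.0
        return pad + (v - lo) / span * (size - 2 * pad)

    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" width="{size}" height="{size}">']
    for (x, y), lab in zip(points, labels):
        sx, sy = scale(x, min(xs), max(xs)), scale(y, min(ys), max(ys))
        parts.append(f'<circle cx="{sx:.1f}" cy="{sy:.1f}" r="4" fill="{PALETTE[lab % len(PALETTE)]}"/>')
    qx, qy = scale(query[0], min(xs), max(xs)), scale(query[1], min(ys), max(ys))
    parts.append(f'<rect x="{qx - 5:.1f}" y="{qy - 5:.1f}" width="10" height="10" fill="black"/>')
    parts.append("</svg>")
    return "\n".join(parts)

def build_query_plot(train_emb, train_labels, query_emb):
    # Project the query together with the labelled points so they share axes.
    coords = pca_2d(train_emb + [query_emb])
    return render_svg(coords[:-1], train_labels, coords[-1])

def build_prompt(class_names):
    legend = ", ".join(f"{PALETTE[i % len(PALETTE)]} = {name}" for i, name in enumerate(class_names))
    return (f"The plot shows labelled points ({legend}) and one black square. "
            "Which class does the black square most likely belong to?")
```

Calling `build_query_plot(train_emb, train_labels, query_emb)` returns the SVG markup as a string, and `build_prompt(["cat", "dog"])` returns the accompanying question. The prediction itself would come from whatever VLM receives the image and prompt, which this sketch leaves out.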

Benjamin Feuer, Lennart Purucker, Oussama Elachqar, Chinmay Hegde • 2025

Related benchmarks

Task	Dataset	Metric	Result
Image Classification	CIFAR-10	Accuracy	98	973
Audio Classification	ESC-50	Accuracy	91.3	461
Leaf Disease Classification	PlantDoc	Accuracy	67.4	21
Tabular Classification	OpenML CC18	Mean Accuracy	84.5	12
Regression	OpenML Regression	Mean R2	53.2	7
Biological Image Classification	FishNet	Accuracy	80.2	3
Image Classification	CIFAR-100	Accuracy	88	3
Tabular Regression	Regression 2025	R2 Score	66	3
Biological Image Classification	AWA2	Accuracy	95.7	3

