MARVIS: Modality Adaptive Reasoning over VISualizations
About
Predictive applications of machine learning often rely on small (sub-1B-parameter) specialized models tuned to particular domains or modalities. Such models often achieve excellent performance but lack flexibility. LLMs and VLMs offer versatility, but they typically underperform specialized predictors, especially on non-traditional modalities and long-tail domains. We propose MARVIS (Modality Adaptive Reasoning over VISualizations), a system that transforms latent embedding spaces into visual representations and then uses the spatial and fine-grained reasoning skills of VLMs to interpret those visualizations and make predictions from them. MARVIS achieves competitive performance across vision, audio, biological, and tabular domains using a single 3B-parameter model, outperforming Gemini 2.0 by 16% on average. MARVIS substantially narrows the gap between LLM/VLM approaches and specialized domain-specific methods, without requiring any domain-specific training. Code and datasets are available at https://github.com/penfever/marvis.
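To make the "embedding space to visualization" step concrete, the following is a minimal sketch and not the MARVIS implementation. It assumes a PCA projection (the abstract does not name the projection method), uses a character grid as a stand-in for the rendered image a VLM would actually receive, and every function name (`top_components`, `project`, `render`, `build_prompt`, `make_demo`) is hypothetical.

```python
import math
import random


def top_components(rows, k=2, iters=200, seed=0):
    """Estimate the top-k principal directions by power iteration (assumed projection)."""
    rng = random.Random(seed)
    dim = len(rows[0])
    mean = [sum(r[j] for r in rows) / len(rows) for j in range(dim)]
    centred = [[r[j] - mean[j] for j in range(dim)] for r in rows]
    comps = []
    for _ in range(k):
        v = [rng.gauss(0, 1) for _ in range(dim)]
        for _ in range(iters):
            # Multiply by the (unnormalized) covariance matrix X^T X.
            proj = [sum(a * b for a, b in zip(r, v)) for r in centred]
            v = [sum(p * r[j] for p, r in zip(proj, centred)) for j in range(dim)]
            # Remove directions already found, then renormalize.
            for c in comps:
                dot = sum(a * b for a, b in zip(v, c))
                v = [a - dot * b for a, b in zip(v, c)]
            norm = math.sqrt(sum(a * a for a in v)) or 1.0
            v = [a / norm for a in v]
        comps.append(v)
    return mean, comps


def project(row, mean, comps):
    """Map one embedding to low-dimensional coordinates."""
    centred = [x - m for x, m in zip(row, mean)]
    return tuple(sum(a * b for a, b in zip(centred, c)) for c in comps)


def render(points, labels, query, width=40, height=20):
    """Draw labelled points and a '?' query marker on a character grid."""
    xs = [p[0] for p in points] + [query[0]]
    ys = [p[1] for p in points] + [query[1]]
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)

    def cell(p):
        col = int((p[0] - x0) / ((x1 - x0) or 1.0) * (width - 1))
        row = int((p[1] - y0) / ((y1 - y0) or 1.0) * (height - 1))
        return height - 1 - row, col  # flip so larger y is drawn higher

    grid = [["." for _ in range(width)] for _ in range(height)]
    for p, lab in zip(points, labels):
        r, c = cell(p)
        grid[r][c] = str(lab)[:1]  # labels are assumed to be single characters
    r, c = cell(query)
    grid[r][c] = "?"  # drawn last so the query is always visible
    return "\n".join("".join(row) for row in grid)


def build_prompt(plot, classes):
    """Wrap the visualization in a hypothetical classification instruction."""
    return (
        "Each letter is a labelled training example; '?' is the query.\n"
        f"{plot}\n"
        f"Which class ({', '.join(classes)}) does '?' most likely belong to?"
    )


def make_demo(seed=0):
    """Synthetic two-cluster 'embeddings'; illustrative data only."""
    rng = random.Random(seed)
    a = [[rng.gauss(0, 0.3) for _ in range(8)] for _ in range(15)]
    b = [[rng.gauss(3, 0.3) for _ in range(8)] for _ in range(15)]
    query = [rng.gauss(3, 0.3) for _ in range(8)]
    return a + b, ["A"] * 15 + ["B"] * 15, query


if __name__ == "__main__":
    embs, labels, q = make_demo()
    mean, comps = top_components(embs)
    pts = [project(e, mean, comps) for e in embs]
    print(build_prompt(render(pts, labels, project(q, mean, comps)), ["A", "B"]))
```

In a real pipeline the embeddings would come from a modality-specific encoder and the plot would be rendered as an image for the VLM; the sketch only illustrates why a spatial layout with a marked query point turns prediction into a visual reasoning question.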
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Image Classification | CIFAR-10 | Accuracy | 98 | 875 |
| Audio Classification | ESC-50 | Accuracy | 91.3 | 441 |
| Leaf Disease Classification | PlantDoc | Accuracy | 67.4 | 21 |
| Tabular Classification | OpenML CC18 | Mean Accuracy | 84.5 | 12 |
| Regression | OpenML Regression | Mean R² | 53.2 | 7 |
| Biological Image Classification | FishNet | Accuracy | 80.2 | 3 |
| Image Classification | CIFAR-100 | Accuracy | 88 | 3 |
| Tabular Regression | Regression 2025 | R² Score | 66 | 3 |
| Biological Image Classification | AWA2 | Accuracy | 95.7 | 3 |