ARC Is a Vision Problem!
About
The Abstraction and Reasoning Corpus (ARC) is designed to promote research on abstract reasoning, a fundamental aspect of human intelligence. Common approaches to ARC treat it as a language-oriented problem, addressed by large language models (LLMs) or recurrent reasoning models. However, although the puzzle-like tasks in ARC are inherently visual, existing research has rarely approached the problem from a vision-centric perspective. In this work, we formulate ARC within a vision paradigm, framing it as an image-to-image translation problem. To incorporate visual priors, we represent the inputs on a "canvas" that can be processed like natural images. It is then natural for us to apply standard vision architectures, such as a vanilla Vision Transformer (ViT), to perform image-to-image mapping. Our model is trained from scratch solely on ARC data and generalizes to unseen tasks through test-time training. Our framework, termed Vision ARC (VARC), achieves 60.4% accuracy on the ARC-1 benchmark, substantially outperforming existing methods that are also trained from scratch. Our results are competitive with those of leading LLMs and close the gap to average human performance.
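The "canvas" idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the canvas size, the integer upscaling factor, the placement offset, the extra background index `BACKGROUND`, and the helper names `place_on_canvas` and `to_patches` are all assumptions made for illustration. The sketch paints a small ARC grid (color indices 0 to 9) onto a larger fixed-size canvas and then splits the canvas into non-overlapping patches, the kind of token sequence a vanilla ViT would consume.

```python
from typing import List

Grid = List[List[int]]

# Hypothetical extra index marking "empty canvas", outside ARC's 0-9 colors.
BACKGROUND = 10


def place_on_canvas(grid: Grid, canvas_size: int = 32, scale: int = 1,
                    top: int = 0, left: int = 0) -> Grid:
    """Paint `grid` onto a square canvas; each cell becomes a scale x scale block."""
    height, width = len(grid), len(grid[0])
    if top < 0 or left < 0:
        raise ValueError("offsets must be non-negative")
    if top + height * scale > canvas_size or left + width * scale > canvas_size:
        raise ValueError("grid does not fit on the canvas at this scale/offset")
    canvas = [[BACKGROUND] * canvas_size for _ in range(canvas_size)]
    for r in range(height):
        for c in range(width):
            for dr in range(scale):
                for dc in range(scale):
                    canvas[top + r * scale + dr][left + c * scale + dc] = grid[r][c]
    return canvas


def to_patches(canvas: Grid, patch: int = 2) -> List[List[int]]:
    """Split the canvas into non-overlapping patch x patch tiles (row-major order)."""
    size = len(canvas)
    if size % patch != 0:
        raise ValueError("canvas size must be divisible by the patch size")
    tokens = []
    for r in range(0, size, patch):
        for c in range(0, size, patch):
            tokens.append([canvas[r + dr][c + dc]
                           for dr in range(patch) for dc in range(patch)])
    return tokens


if __name__ == "__main__":
    grid = [[1, 2], [3, 4]]
    canvas = place_on_canvas(grid, canvas_size=8, scale=2, top=2, left=2)
    for row in canvas:
        print(row)
    print(len(to_patches(canvas, patch=2)), "patch tokens")
```

Varying `scale`, `top`, and `left` gives different renderings of the same grid, which is one way a canvas representation could support image-style augmentation; whether and how VARC does this is not stated in the text above.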
Related benchmarks
| Task | Dataset | Metric | Result (%) | Rank |
|---|---|---|---|---|
| Abstract Visual Reasoning | ARC-AGI 1 | Accuracy (Pass@2) | 60.4 | 15 |
| Abstract Visual Reasoning | ARC-AGI 2 | Accuracy (Pass@2) | 11.1 | 14 |
| Visual Reasoning | ARC 1.0 (test) | Accuracy | 54.5 | 9 |
| Visual Reasoning | ARC-2 1.0 (test) | Accuracy | 8.3 | 7 |