What's Left? Concept Grounding with Logic-Enhanced Foundation Models

About

Recent works such as VisProg and ViperGPT have smartly composed foundation models for visual reasoning, using large language models (LLMs) to produce programs that can be executed by pre-trained vision-language models. However, they operate in limited domains, such as 2D images, and do not fully exploit the generalization of language: abstract concepts like "left" can also be grounded in 3D, temporal, and action data, as in moving to your left. This limited generalization stems from the inability of these inference-only methods to learn or adapt pre-trained models to a new domain. We propose the Logic-Enhanced Foundation Model (LEFT), a unified framework that learns to ground and reason with concepts across domains with a differentiable, domain-independent, first-order logic-based program executor. LEFT has an LLM interpreter that outputs a program represented in a general, logic-based reasoning language, which is shared across all domains and tasks. LEFT's executor then executes the program with trainable domain-specific grounding modules. We show that LEFT flexibly learns concepts in four domains: 2D images, 3D scenes, human motions, and robotic manipulation. It exhibits strong reasoning ability in a wide variety of tasks, including those that are complex and not seen during training, and can be easily applied to new domains.
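To make the split between a shared logic executor and domain-specific grounding modules concrete, here is a minimal sketch in plain Python. It is not the LEFT implementation: the names `red`, `left`, `soft_and`, `soft_exists`, the toy features `x` and `redness`, and the min/max semantics for conjunction and existential quantification are all assumptions chosen for illustration. In a trained system the grounding functions would be learnable networks, and the program would come from the LLM interpreter instead of being written by hand.

```python
import math
from typing import Dict, List

Obj = Dict[str, float]  # hypothetical toy object: a bag of named features


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical domain-specific grounding modules. Here they are fixed
# functions of toy features; a learned system would train them per domain.
def red(o: Obj) -> float:
    # Soft truth value of "o is red".
    return sigmoid(10.0 * (o["redness"] - 0.5))


def left(a: Obj, b: Obj) -> float:
    # Soft truth value of "a is left of b", from x-coordinates.
    return sigmoid(10.0 * (b["x"] - a["x"]))


# Domain-independent soft logic operators (min/max semantics, assumed here).
def soft_and(p: float, q: float) -> float:
    return min(p, q)


def soft_exists(values: List[float]) -> float:
    return max(values) if values else 0.0


def exists_red_left_of(scene: List[Obj], anchor: Obj) -> float:
    # Evaluates the program: exists x. red(x) AND left(x, anchor)
    return soft_exists(
        [soft_and(red(x), left(x, anchor)) for x in scene if x is not anchor]
    )


scene = [
    {"x": 0.0, "redness": 0.9},
    {"x": 1.0, "redness": 0.1},
    {"x": 2.0, "redness": 0.2},
]

score_right = exists_red_left_of(scene, scene[2])  # a red object lies to its left
score_left = exists_red_left_of(scene, scene[0])   # nothing lies to its left
print(score_right, score_left)
```

Only `red` and `left` depend on the domain in this sketch; the operators and the program structure would stay the same if the objects were 3D boxes or motion segments with different grounding functions.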

Joy Hsu, Jiayuan Mao, Joshua B. Tenenbaum, Jiajun Wu • 2023

Related benchmarks

Task	Dataset	Metric	Result
3D Visual Grounding	Sr3D (test)	Overall Accuracy	62	73
Visual Question Answering	CLEVR 1.0 (test)	Overall Accuracy	99.6	46
Visual Question Answering	CLEVR-Humans 1.0 (test)	Accuracy	78.8	22
Abstract Reasoning	CLEVR-RPM (test)	Accuracy	100	7
Multi-step Reasoning	CLEVR-Puzzle (test)	Accuracy	92	7
Abstract Visual Reasoning	CLEVR-RPM	Accuracy	100	6
Visual Reasoning	CLEVR-Puzzles	Accuracy	92	6
Object Localization	ReaSCAN (test)	Success Rate A1	97.8	6
Referring Expressions	CLEVR-Ref (test)	Accuracy	100	5
Referring Expression Grounding	CLEVR-Ref	Accuracy	100	4

Showing 10 of 18 rows
