OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction

About

Vision-Language-Action (VLA) models aim to predict robotic actions based on visual observations and language instructions. Existing approaches require fine-tuning pre-trained vision-language models (VLMs) as visual and language features are independently fed into downstream policies, degrading the pre-trained semantic alignments. We propose OTTER, a novel VLA architecture that leverages these existing alignments through explicit, text-aware visual feature extraction. Instead of processing all visual features, OTTER selectively extracts and passes only task-relevant visual features that are semantically aligned with the language instruction to the policy transformer. This allows OTTER to keep the pre-trained vision-language encoders frozen. Thereby, OTTER preserves and utilizes the rich semantic understanding learned from large-scale pre-training, enabling strong zero-shot generalization capabilities. In simulation and real-world experiments, OTTER significantly outperforms existing VLA models, demonstrating strong zero-shot generalization to novel objects and environments. Video, code, checkpoints, and dataset: https://ottervla.github.io/.

Huang Huang, Fangchen Liu, Letian Fu, Tingfan Wu, Mustafa Mukadam, Jitendra Malik, Ken Goldberg, Pieter Abbeel • 2025
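
The text-aware visual feature extraction described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes that frozen vision and language encoders have already produced per-patch and per-token feature vectors in a shared embedding space, and it uses a softmax over cosine similarities as one plausible way to keep only instruction-relevant visual content. The names `text_aware_pool`, `l2_normalize`, `softmax`, and the `temperature` value are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def softmax(scores, temperature):
    """Numerically stable softmax over temperature-scaled scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def text_aware_pool(patch_feats, text_feats, temperature=0.1):
    """Hypothetical sketch: for each text token, return a similarity-weighted
    average of the (normalized) patch features.

    patch_feats: list of patch feature vectors from a frozen vision encoder.
    text_feats:  list of token feature vectors from a frozen text encoder.
    Nothing here is learned, so the encoders stay untouched.
    """
    patches = [l2_normalize(p) for p in patch_feats]
    tokens = [l2_normalize(t) for t in text_feats]
    dim = len(patches[0])
    pooled = []
    for tok in tokens:
        # Cosine similarity between this token and every image patch.
        sims = [sum(a * b for a, b in zip(tok, p)) for p in patches]
        # Patches that align with the token receive most of the weight.
        weights = softmax(sims, temperature)
        pooled.append(
            [sum(w * p[d] for w, p in zip(weights, patches)) for d in range(dim)]
        )
    return pooled


# Toy example: three 2-D "patches" and one "token" aligned with the second patch.
patch_feats = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
text_feats = [[0.0, 2.0]]
pooled = text_aware_pool(patch_feats, text_feats)
```

In this toy case the pooled vector is dominated by the second patch, the one most similar to the token. In a full model, vectors like these (one per text token) would be what the downstream policy transformer receives instead of the complete set of patch features.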

Related benchmarks

Task	Dataset	Result
Robotic Manipulation	LIBERO	Spatial Success Rate: 84	570
Robot Manipulation	LIBERO 50 demos per task	Spatial Success Rate: 84	17
Robot Manipulation Task Extrapolation	LIBERO goal (out-of-distribution)	Success Rate: 32	16
Multi-task Robotic Manipulation	LIBERO 90 tasks	Easy Success Rate: 96.4	10
Robot Manipulation	LIBERO-Spatial OOD	Average Success Rate: 11	10
Cup Handle Grasping	Real-World Cup Handle Grasping Distractor	Success Rate: 35	5
Cup Wall Grasping	Real-World Cup Wall Grasping Distractor	Success Rate: 50	5
Long-horizon manipulation	Real-World Long Horizon Task	Open Drawer Success Rate: 80	5
Push Cube	Real-World Push Cube Original	Success Rate: 45	5
Push Cube	Real-World Push Cube Cluttered	Success Rate: 5	5

Showing 10 of 26 rows
