
Task adaptation of Vision-Language-Action model: 1st Place Solution for the 2025 BEHAVIOR Challenge

About

We present a vision-language-action policy that won 1st place in the 2025 BEHAVIOR Challenge, a large-scale benchmark featuring 50 diverse long-horizon household tasks in photo-realistic simulation that require bimanual manipulation, navigation, and context-aware decision making. Building on the Pi0.5 architecture, we introduce several innovations. Our primary contribution is correlated noise for flow matching, which improves training efficiency and enables correlation-aware inpainting for smooth action sequences. We also apply learnable mixed-layer attention and System 2 stage tracking for ambiguity resolution. Training employs multi-sample flow matching to reduce variance, while inference uses action compression and challenge-specific correction rules. Our approach achieves a 26% q-score across all 50 tasks on both the public and private leaderboards.
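To make the "correlated noise for flow matching" idea concrete: instead of drawing i.i.d. Gaussian noise independently at every timestep of the action chunk, the noise can be sampled with temporal correlation, so that nearby timesteps receive similar perturbations and denoised action sequences come out smooth. The sketch below uses a Gaussian-process draw with an RBF kernel over the horizon; the kernel choice and length scale are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def correlated_noise(horizon, action_dim, length_scale=2.0, rng=None):
    """Sample flow-matching noise that is correlated across timesteps.

    Each action dimension's noise trajectory is drawn from a zero-mean
    Gaussian with an RBF covariance over the horizon, so adjacent
    timesteps are strongly correlated while marginal variances stay ~1.
    (Kernel form and length scale are illustrative assumptions.)
    """
    rng = rng or np.random.default_rng()
    t = np.arange(horizon)
    # RBF covariance over timesteps; small jitter keeps Cholesky stable.
    cov = np.exp(-0.5 * ((t[:, None] - t[None, :]) / length_scale) ** 2)
    L = np.linalg.cholesky(cov + 1e-6 * np.eye(horizon))
    # One correlated trajectory per action dimension.
    return L @ rng.standard_normal((horizon, action_dim))

noise = correlated_noise(horizon=16, action_dim=7)
print(noise.shape)  # (16, 7)
```

Because consecutive timesteps share most of their noise, replacing the tail of one chunk with the head of the next (the inpainting step) no longer stitches together statistically unrelated samples, which is what makes the correlation-aware inpainting mentioned above possible.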

Ilia Larchenko, Gleb Zarin, Akash Karnatak • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Robot Manipulation | LIBERO | - | - | 494 |
| Robotic Manipulation | Meta-World | Average Success Rate | 7.1 | 27 |
| Robotic Manipulation | RoboCasa | Average Success Rate | 13.2 | 22 |
| Robotic Manipulation | RoboMimic | Success Rate | 24 | 8 |
| Robot Learning | BEHAVIOR 2025 (private) | Binary Success | 12.4 | 5 |
| Robot Learning | BEHAVIOR 2025 (public) | Binary Success | 11.2 | 5 |

Other info

GitHub
