Task adaptation of Vision-Language-Action model: 1st Place Solution for the 2025 BEHAVIOR Challenge
About
We present a vision-action policy that won 1st place in the 2025 BEHAVIOR Challenge, a large-scale benchmark featuring 50 diverse long-horizon household tasks in photo-realistic simulation that require bimanual manipulation, navigation, and context-aware decision making. Building on the Pi0.5 architecture, we introduce several innovations. Our primary contribution is correlated noise for flow matching, which improves training efficiency and enables correlation-aware inpainting for smooth action sequences. We also apply learnable mixed-layer attention and System 2 stage tracking for ambiguity resolution. Training employs multi-sample flow matching to reduce variance, while inference uses action compression and challenge-specific correction rules. Our approach achieves a 26% q-score across all 50 tasks on both the public and private leaderboards.
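To make the idea of correlated noise for flow matching more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the covariance (a squared-exponential kernel over time steps blended with the identity), the blend weight `beta`, the `length_scale`, and all function names are hypothetical. The interpolation and velocity target follow one common flow-matching convention, which may differ in sign or direction from the one used in the paper.

```python
import numpy as np


def make_covariance(horizon, length_scale=3.0, beta=0.5):
    """Hypothetical temporal covariance: smooth kernel blended with identity.

    In practice the correlated part could be estimated from action data;
    a squared-exponential kernel is used here only as a stand-in.
    """
    t = np.arange(horizon)
    kernel = np.exp(-0.5 * ((t[:, None] - t[None, :]) / length_scale) ** 2)
    return beta * kernel + (1.0 - beta) * np.eye(horizon)


def sample_correlated_noise(rng, cov, batch, action_dim):
    """Draw noise of shape (batch, horizon, action_dim) with covariance `cov`
    along the time axis, independently for each action dimension."""
    chol = np.linalg.cholesky(cov)  # cov = chol @ chol.T
    white = rng.standard_normal((batch, cov.shape[0], action_dim))
    return np.einsum("ts,bsd->btd", chol, white)


def flow_matching_pair(rng, actions, cov):
    """Build one training pair (noisy input, time, velocity target).

    Assumed convention: x_tau = (1 - tau) * noise + tau * actions,
    so the target velocity d x_tau / d tau is actions - noise.
    """
    batch, _, action_dim = actions.shape
    noise = sample_correlated_noise(rng, cov, batch, action_dim)
    tau = rng.uniform(size=(batch, 1, 1))
    x_tau = (1.0 - tau) * noise + tau * actions
    target = actions - noise
    return x_tau, tau, target
```

A network trained on such pairs would regress `target` from `x_tau`, `tau`, and the observation. Because neighbouring time steps of the noise are correlated, the noisy action chunk stays temporally smooth, which is the property the abstract links to smoother inpainted action sequences.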
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Robot Manipulation | LIBERO | -- | -- | 494 |
| Robotic Manipulation | Meta-World | Average Success Rate | 7.1 | 27 |
| Robotic Manipulation | RoboCasa | Average Success Rate | 13.2 | 22 |
| Robotic Manipulation | RoboMimic | Success Rate | 24 | 8 |
| Robot Learning | BEHAVIOR 2025 (private) | Binary Success | 12.4 | 5 |
| Robot Learning | BEHAVIOR 2025 (public) | Binary Success | 11.2 | 5 |