Neuro-Inspired Inverse Learning for Planning and Control
About
We present a neuro-inspired framework for embodied planning and control. It builds on three principles that enable fast and highly effective goal-directed behavior in the mammalian brain: paired forward/inverse internal models, open-loop multi-step motor commands, and sequential, hierarchical organization of action. Our Inverter framework uses learned components, trained end-to-end through Inverse Learning (IL) and supplemented where natural by analytic or algorithmic modules. We formalize IL and delineate it from supervised, reinforcement, and imitation learning. IL bridges two approaches: Reinforcement Learning (RL)-style amortization, which runs in a single forward pass but emits only one action at a time, and Optimal Control (OC)-style sequence planning, which plans over whole trajectories but requires iterative test-time computation. Single Inverters or hierarchical n=2 Inverter stacks match or improve on offline-RL and diffusion-planner baselines on all 3 maze2d and 6 antmaze D4RL variants by an average of +24.2% (range -1.9% to +78.2%), with one to two orders of magnitude less inference compute time. Distinctively, optimizing through the Forward Model (FoM) over the entire T-step action sequence, rather than per step, lets Inverters produce smooth, goal-coherent, trajectory-wide structure and reach control policies closer to the analytic optimum than the policy underlying the training data itself. We also identify a failure mode of IL, FoM hacking under narrow training-data coverage, which we mitigate by using random training data with broader coverage. As an application example, a Pulse Inverter synthesizes arbitrary single-qubit quantum gates with fidelity matching the standard iterative numerical baseline (GRAPE), at more than 1000x lower per-gate compute time. We conclude that IL enables a versatile class of world-interfaces, especially for latency- and resource-critical embodied AI.
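The core idea in the abstract, training an inverse model by optimizing a whole T-step action sequence through a frozen forward model, can be illustrated with a toy sketch. The code below is not the authors' architecture: the 2-D point environment, the linear forward model fitted by least squares, the linear inverter, the effort penalty `lam`, and all names (`true_step`, `inverter`, `rollout`) are assumptions chosen only to keep the example small and runnable.

```python
import numpy as np

rng = np.random.default_rng(0)
D, T, DT = 2, 8, 0.25  # state/action dim, horizon, step size (all assumed)

def true_step(s, a):
    """Toy environment (assumed): a point moved by velocity commands."""
    return s + DT * a

# 1) Fit a linear forward model s' ~ s @ A.T + a @ B.T from random transitions.
S = rng.uniform(-1, 1, (1000, D))
U = rng.uniform(-1, 1, (1000, D))
theta, *_ = np.linalg.lstsq(np.hstack([S, U]), true_step(S, U), rcond=None)
A, B = theta[:D].T, theta[D:].T  # frozen from here on

def features(s0, g):
    # Inverter input: start state, goal, and a constant bias term.
    return np.hstack([s0, g, np.ones((len(s0), 1))])

def inverter(W, s0, g):
    """Single pass: (start, goal) -> the whole T-step action sequence."""
    return (features(s0, g) @ W).reshape(len(s0), T, D)

def rollout(s0, actions):
    """Open-loop rollout of the action sequence through the forward model."""
    s = s0
    for t in range(T):
        s = s @ A.T + actions[:, t] @ B.T
    return s

# 2) Train the inverter on a trajectory-level loss:
#    ||s_T - goal||^2 + lam * sum_t ||a_t||^2, backpropagated through the FoM.
W = np.zeros((2 * D + 1, T * D))
lam, lr = 1e-3, 0.5
for _ in range(500):
    s0 = rng.uniform(-1, 1, (256, D))
    g = rng.uniform(-1, 1, (256, D))
    acts = inverter(W, s0, g)
    grad_s = 2.0 * (rollout(s0, acts) - g)  # d loss / d s_T
    grad_a = np.empty_like(acts)
    for t in reversed(range(T)):  # backward pass through the forward model
        grad_a[:, t] = grad_s @ B + 2.0 * lam * acts[:, t]
        grad_s = grad_s @ A
    W -= lr * features(s0, g).T @ grad_a.reshape(len(s0), -1) / len(s0)

# 3) Evaluate on the true environment with unseen start/goal pairs.
s0 = rng.uniform(-1, 1, (5, D))
g = rng.uniform(-1, 1, (5, D))
acts = inverter(W, s0, g)
s = s0
for t in range(T):
    s = true_step(s, acts[:, t])
goal_error = np.linalg.norm(s - g, axis=1).max()
print(f"max goal error: {goal_error:.4f}")
```

At test time the inverter emits the full action sequence in one pass, with no iterative optimization. The loss is defined on the final state of the rollout rather than on each step, which is the trajectory-wide property the abstract emphasizes. A realistic setup would replace both linear maps with neural networks and use automatic differentiation instead of the hand-written backward pass.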
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Offline Reinforcement Learning | Maze2D medium v1 | Normalized Return | 166.8 | 30 |
| Offline Reinforcement Learning | Maze2D large v1 | Normalized Return | 220.7 | 30 |
| Planning and Control | maze2d-umaze v1 (100 episodes, 300 steps/ep) | Score | 165.2 | 16 |
| Offline Reinforcement Learning | AntMaze medium-play v2 | Average Score | 87.8 | 14 |
| Offline Reinforcement Learning | AntMaze Medium-Diverse v2 | Average Score | 0.965 | 14 |
| Offline Reinforcement Learning | AntMaze large-play v2 | D4RL Score | 0.93 | 11 |
| Offline Reinforcement Learning | AntMaze large-diverse v2 | D4RL Score | 94 | 11 |
| Offline Reinforcement Learning | AntMaze v2 | umaze Success Rate | 99.5 | 7 |
| Offline Reinforcement Learning | Antmaze v0 (test) | -- | -- | 5 |
| Locomotion | Antmaze u-umaze v2 | D4RL Score (%) | 99.5 | 2 |