
Learning Long-term Motion Embeddings for Efficient Kinematics Generation

About

Understanding and predicting motion is a fundamental component of visual intelligence. Although modern video models exhibit strong comprehension of scene dynamics, exploring multiple possible futures through full video synthesis remains prohibitively inefficient. We model scene dynamics orders of magnitude more efficiently by directly operating on a long-term motion embedding that is learned from large-scale trajectories obtained from tracker models. This enables efficient generation of long, realistic motions that fulfill goals specified via text prompts or spatial pokes. To achieve this, we first learn a highly compressed motion embedding with a temporal compression factor of 64x. In this space, we train a conditional flow-matching model to generate motion latents conditioned on task descriptions. The resulting motion distributions outperform those of both state-of-the-art video models and specialized task-specific approaches.
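
The abstract names two concrete pieces: a trajectory embedding with 64x temporal compression, and a conditional flow-matching model trained on top of it. The sketch below is a rough illustration of how those pieces could fit together, not the paper's implementation; it assumes a PyTorch-style setup, and the names (`TrajectoryAutoencoder`, `velocity_net`, `cond`), the channel width, and the strided 1-D convolution design are all assumptions rather than details from the paper.

```python
# A minimal sketch of the two components described above (illustrative, not the authors' code).
import torch
import torch.nn as nn

class TrajectoryAutoencoder(nn.Module):
    """Compress (batch, 2, T) xy point tracks 64x along time with six stride-2 convs."""
    def __init__(self, in_ch: int = 2, width: int = 256, depth: int = 6):
        super().__init__()
        enc, ch = [], in_ch
        for _ in range(depth):                 # 2^6 = 64x temporal downsampling
            enc += [nn.Conv1d(ch, width, kernel_size=4, stride=2, padding=1), nn.GELU()]
            ch = width
        self.encoder = nn.Sequential(*enc)
        dec = []
        for i in range(depth):                 # mirror the encoder: six 2x upsamplings
            out = in_ch if i == depth - 1 else width
            dec += [nn.ConvTranspose1d(ch, out, kernel_size=4, stride=2, padding=1)]
            if i < depth - 1:
                dec += [nn.GELU()]
            ch = out
        self.decoder = nn.Sequential(*dec)

    def forward(self, tracks: torch.Tensor):
        z = self.encoder(tracks)               # (batch, width, T // 64) motion latent
        return self.decoder(z), z

def flow_matching_loss(velocity_net, z1, cond):
    """Conditional flow matching on motion latents: regress the constant velocity
    of a straight-line path from Gaussian noise z0 to a data latent z1."""
    z0 = torch.randn_like(z1)                            # noise endpoint
    t = torch.rand(z1.shape[0], 1, 1, device=z1.device)  # per-sample time in [0, 1)
    zt = (1.0 - t) * z0 + t * z1                         # point on the linear path
    pred_v = velocity_net(zt, t, cond)                   # cond = text or poke embedding
    return ((pred_v - (z1 - z0)) ** 2).mean()
```

At sampling time, an ODE solver would integrate the learned velocity field from noise to a motion latent, which the decoder then maps back to dense tracks; this is a generic rectified-flow recipe, and the paper's actual conditioning and sampler may differ.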

Nick Stracke, Kolja Bauer, Stefan Andreas Baumann, Miguel Angel Bautista, Josh Susskind, Björn Ommer • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Robot Manipulation | LIBERO | Goal Achievement | 82.3 | 700 |
| Motion Generation | Pexels (held-out) | Min MSE | 21.29 | 9 |
| Text-conditioned trajectory prediction | LIBERO-90 | Side MSE | 5.96 | 8 |
| Text-conditioned trajectory prediction | LIBERO-10 | Side MSE | 7.43 | 8 |
| Poked Motion Generation | Pexels Dense | Min MSE | 30.4 | 3 |
| Dense Track Prediction | DAVIS 2017 | Min MSE | 155.1 | 2 |
| Dense Track Prediction | PhysicsIQ (solid mechanics) | Min MSE | 90.6 | 2 |
| Poked Motion Generation | Pexels 1 Poke | Min MSE | 41 | 2 |
| Poked Motion Generation | Pexels 2 Pokes | Min MSE | 40.9 | 2 |
| Poked Motion Generation | Pexels 4 Pokes | Min MSE | 35.8 | 2 |

Showing 10 of 11 rows.

Other info

GitHub
