TETO: Tracking Events with Teacher Observation for Motion Estimation and Frame Interpolation
About
Event cameras capture per-pixel brightness changes with microsecond resolution, providing continuous motion information that is lost between RGB frames. However, existing event-based motion estimators depend on large-scale synthetic data, which often suffers from a significant sim-to-real gap. We propose TETO (Tracking Events with Teacher Observation), a teacher-student framework that learns event-based motion estimation from only ~25 minutes of unannotated real-world recordings through knowledge distillation from a pretrained RGB tracker. Our motion-aware data curation and query sampling strategy maximizes learning from limited data by disentangling object motion from dominant ego-motion. The resulting estimator jointly predicts point trajectories and dense optical flow, which we use as explicit motion priors to condition a pretrained video diffusion transformer for frame interpolation. We achieve state-of-the-art point tracking on EVIMO2 and optical flow on DSEC using orders of magnitude less training data, and demonstrate that accurate motion estimation translates directly into superior frame interpolation quality on BS-ERGB and HQ-EVFI.
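The abstract does not spell out how query points are chosen, so the following is only an illustrative sketch of what "disentangling object motion from dominant ego-motion" could look like during query sampling, not the authors' procedure. It assumes that teacher displacements are available for a set of candidate points, that the per-axis median displacement is a reasonable stand-in for ego-motion, and that queries are drawn with probability proportional to the residual motion. The function name `sample_motion_aware_queries` and its parameters are hypothetical.

```python
import math
import random
import statistics


def sample_motion_aware_queries(points, displacements, num_queries, eps=1e-3, seed=0):
    """Draw query points, favouring those that move differently from the scene.

    points:        list of (x, y) candidate query locations.
    displacements: list of (u, v) teacher displacements, one per point.
    num_queries:   number of queries to draw (with replacement).
    eps:           small floor so static points keep a non-zero probability.
    """
    if len(points) != len(displacements):
        raise ValueError("points and displacements must have the same length")

    # Assumption: the dominant (ego) motion is approximated by the per-axis median.
    med_u = statistics.median(u for u, _ in displacements)
    med_v = statistics.median(v for _, v in displacements)

    # Residual motion magnitude after removing the dominant component.
    weights = [math.hypot(u - med_u, v - med_v) + eps for u, v in displacements]

    rng = random.Random(seed)
    return rng.choices(points, weights=weights, k=num_queries)


# Nine background points share the camera-induced motion; one point moves independently.
points = [(x, 0) for x in range(10)]
displacements = [(5.0, 0.0)] * 9 + [(5.0, 8.0)]
queries = sample_motion_aware_queries(points, displacements, num_queries=50)
```

In this toy setup most of the sampled queries fall on the independently moving point `(9, 0)`, while the background points are still sampled occasionally because of the `eps` floor.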
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Video Frame Interpolation | BS-ERGB | LPIPS | 0.0684 | 17 |
| Optical Flow | DSEC All | EPE | 1.39 | 12 |
| Optical Flow | DSEC interlaken_00_b | EPE | 2.13 | 12 |
| Optical Flow | DSEC interlaken_01_a | EPE | 1.51 | 12 |
| Optical Flow | DSEC thun_01_a | EPE | 1.04 | 12 |
| Optical Flow | DSEC thun_01_b | EPE | 1.12 | 12 |
| Optical Flow | DSEC zurich_city_12_a | EPE | 1.06 | 12 |
| Optical Flow | DSEC zurich_city_14_c | EPE | 1.24 | 12 |
| Optical Flow | DSEC zurich_city_15_a | EPE | 1.37 | 12 |
| Video Frame Interpolation | HQ-EVFI dynamic motion subset | FID | 17.39 | 8 |
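For readers unfamiliar with the optical flow metric reported for DSEC, endpoint error (EPE) is the Euclidean distance between predicted and ground-truth flow vectors, averaged over the evaluated pixels. The snippet below is a minimal sketch of that definition in plain Python; the function name `endpoint_error` and the optional `valid` mask are illustrative choices, and the official DSEC evaluation may apply additional masking or conventions not shown here.

```python
import math


def endpoint_error(pred, gt, valid=None):
    """Mean Euclidean distance between predicted and ground-truth flow vectors.

    pred, gt: sequences of (u, v) displacements in pixels, one entry per pixel.
    valid:    optional sequence of booleans; pixels marked False are skipped.
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same length")
    if valid is None:
        valid = [True] * len(pred)

    errors = [
        math.hypot(pu - gu, pv - gv)
        for (pu, pv), (gu, gv), ok in zip(pred, gt, valid)
        if ok
    ]
    if not errors:
        raise ValueError("no valid pixels to evaluate")
    return sum(errors) / len(errors)


# Three pixels with per-pixel errors of 0, 2 and 5 pixels.
pred = [(1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
gt = [(1.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
epe = endpoint_error(pred, gt)
epe_masked = endpoint_error(pred, gt, valid=[True, True, False])
```

Lower values are better: an EPE of 1.39 means predicted flow vectors are, on average, 1.39 pixels away from the ground truth.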