TALON: Token-Aligned Lightweight Adapters for 6-DoF Spacecraft Pose Estimation
About
Monocular 6-DoF spacecraft pose estimation methods predominantly process individual frames, discarding the temporal information present in image sequences acquired during spacecraft manoeuvres. The few temporal approaches that exist require either full backbone fine-tuning or auxiliary optical flow networks, which risk catastrophic forgetting or increase computational cost, respectively. We propose TALON (Token-Aligned Lightweight adapters for Orbital Navigation): spatiotemporal 3D adapters injected before the self-attention layers of a frozen vision transformer (ViT), combined with a patch-token alignment loss that geometrically grounds the adapted features to keypoint structure through a prototype-conditioned KL-divergence objective. Pre-attention placement allows the frozen attention to reason over temporally enriched tokens, achieving stronger performance with a single adapter per block than post-attention alternatives. The alignment loss shapes the intermediate representations so that each keypoint induces a spatially precise activation in the token field, while the framework adds parameters amounting to less than 5% of the frozen backbone. On the SPADES dataset, TALON reduces the pose error by 50% relative to the prior state of the art, and on the SwissCube dataset it surpasses the prior best by 21.8% in ADD-0.1d accuracy. Zero-shot sim-to-real cross-domain evaluation on SPARK real data reduces pose error by a factor of 4.7, and ablations characterise the role of adapter depth across in-domain and cross-domain settings.
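
To make the idea of a prototype-conditioned KL-divergence alignment objective more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the Gaussian target over the patch grid, the cosine similarity between tokens and a keypoint prototype, the softmax temperature, and all function names (`gaussian_target`, `alignment_kl`) are assumptions chosen for illustration. The sketch only shows how a loss of this kind can reward tokens near a keypoint for matching that keypoint's prototype.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a flat list of scores."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def gaussian_target(grid_h, grid_w, keypoint_rc, sigma=1.0):
    """Assumed target: a normalised Gaussian over patch centres,
    centred on the keypoint given in (row, col) patch-grid units."""
    kr, kc = keypoint_rc
    weights = [
        math.exp(-((r + 0.5 - kr) ** 2 + (c + 0.5 - kc) ** 2) / (2 * sigma ** 2))
        for r in range(grid_h)
        for c in range(grid_w)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def alignment_kl(tokens, prototype, target, temperature=0.1):
    """KL(target || pred), where pred is a softmax over the
    token-to-prototype cosine similarities (row-major patch order)."""
    sims = [cosine(tok, prototype) for tok in tokens]
    pred = softmax(sims, temperature)
    return sum(p * math.log(p / q) for p, q in zip(target, pred) if p > 0)


# Toy 2x2 patch grid: only the top-left token matches the prototype.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
target = gaussian_target(2, 2, keypoint_rc=(0.5, 0.5), sigma=0.5)
loss_aligned = alignment_kl(tokens, [1.0, 0.0], target)
loss_misaligned = alignment_kl(tokens, [0.0, 1.0], target)
```

In this toy case the loss is lower when the prototype matches the token at the keypoint location than when it matches the other tokens, which is the behaviour an alignment loss of this kind is meant to encourage.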
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| 6D Pose Estimation | SwissCube (test) | Acc (ADI-0.1d) Near | 83.1 | 10 |
| 6D Pose Estimation | SPARK real sequences | Translation Error (E_T^#) | 0.0223 | 7 |
| 6D Pose Estimation | SPARK synthetic cross-domain (test) | Translational Error (E_T^#) | 0.0062 | 5 |
| 6-DoF Pose Estimation | SPADES | Error Threshold (ET) | 0.0123 | 4 |
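
The accuracy figures above are thresholded at 0.1d, that is, 10% of the object diameter. The sketch below shows the commonly used definitions of ADD (distance between corresponding model points) and ADI (distance to the closest model point, used for symmetric objects), together with the thresholded accuracy. It is a generic illustration, not the evaluation code of any of the listed benchmarks; the function names and the toy point set are assumptions.

```python
import math


def transform(R, t, p):
    """Apply a rigid transform (3x3 rotation R, translation t) to point p."""
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))


def add_error(points, R_gt, t_gt, R_pred, t_pred):
    """ADD: mean distance between corresponding model points
    under the ground-truth and predicted poses."""
    dists = [
        math.dist(transform(R_gt, t_gt, p), transform(R_pred, t_pred, p))
        for p in points
    ]
    return sum(dists) / len(dists)


def adi_error(points, R_gt, t_gt, R_pred, t_pred):
    """ADI: mean distance from each ground-truth-posed point
    to the closest predicted-posed point."""
    pred = [transform(R_pred, t_pred, p) for p in points]
    dists = [
        min(math.dist(transform(R_gt, t_gt, p), q) for q in pred)
        for p in points
    ]
    return sum(dists) / len(dists)


def diameter(points):
    """Largest pairwise distance between model points."""
    return max(math.dist(p, q) for p in points for q in points)


def accuracy(errors, diam, frac=0.1):
    """Fraction of samples whose error is below frac * diameter."""
    return sum(e < frac * diam for e in errors) / len(errors)


# Toy example: a square model, predicted pose off by 90 degrees about z.
square = [(1.0, 1.0, 0.0), (-1.0, 1.0, 0.0), (-1.0, -1.0, 0.0), (1.0, -1.0, 0.0)]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rot_z_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
zero = (0.0, 0.0, 0.0)

add_val = add_error(square, identity, zero, rot_z_90, zero)
adi_val = adi_error(square, identity, zero, rot_z_90, zero)
diam = diameter(square)
```

Because the square is symmetric under a 90-degree rotation, the ADI error vanishes while the ADD error does not, which illustrates why the two metrics can give different accuracies for the same prediction.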