Unlocking Slot Attention by Changing Optimal Transport Costs
About
Slot attention is a powerful method for object-centric modeling in images and videos. However, its set-equivariance limits its ability to handle videos with a dynamic number of objects because it cannot break ties. To overcome this limitation, we first establish a connection between slot attention and optimal transport. Based on this new perspective, we propose MESH (Minimize Entropy of Sinkhorn): a cross-attention module that combines the tiebreaking properties of unregularized optimal transport with the speed of regularized optimal transport. We evaluate slot attention using MESH on multiple object-centric learning benchmarks and find significant improvements over slot attention in every setting.
Yan Zhang, David W. Zhang, Simon Lacoste-Julien, Gertjan J. Burghouts, Cees G. M. Snoek • 2023
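The contrast the abstract draws between regularized and unregularized optimal transport can be illustrated with a small sketch. This is not the authors' MESH implementation. It is a generic log-domain Sinkhorn solver in NumPy, and the function names (`sinkhorn`, `entropy`), the uniform marginals, and the toy cost matrices are all assumptions made for illustration. It shows two things: an all-zero (fully tied) cost matrix yields a uniform transport plan, so the tie is not broken, and lowering the regularization strength `eps` gives a lower-entropy plan that is closer to a hard assignment.

```python
import numpy as np


def _logsumexp(x, axis):
    # Numerically stable log(sum(exp(x))) along one axis, keeping dims.
    m = np.max(x, axis=axis, keepdims=True)
    return m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True))


def sinkhorn(cost, eps=1.0, n_iters=200):
    """Entropy-regularized OT plan between uniform marginals (illustrative)."""
    n, m = cost.shape
    log_a = np.full((n, 1), -np.log(n))  # uniform row marginal (assumption)
    log_b = np.full((1, m), -np.log(m))  # uniform column marginal (assumption)
    log_k = -cost / eps                  # log of the Gibbs kernel
    f = np.zeros((n, 1))
    g = np.zeros((1, m))
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        f = log_a - _logsumexp(log_k + g, axis=1)
        g = log_b - _logsumexp(log_k + f, axis=0)
    return np.exp(log_k + f + g)


def entropy(plan):
    # Shannon entropy of the transport plan, skipping exact zeros.
    p = plan[plan > 0]
    return float(-np.sum(p * np.log(p)))


# Toy 3x3 problems: a fully tied cost and a cost with a clear best matching.
tied_cost = np.zeros((3, 3))
diag_cost = 1.0 - np.eye(3)

tied_plan = sinkhorn(tied_cost, eps=1.0)    # uniform plan: the tie remains
soft_plan = sinkhorn(diag_cost, eps=1.0)    # strongly regularized, blurry
sharp_plan = sinkhorn(diag_cost, eps=0.05)  # weakly regularized, near one-to-one

print(entropy(tied_plan), entropy(soft_plan), entropy(sharp_plan))
```

The printed entropies decrease from the tied plan to the sharp plan. Judging by its name, MESH targets this entropy directly, but how it does so is described in the paper and is not reproduced here.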
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Classification | PokerRules standard (test) | Task Accuracy | 99.93 | 6 |
| Image Classification | MM-A in-distribution (test) | Accuracy | 98.86 | 6 |
| Image Classification | MM-A out-of-distribution (OOD) | Task Accuracy | 18.26 | 6 |
| Classification | PokerRules Extrapolation: 5 cards (In-distribution class) | Task Accuracy | 37.8 | 5 |
| Image Classification | MM-A Extrapolation 4 digits | Task Accuracy | 37.5 | 5 |
| Image Classification | MM-A Extrapolation 5 digits | Task Accuracy | 12 | 5 |
| Addition | CLEVR-Addition (test) | Task Accuracy | 96.97 | 3 |
| Addition | CLEVR-Addition 7 objects (extrapolation) | Task Accuracy | 0.5 | 3 |