PHyCLIP: $\ell_1$-Product of Hyperbolic Factors Unifies Hierarchy and Compositionality in Vision-Language Representation Learning
About
Vision-language models have achieved remarkable success in multi-modal representation learning from large-scale pairs of visual scenes and linguistic descriptions. However, they still struggle to simultaneously express two distinct types of semantic structures: the hierarchy within a concept family (e.g., dog $\preceq$ mammal $\preceq$ animal) and the compositionality across different concept families (e.g., "a dog in a car" $\preceq$ dog, car). Recent works have addressed this challenge by employing hyperbolic space, which efficiently captures tree-like hierarchy, yet its suitability for representing compositionality remains unclear. To resolve this dilemma, we propose PHyCLIP, which employs an $\ell_1$-Product metric on a Cartesian product of Hyperbolic factors. With our design, intra-family hierarchies emerge within individual hyperbolic factors, and cross-family composition is captured by the $\ell_1$-product metric, analogous to a Boolean algebra. Experiments on zero-shot classification, retrieval, hierarchical classification, and compositional understanding tasks demonstrate that PHyCLIP outperforms existing single-space approaches and offers more interpretable structures in the embedding space.
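To make the $\ell_1$-product metric concrete, the following is a minimal sketch of a distance on a Cartesian product of hyperbolic factors: the per-factor geodesic distances are simply summed. The abstract does not say which hyperbolic model, curvature, or factor dimensions PHyCLIP uses, so the unit Poincaré ball (curvature $-1$) is an assumption here, and the function names `poincare_distance` and `l1_product_distance` are hypothetical, chosen only for illustration.

```python
import numpy as np


def poincare_distance(u, v):
    """Geodesic distance in the unit Poincare ball (curvature -1).

    Assumption: the Poincare ball is used purely for illustration;
    the abstract does not specify the hyperbolic model.
    u, v: arrays whose last axis holds coordinates with norm < 1.
    """
    sq_diff = np.sum((u - v) ** 2, axis=-1)
    denom = (1.0 - np.sum(u ** 2, axis=-1)) * (1.0 - np.sum(v ** 2, axis=-1))
    return np.arccosh(1.0 + 2.0 * sq_diff / denom)


def l1_product_distance(x, y):
    """l1-product distance: sum of the per-factor hyperbolic distances.

    x, y: arrays of shape (k, d), one row per hyperbolic factor
    (hypothetical layout: k factors, each of dimension d).
    """
    return float(np.sum(poincare_distance(x, y)))


# Two points in a product of k = 2 factors, each of dimension d = 2.
x = np.zeros((2, 2))
y_one = np.array([[0.5, 0.0], [0.0, 0.0]])   # differs from x in factor 0 only
y_both = np.array([[0.5, 0.0], [0.0, 0.5]])  # differs from x in both factors

d_one = l1_product_distance(x, y_one)
d_both = l1_product_distance(x, y_both)
print(d_one, d_both)
```

Because the distances add across factors, a point that moves away from the origin in two factors is exactly twice as far as one that moves the same amount in a single factor. This additivity is what lets each factor carry its own hierarchy while the sum combines contributions from different concept families.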
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Classification | CIFAR-100 | -- | 691 |
| Image Classification | EuroSAT | -- | 569 |
| Image Classification | Food-101 | -- | 542 |
| Image Classification | DTD | Accuracy 25.5 | 485 |
| Text-to-Image Retrieval | Flickr30k (test) | -- | 445 |
| Image Classification | SUN397 | Accuracy 55.32 | 441 |
| Classification | Cars | -- | 395 |
| Image-to-Text Retrieval | Flickr30k (test) | -- | 392 |
| Image Classification | RESISC45 | -- | 349 |
| Image Classification | CUB | Accuracy 15.9 | 282 |