Controlla: Learning Controllability via Graph-Constrained Latent Geometry
About
Controllable multimodal generation is commonly formulated as an inference-time conditioning problem using prompts, guidance, or auxiliary modules. While effective, such approaches do not explicitly structure how semantic attributes evolve, which can lead to identity drift and inconsistent cross-modal behavior. We propose Controlla, a modular factorized-control framework that treats controllability as a property of structured latent geometry. Controlla learns identity and attribute factors from multimodal inputs and aligns them with graph priors using graph-constrained optimal transport, encouraging attributes to follow graph-consistent trajectories while preserving reference identity. To evaluate this setting, we construct AffectHuman-43K, a leakage-aware multimodal benchmark for reference-grounded affective control, and introduce geometry-aware metrics for trajectory consistency and latent disentanglement. Experiments show consistent improvements in controllability, identity preservation, and cross-modal alignment, with additional analyses on graph sensitivity, extensibility, and robustness.
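The abstract does not spell out how the graph-constrained optimal transport step is formulated. As a rough illustration only, the sketch below shows one common way a graph prior can enter a transport problem: shortest-path distances on an attribute graph serve as the ground cost, and an entropic (Sinkhorn) solver produces the transport plan. The function names, the chain-shaped graph, the regularization strength `eps`, and the choice of Sinkhorn iterations are all assumptions made for this example and are not taken from Controlla.

```python
import numpy as np

def shortest_path_costs(adj):
    """All-pairs shortest-path lengths (Floyd-Warshall) on a weighted adjacency matrix.

    Hypothetical helper: zero entries in `adj` are treated as missing edges.
    """
    n = adj.shape[0]
    dist = np.where(adj > 0, adj, np.inf).astype(float)
    np.fill_diagonal(dist, 0.0)
    for k in range(n):
        # Relax every pair (i, j) through intermediate node k.
        dist = np.minimum(dist, dist[:, [k]] + dist[[k], :])
    return dist

def sinkhorn(a, b, cost, eps=0.5, n_iters=500):
    """Entropic optimal transport between histograms a and b under `cost`.

    Assumed illustration, not the paper's solver. Returns the transport plan.
    """
    K = np.exp(-cost / eps)          # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)            # scale columns toward marginal b
        u = a / (K @ v)              # scale rows toward marginal a
    return u[:, None] * K * v[None, :]

# Hypothetical 4-node chain graph over attribute states: 0 - 1 - 2 - 3.
adj = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

cost = shortest_path_costs(adj)
a = np.array([0.7, 0.1, 0.1, 0.1])   # source attribute distribution (made up)
b = np.array([0.1, 0.1, 0.1, 0.7])   # target attribute distribution (made up)
plan = sinkhorn(a, b, cost)
```

Because the cost grows with graph distance, mass is cheaper to move between neighboring nodes than across the graph, which is one simple sense in which transport can be made graph-consistent.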
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Controllable Image Generation and Editing | CelebA-HQ (test) | Accuracy | 73.6 | 20 |
| Facial Image Editing | AffectNet | Accuracy | 72.8 | 20 |
| Human Image Controllability and Editing | AffectHuman-43K (test) | Accuracy | 76.4 | 20 |
| Controllable Image Generation | AffectHuman-43K (val) | Accuracy | 77.6 | 6 |
| Controllable Image Generation | AffectHuman-43K (test) | Accuracy | 76.4 | 6 |
| Emotion Recognition | AffectNet | Accuracy | 72.8 | 5 |
| Identity Preservation | CelebA-HQ | Identity Score | 0.868 | 5 |
| Geometry-aware evaluation | AffectHuman-43K (val) | ID Score | 87.4 | 4 |
| Geometry-aware evaluation | AffectHuman-43K (test) | ID | 0.862 | 4 |
| Image-to-Audio Retrieval | AffectHuman-43K (test) | R@1 | 74.1 | 3 |