Controlla: Learning Controllability via Graph-Constrained Latent Geometry

About

Controllable multimodal generation is commonly formulated as an inference-time conditioning problem using prompts, guidance, or auxiliary modules. While effective, such approaches do not explicitly structure how semantic attributes evolve, which can lead to identity drift and inconsistent cross-modal behavior. We propose Controlla, a modular factorized-control framework that treats controllability as a property of structured latent geometry. Controlla learns identity and attribute factors from multimodal inputs and aligns them with graph priors using graph-constrained optimal transport, encouraging attributes to follow graph-consistent trajectories while preserving reference identity. To evaluate this setting, we construct AffectHuman-43K, a leakage-aware multimodal benchmark for reference-grounded affective control, and introduce geometry-aware metrics for trajectory consistency and latent disentanglement. Experiments show consistent improvements in controllability, identity preservation, and cross-modal alignment, with additional analyses on graph sensitivity, extensibility, and robustness.
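The abstract names graph-constrained optimal transport but this page does not give its formulation. As a rough illustration of the general idea only, the sketch below runs entropic optimal transport (Sinkhorn iterations) between two distributions over attribute nodes, using shortest-path distance on a graph as the ground cost. The chain graph, the distributions `a` and `b`, the regularization `eps`, and the function names are all hypothetical choices made for this example and are not taken from the paper.

```python
import numpy as np


def graph_shortest_paths(adj):
    """All-pairs shortest-path lengths (Floyd-Warshall) from a weighted adjacency matrix."""
    n = adj.shape[0]
    # Missing edges (zeros in adj) start at infinite distance; self-distance is zero.
    dist = np.where(adj > 0, adj, np.inf).astype(float)
    np.fill_diagonal(dist, 0.0)
    for k in range(n):
        # Relax every pair (i, j) through intermediate node k.
        dist = np.minimum(dist, dist[:, k:k + 1] + dist[k:k + 1, :])
    return dist


def sinkhorn(cost, a, b, eps=1.0, n_iters=200):
    """Entropy-regularized OT plan between histograms a and b for a given cost matrix."""
    K = np.exp(-cost / eps)          # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)            # rescale to match column marginals
        u = a / (K @ v)              # rescale to match row marginals
    return u[:, None] * K * v[None, :]


# Hypothetical attribute graph: a 4-node chain 0 - 1 - 2 - 3 (assumption, not from the paper).
adj = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

dist = graph_shortest_paths(adj)

# Hypothetical source and target distributions over the attribute nodes.
a = np.array([0.7, 0.1, 0.1, 0.1])
b = np.array([0.1, 0.1, 0.1, 0.7])

plan = sinkhorn(dist, a, b)
expected_cost = float((plan * dist).sum())
print(plan.round(3))
print("expected transport cost:", round(expected_cost, 3))
```

Because the cost is the graph distance, moving mass between distant nodes is penalized more than moving it between neighbours, which is one simple way a graph prior can shape how attribute mass is reassigned. The paper's actual objective may differ substantially.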

Jamuna S. Murthy, Amin Karimi Monsefi, Rajiv Ramnath • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Controllable Image Generation and Editing | CelebA-HQ (test) | Accuracy | 73.6 | 20 |
| Facial Image Editing | AffectNet | Accuracy | 72.8 | 20 |
| Human Image Controllability and Editing | AffectHuman-43K (test) | Accuracy | 76.4 | 20 |
| Controllable Image Generation | AffectHuman-43K (val) | Accuracy | 77.6 | 6 |
| Controllable Image Generation | AffectHuman-43K (test) | Accuracy | 76.4 | 6 |
| Emotion Recognition | AffectNet | Accuracy | 72.8 | 5 |
| Identity Preservation | CelebA-HQ | Identity Score | 0.868 | 5 |
| Geometry-aware evaluation | AffectHuman-43K (val) | ID Score | 87.4 | 4 |
| Geometry-aware evaluation | AffectHuman-43K (test) | ID | 0.862 | 4 |
| Image-to-Audio Retrieval | AffectHuman-43K (test) | R@1 | 74.1 | 3 |
Showing 10 of 11 rows
