Controlla: Learning Controllability via Graph-Constrained Latent Geometry

About

Controllable multimodal generation is commonly formulated as an inference-time conditioning problem using prompts, guidance, or auxiliary modules. While effective, such approaches do not explicitly structure how semantic attributes evolve, which can lead to identity drift and inconsistent cross-modal behavior. We propose Controlla, a modular factorized-control framework that treats controllability as a property of structured latent geometry. Controlla learns identity and attribute factors from multimodal inputs and aligns them with graph priors using graph-constrained optimal transport, encouraging attributes to follow graph-consistent trajectories while preserving reference identity. To evaluate this setting, we construct AffectHuman-43K, a leakage-aware multimodal benchmark for reference-grounded affective control, and introduce geometry-aware metrics for trajectory consistency and latent disentanglement. Experiments show consistent improvements in controllability, identity preservation, and cross-modal alignment, with additional analyses on graph sensitivity, extensibility, and robustness.
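The abstract names graph-constrained optimal transport without giving details. As an illustration only, and not the authors' method, the following minimal sketch shows one way a graph prior could enter an entropic optimal transport problem. It assumes that each sample has an attribute latent `z`, that each graph node has a prototype vector, and that the transport cost adds a shortest-path penalty weighted by `lam`. All function names, parameters, and the cost form are hypothetical.

```python
import numpy as np

def shortest_path_lengths(adj):
    """All-pairs shortest paths (Floyd-Warshall) for a connected graph.

    adj is a square matrix whose positive entries are edge weights.
    """
    dist = np.where(adj > 0, adj, np.inf).astype(float)
    np.fill_diagonal(dist, 0.0)
    for k in range(dist.shape[0]):
        # Relax every pair (i, j) through intermediate node k.
        dist = np.minimum(dist, dist[:, [k]] + dist[[k], :])
    return dist

def sinkhorn_plan(cost, a, b, eps=0.5, n_iters=200):
    """Entropic OT plan between marginals a and b via Sinkhorn iterations."""
    K = np.exp(-cost / eps)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

def graph_constrained_plan(z, prototypes, src_nodes, adj, lam=1.0, eps=0.5):
    """Transport plan from samples to graph nodes (hypothetical cost form).

    Assumed cost: squared latent distance to each node prototype plus
    lam times the graph distance from the sample's current node.
    """
    latent_cost = ((z[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
    graph_cost = shortest_path_lengths(adj)[src_nodes]  # shape (n, m)
    cost = latent_cost + lam * graph_cost
    cost = cost / cost.max()  # normalise so exp(-cost / eps) stays well scaled
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # uniform mass over samples
    b = np.full(m, 1.0 / m)  # uniform mass over graph nodes
    return sinkhorn_plan(cost, a, b, eps=eps)

# Toy usage: a 3-node path graph 0 - 1 - 2 and four 2-D latents.
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
rng = np.random.default_rng(0)
z = rng.normal(size=(4, 2))
prototypes = rng.normal(size=(3, 2))
plan = graph_constrained_plan(z, prototypes, np.array([0, 0, 1, 2]), adj)
```

Under these assumptions, a larger `lam` makes it more expensive to move mass to nodes that are far, in graph distance, from a sample's current node, which is one way to read "graph-consistent trajectories".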

Jamuna S. Murthy, Amin Karimi Monsefi, Rajiv Ramnath • 2026

Related benchmarks

Task	Dataset	Metric	Value
Controllable Image Generation and Editing	CelebA-HQ (test)	Accuracy	73.6	20
Facial Image Editing	AffectNet	Accuracy	72.8	20
Human Image Controllability and Editing	AffectHuman-43K (test)	Accuracy	76.4	20
Controllable Image Generation	AffectHuman-43K (val)	Accuracy	77.6	6
Controllable Image Generation	AffectHuman-43K (test)	Accuracy	76.4	6
Emotion Recognition	AffectNet	Accuracy	72.8	5
Identity Preservation	CelebA-HQ	Identity Score	0.868	5
Geometry-aware evaluation	AffectHuman-43K (val)	ID Score	87.4	4
Geometry-aware evaluation	AffectHuman-43K (test)	ID	0.862	4
Image-to-Audio Retrieval	AffectHuman-43K (test)	R@1	74.1	3

Showing 10 of 11 rows
