UrbanAlign: Post-hoc Semantic Calibration for VLM-Human Preference Alignment
About
Vision-language models (VLMs) can describe urban scenes in rich detail, yet consistently fail to produce reliable human preference labels in domain-specific tasks such as safety assessment and aesthetic evaluation. The standard fix, fine-tuning or RLHF, requires large-scale annotations and model retraining. We ask a different question: can a frozen VLM be aligned with human preferences without modifying any weights? Our key insight is that VLMs are strong concept extractors but poor decision calibrators. We propose a three-stage post-hoc pipeline that exploits this asymmetry: (i) interpretable evaluation dimensions are automatically mined from consensus exemplars; (ii) an Observer-Debater-Judge chain extracts robust concept scores from the frozen VLM; and (iii) locally-weighted ridge regression on a hybrid manifold calibrates these scores to human ratings. Applied as UrbanAlign on Place Pulse 2.0, the framework reaches 72.2% accuracy (kappa=0.45) across six perception categories, outperforming all baselines by +11.0 pp and zero-shot VLM by +15.5 pp, with full interpretability and zero weight modification.
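Stage (iii) can be illustrated with a minimal sketch of locally-weighted ridge regression. This is not the authors' implementation. The Gaussian kernel, the `bandwidth` and `ridge` values, the function name `lwrr_predict`, and the use of raw concept-score vectors as features are all assumptions made for illustration, and the paper's hybrid manifold is not reproduced here.

```python
import numpy as np


def lwrr_predict(X_train, y_train, x_query, bandwidth=1.0, ridge=1e-2):
    """Predict a rating at x_query from a ridge fit weighted toward nearby samples.

    X_train: (n, d) concept scores (hypothetical features).
    y_train: (n,) human ratings.
    x_query: (d,) concept scores of the item to calibrate.
    """
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    x_query = np.asarray(x_query, dtype=float)

    # Gaussian kernel weights: training points close to the query count more.
    sq_dist = np.sum((X_train - x_query) ** 2, axis=1)
    weights = np.exp(-sq_dist / (2.0 * bandwidth ** 2))

    # Design matrix with an intercept column.
    A = np.hstack([np.ones((X_train.shape[0], 1)), X_train])

    # Ridge penalty on the slopes only; the intercept is left unpenalized.
    penalty = ridge * np.eye(A.shape[1])
    penalty[0, 0] = 0.0

    # Solve the weighted normal equations (A^T W A + penalty) beta = A^T W y.
    AtW = A.T * weights
    beta = np.linalg.solve(AtW @ A + penalty, AtW @ y_train)

    return float(np.concatenate([[1.0], x_query]) @ beta)


# Toy data (made up): ratings follow y = 2 * score + 1 exactly.
X_demo = [[0.0], [1.0], [2.0], [3.0]]
y_demo = [1.0, 3.0, 5.0, 7.0]
pred_demo = lwrr_predict(X_demo, y_demo, [0.5], bandwidth=1.0, ridge=1e-6)
```

Because a separate local model is fitted for each query, the calibration can adapt to different regions of the score space while the VLM itself stays frozen. The bandwidth controls how local each fit is, and the ridge term keeps the solve stable when few neighbors carry weight.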
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Urban Perception (Beautiful) | Place Pulse disjoint pool subset 2.0 | Accuracy (%, excl. equal) | 69.8 | 5 |
| Urban Perception (Boring) | Place Pulse disjoint pool 2.0 | Accuracy (%, excl. equal) | 70.2 | 5 |
| Urban Perception (Depressing) | Place Pulse disjoint pool 2.0 | Accuracy (%, excl. equal) | 68.2 | 5 |
| Urban Perception (Lively) | Place Pulse disjoint pool 2.0 | Accuracy (%, excl. equal) | 69.4 | 5 |
| Urban Perception (Safety) | Place Pulse disjoint pool 2.0 | Accuracy (%, excl. equal) | 81.6 | 5 |
| Urban Perception (Wealthy) | Place Pulse disjoint pool 2.0 | Accuracy (%, excl. equal) | 74 | 5 |
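
The metrics in this listing can be sketched as follows. The source does not define "excl. equal"; the code assumes it means dropping comparisons whose human label is a tie before scoring. The label names (`"left"`, `"right"`, `"equal"`) and the toy labels are hypothetical. The last lines check that the unweighted mean of the six per-category accuracies above comes out at the 72.2% quoted in the abstract, which suggests the headline figure is a plain average.

```python
from collections import Counter


def accuracy_excluding_equal(y_true, y_pred, equal_label="equal"):
    """Accuracy over pairs whose human label is not a tie (assumed definition)."""
    kept = [(t, p) for t, p in zip(y_true, y_pred) if t != equal_label]
    if not kept:
        raise ValueError("no non-equal pairs to score")
    return sum(t == p for t, p in kept) / len(kept)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(true_counts[k] * pred_counts[k] for k in true_counts) / n ** 2
    return (observed - expected) / (1.0 - expected)


# Toy labels (made up) for five image pairs.
human = ["left", "right", "equal", "left", "right"]
model = ["left", "left", "left", "left", "right"]
acc = accuracy_excluding_equal(human, model)

# Kappa on the same non-equal subset.
pairs = [(t, p) for t, p in zip(human, model) if t != "equal"]
kappa = cohen_kappa([t for t, _ in pairs], [p for _, p in pairs])

# Per-category accuracies as listed in the table.
per_category = {
    "Beautiful": 69.8, "Boring": 70.2, "Depressing": 68.2,
    "Lively": 69.4, "Safety": 81.6, "Wealthy": 74.0,
}
mean_accuracy = sum(per_category.values()) / len(per_category)
```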