
TRANSPORTER: Transferring Visual Semantics from VLM Manifolds

About

How do video understanding models arrive at their answers? Although current Vision Language Models (VLMs) reason over complex scenes with diverse objects, action performances, and scene dynamics, understanding and controlling their internal processes remains an open challenge. Motivated by recent advances in text-to-video (T2V) generative models, this paper introduces a logits-to-video (L2V) task alongside a model-independent approach, TRANSPORTER, to generate videos that capture the underlying rules behind VLMs' predictions. Given the high visual fidelity produced by T2V models, TRANSPORTER learns an optimal transport coupling to VLMs' semantically rich embedding spaces. In turn, logit scores define embedding directions for conditional video generation. TRANSPORTER generates videos that reflect caption changes over diverse object attributes, action adverbs, and scene context. Quantitative and qualitative evaluations across VLMs demonstrate that L2V offers a fidelity-rich, novel direction for model interpretability that has not been previously explored.
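The core idea in the abstract, coupling a T2V model's latent space to a VLM's embedding space via optimal transport, can be illustrated with a minimal entropic (Sinkhorn) OT sketch. This is an assumption-laden toy, not the paper's actual formulation: the embeddings are random stand-ins, the cost is squared Euclidean distance, and `sinkhorn_coupling` is a hypothetical helper name.

```python
import numpy as np

def sinkhorn_coupling(X, Y, eps=0.05, n_iters=500):
    """Entropic OT coupling between point clouds X (n, d) and Y (m, d).

    Returns P with shape (n, m): rows sum to 1/n, columns to 1/m.
    """
    n, m = X.shape[0], Y.shape[0]
    # Squared-Euclidean cost, normalized so the entropic scaling is stable.
    C = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1) ** 2
    C = C / C.max()
    K = np.exp(-C / eps)
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):  # alternating Sinkhorn scaling updates
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

# Toy example: couple 4 stand-in T2V latents with 4 stand-in VLM embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))   # pretend T2V latent codes
Y = rng.normal(size=(4, 8))   # pretend VLM embeddings
P = sinkhorn_coupling(X, Y)

# Barycentric map: each T2V latent is sent to a weighted mean of VLM
# embeddings; logit scores could then pick directions in this target space.
mapped = (P / P.sum(axis=1, keepdims=True)) @ Y
```

The coupling `P` is a soft matching; its row-normalized barycentric projection is one standard way to turn an OT plan into a point-to-point map between the two spaces.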

Alexandros Stergiou • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Embedding Similarity | VidChapters7M | Cosine Similarity: 3.28 | 6 |
| Feature Visualization | VidChapters7M | FVD: 142 | 3 |
| Visual Explanation Generation | VidChapters7M (test) | FVD: 105 | 3 |
