Adapting Self-Supervised Representations as a Latent Space for Efficient Generation

About

We introduce Representation Tokenizer (RepTok), a generative modeling framework that represents an image using a single continuous latent token obtained from self-supervised vision transformers. Building on a pre-trained SSL encoder, we fine-tune only the semantic token embedding and pair it with a generative decoder trained jointly using a standard flow matching objective. This adaptation enriches the token with low-level, reconstruction-relevant details, enabling faithful image reconstruction. To preserve the favorable geometry of the original SSL space, we add a cosine-similarity loss that regularizes the adapted token, ensuring the latent space remains smooth and suitable for generation. Our single-token formulation resolves spatial redundancies of 2D latent spaces and significantly reduces training costs. Despite its simplicity and efficiency, RepTok achieves competitive results on class-conditional ImageNet generation and naturally extends to text-to-image synthesis, reaching competitive zero-shot performance on MS-COCO under extremely limited training budgets. Our findings highlight the potential of fine-tuned SSL representations as compact and effective latent spaces for efficient generative modeling.

Ming Gui, Johannes Schusterbauer, Timy Phan, Felix Krause, Josh Susskind, Miguel Angel Bautista, Bj\"orn Ommer• 2025

Related benchmarks

Task	Dataset	Result
Class-conditional Image Generation	ImageNet 256x256 (test)	FID1.88	223
Image Reconstruction	ImageNet 256x256	rFID1.85	202
Image Reconstruction	ImageNet-1K 1.0 (val)	rFID1.85	42
Class-conditional Image Generation	ImageNet-1K 1.0 (val)	gFID1.88	12

Showing 4 of 4 rows

Other info

Follow for update

@wizwand_team Discord