
Hyperspherical Autoencoder for High-Fidelity Image Reconstruction and Generation

About

Recent studies have explored using pretrained Vision Foundation Models (VFMs) such as DINO for generative autoencoders, showing strong generative performance. Unfortunately, existing approaches often suffer from limited reconstruction fidelity due to the loss of high-frequency details. In this work, we present the Hyperspherical Autoencoder (HAE), a framework that bridges semantic representation and pixel-level reconstruction. Our key insight is that while semantic information in contrastive representations is primarily directional, enforcing strict magnitude matching hinders the preservation of fine-grained details. To address this, we introduce a Directional Feature Alignment objective that enforces semantic consistency while allowing flexible feature magnitudes for detail retention, alongside a Hierarchical Convolutional Patch Embedding module to enhance local structure preservation. Furthermore, observing that SSL-based representations intrinsically lie on a hypersphere, we employ Riemannian Flow Matching to train a Diffusion Transformer (DiT) directly on this spherical latent manifold. Notably, our manifold-aware DiT exhibits highly efficient convergence, achieving an exceptional gFID of 1.96 alongside a reconstruction rFID of 0.78 and a PSNR of 25.2 dB, validating the advantages of our manifold-aware approach.
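
The abstract names two ideas without giving formulas: an alignment objective that depends on feature direction but not magnitude, and flow matching along paths that stay on a hypersphere. The sketch below illustrates both under stated assumptions. It assumes cosine distance as the directional objective and the great-circle (geodesic) interpolant as the flow matching path. Neither choice is confirmed by the text above, and the function names are hypothetical, not taken from the authors' code.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _normalize(v):
    n = math.sqrt(_dot(v, v))
    return [x / n for x in v]


def directional_alignment_loss(feature, target):
    """1 - cosine similarity: penalizes direction mismatch, ignores magnitude.

    Assumption: cosine distance stands in for the paper's objective.
    """
    return 1.0 - _dot(_normalize(feature), _normalize(target))


def geodesic_point_and_velocity(x0, x1, t):
    """Point and time derivative at time t on the great-circle arc x0 -> x1.

    Assumes x0 and x1 are unit vectors that are neither identical nor
    antipodal, so the angle between them is well defined and sin(theta) != 0.
    """
    theta = math.acos(max(-1.0, min(1.0, _dot(x0, x1))))
    s = math.sin(theta)
    # Spherical linear interpolation weights and their derivatives in t.
    a, b = math.sin((1 - t) * theta) / s, math.sin(t * theta) / s
    da = -theta * math.cos((1 - t) * theta) / s
    db = theta * math.cos(t * theta) / s
    x_t = [a * p + b * q for p, q in zip(x0, x1)]
    v_t = [da * p + db * q for p, q in zip(x0, x1)]
    return x_t, v_t


# Scaling a feature leaves the directional loss unchanged.
loss_same_direction = directional_alignment_loss([3.0, 0.0], [1.0, 0.0])

# The interpolated point stays on the unit sphere, and the velocity is
# tangent to the sphere at that point.
x_t, v_t = geodesic_point_and_velocity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], 0.5)
```

In a flow matching setup of this kind, a network would be trained to regress `v_t` given `x_t` and `t`. Because the target velocity is tangent to the sphere, integrating it keeps samples on the manifold up to numerical error.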

Hun Chang, Byunghee Cha, Jong Chul Ye • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Image Generation | ImageNet 256x256 | IS 209.7 | 517 |
| Image Generation | ImageNet 256x256 (test) | FID 3.07 | 83 |
| Image Reconstruction | ImageNet 256x256 (val) | rFID 0.37 | 53 |
| Image Reconstruction | ImageNet-256 (test) | rFID 0.37 | 8 |
