Hyperspherical Autoencoder for High-Fidelity Image Reconstruction and Generation

About

Recent studies have explored using pretrained Vision Foundation Models (VFMs) such as DINO for generative autoencoders, showing strong generative performance. Unfortunately, existing approaches often suffer from limited reconstruction fidelity due to the loss of high-frequency details. In this work, we present the Hyperspherical Autoencoder (HAE), a framework that bridges semantic representation and pixel-level reconstruction. Our key insight is that while semantic information in contrastive representations is primarily directional, enforcing strict magnitude matching hinders the preservation of fine-grained details. To address this, we introduce a Directional Feature Alignment objective that enforces semantic consistency while allowing flexible feature magnitudes for detail retention, alongside a Hierarchical Convolutional Patch Embedding module to enhance local structure preservation. Furthermore, observing that SSL-based representations intrinsically lie on a hypersphere, we employ Riemannian Flow Matching to train a Diffusion Transformer (DiT) directly on this spherical latent manifold. Notably, our manifold-aware DiT exhibits highly efficient convergence, achieving an exceptional gFID of 1.96 alongside a reconstruction rFID of 0.78 and a PSNR of 25.2 dB, validating the advantages of our manifold-aware approach.
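
The two geometric ideas in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the exact form of either objective, so the cosine-based loss and the great-circle (slerp) path below are assumptions about what a direction-only alignment term and a flow-matching target on a hypersphere might look like. All function names are hypothetical, and the code uses only the Python standard library.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def directional_alignment_loss(z, target):
    """Assumed form: 1 - cosine similarity.
    Depends only on the direction of z, so its magnitude stays free."""
    return 1.0 - dot(z, target) / (norm(z) * norm(target))

def _angle(x0, x1):
    # Angle between two unit vectors, clamped for numerical safety.
    return math.acos(max(-1.0, min(1.0, dot(x0, x1))))

def slerp(x0, x1, t):
    """Point at time t on the great-circle arc from x0 to x1.
    Both inputs must be unit vectors and must not be antipodal."""
    omega = _angle(x0, x1)
    if omega < 1e-8:  # endpoints coincide
        return list(x0)
    s = math.sin(omega)
    a = math.sin((1.0 - t) * omega) / s
    b = math.sin(t * omega) / s
    return [a * p + b * q for p, q in zip(x0, x1)]

def geodesic_velocity(x0, x1, t):
    """Time derivative of slerp: a tangent vector at slerp(x0, x1, t).
    A flow-matching model on the sphere could regress onto this target."""
    omega = _angle(x0, x1)
    if omega < 1e-8:
        return [0.0 for _ in x0]
    s = math.sin(omega)
    a = -omega * math.cos((1.0 - t) * omega) / s
    b = omega * math.cos(t * omega) / s
    return [a * p + b * q for p, q in zip(x0, x1)]

# Rescaling a feature leaves the directional loss unchanged.
loss_same_direction = directional_alignment_loss([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])

# Halfway along the arc between two orthogonal unit vectors.
x_mid = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
v_mid = geodesic_velocity([1.0, 0.0], [0.0, 1.0], 0.5)
```

In this sketch the interpolated point stays on the unit sphere and the velocity is orthogonal to it, which is the property that distinguishes a manifold-aware path from a straight line between the two latents.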

Hun Chang, Byunghee Cha, Jong Chul Ye • 2026

Related benchmarks

Task	Dataset	Result
Image Generation	ImageNet 256x256	IS 209.7	606
Image Generation	ImageNet 256x256 (test)	FID 3.07	125
Image Reconstruction	ImageNet 256x256 (val)	rFID 0.37	53
Image Reconstruction	ImageNet-256 (test)	rFID 0.37	8
