
Frame2Freq: Spectral Adapters for Fine-Grained Video Understanding

About

Adapting image-pretrained backbones to video typically relies on time-domain adapters tuned to a single temporal scale. Our experiments show that these modules pick up static image cues and very fast flicker changes, while overlooking medium-speed motion. Capturing dynamics across multiple time scales is, however, crucial for fine-grained temporal analysis (e.g., distinguishing opening from closing a bottle). To address this, we introduce Frame2Freq, a family of frequency-aware adapters that perform spectral encoding during image-to-video adaptation of pretrained Vision Foundation Models (VFMs), improving fine-grained action recognition. Frame2Freq applies the Fast Fourier Transform (FFT) along the temporal axis and learns frequency-band-specific embeddings that adaptively highlight the most discriminative frequency ranges. Across five fine-grained activity recognition datasets, Frame2Freq outperforms prior parameter-efficient fine-tuning (PEFT) methods and even surpasses fully fine-tuned models on four of them. These results provide encouraging evidence that frequency analysis methods are a powerful tool for modeling temporal dynamics in image-to-video transfer. Code is available at https://github.com/th-nesh/Frame2Freq.
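To make the idea of spectral encoding along time concrete, here is a minimal NumPy sketch. It is an illustration only, not the Frame2Freq implementation: the function names `band_energies` and `reweight_bands`, the equal-width split into three bands, and the fixed per-band gains (standing in for learned embeddings) are all assumptions made for this example.

```python
import numpy as np

def band_energies(features, num_bands=3):
    """Mean spectral magnitude per frequency band.

    features: array of shape (T, D), one D-dim feature vector per frame.
    Returns an array of shape (num_bands, D).
    """
    spectrum = np.fft.rfft(features, axis=0)       # FFT along time: (T//2 + 1, D)
    magnitude = np.abs(spectrum)
    # Hypothetical choice: split the frequency bins into equal-width bands.
    bands = np.array_split(np.arange(magnitude.shape[0]), num_bands)
    return np.stack([magnitude[idx].mean(axis=0) for idx in bands])

def reweight_bands(features, band_gains):
    """Scale each frequency band by a gain, then return to the time domain.

    band_gains stands in for learned band-specific parameters (an assumption).
    """
    num_frames = features.shape[0]
    spectrum = np.fft.rfft(features, axis=0)
    bands = np.array_split(np.arange(spectrum.shape[0]), len(band_gains))
    for idx, gain in zip(bands, band_gains):
        spectrum[idx] *= gain
    return np.fft.irfft(spectrum, n=num_frames, axis=0)

# Three toy feature channels over 16 frames:
# a static cue, a medium-speed oscillation, and frame-to-frame flicker.
t = np.arange(16)
clip = np.stack([np.ones(16), np.sin(2 * np.pi * 4 * t / 16), (-1.0) ** t], axis=1)

energies = band_energies(clip)                 # shape (3 bands, 3 channels)
mid_only = reweight_bands(clip, [0, 1, 0])     # keep only the middle band
```

In this toy clip each channel concentrates its energy in a different band, so zeroing the low and high bands leaves only the medium-speed oscillation. A trained adapter would learn which bands to emphasize rather than use fixed gains.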

Thinesh Thiyakesan Ponbagavathi, Constantin Seibold, Alina Roitberg • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
Action Recognition | Something-Something v2 (val) | Top-1 Accuracy | 72.1 | 535
Action Recognition | Diving-48 (test) | Top-1 Acc | 92.2 | 81
Action Recognition | HRI-30 | Overall Accuracy | 89.8 | 26
Action Recognition | Drive&Act | Sym Acc | 77.1 | 24
Action Recognition | SS Full v2 | 1-shot Accuracy | 66.9 | 21
Action Recognition | IKEA ASM | Top-1 Accuracy | 78.1 | 11
