
DynamicVis: Dynamic Visual Perception for Efficient Remote Sensing Foundation Models

About

The advancement of remote sensing (RS) technology has enabled high-resolution Earth observation; however, interpreting these images with modern vision foundation models (VFMs) remains a significant challenge. Unlike object-centric natural images, RS imagery is fundamentally characterized by extreme target sparsity and massive spatial redundancy. Key objects of interest (e.g., ships, vehicles) often occupy less than 1% of the spatial extent, surrounded by vast, target-free backgrounds. Existing VFMs predominantly rely on uniform dense processing (e.g., ViTs) and pixel-reconstruction pre-training paradigms (e.g., MAE). These approaches inherently waste substantial computational capacity on modeling redundant backgrounds and inadvertently dilute the feature representations of small, sparse targets. To bridge this structural misalignment, we propose DynamicVis, a visual foundation model explicitly tailored to the sparse nature of RS imagery. Architecturally, DynamicVis introduces a Dynamic Region-Aware SSM that bypasses uniform computation. It adaptively routes and incrementally models only task-relevant, high-salience tokens while employing parameter-free integration of background context, drastically reducing the complexity of processing ultra-long 2D token sequences (~100,000). Crucially, to equip the network with robust spatial-selection capabilities, we propose a novel Region-Level Meta-Embedding Multi-Instance Learning (MIL) pre-training paradigm. Trained on a million-scale dataset, this paradigm explicitly disentangles sparse foreground instances from dense backgrounds in the latent semantic space, overcoming the semantic ambiguity of conventional pixel-reconstruction methods. Extensive evaluations across nine diverse downstream tasks reveal that DynamicVis exhibits exceptional efficacy, particularly dominating in sparse-target and instance-level perception tasks (e.g., small object detection and change detection).
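The core routing idea — send only a small fraction of high-salience tokens through the expensive sequence model and summarize the rest with a parameter-free operation — can be sketched as follows. This is a simplified illustration, not the paper's actual method: the salience score here (token L2 norm) and the `keep_ratio` parameter are assumptions for demonstration, whereas DynamicVis learns its region-aware scoring.

```python
import numpy as np

def dynamic_token_routing(tokens, keep_ratio=0.1):
    """Hypothetical sketch of salience-based token selection:
    route the top-k most salient tokens to the heavy backbone
    and collapse the rest into one parameter-free context vector.
    """
    n, d = tokens.shape
    k = max(1, int(n * keep_ratio))
    # Salience proxy: L2 norm of each token embedding (an assumption;
    # the abstract does not specify the scoring function).
    salience = np.linalg.norm(tokens, axis=1)
    keep_idx = np.argsort(salience)[-k:]      # indices of the top-k tokens
    mask = np.zeros(n, dtype=bool)
    mask[keep_idx] = True
    foreground = tokens[mask]                 # processed by the SSM backbone
    background = tokens[~mask].mean(axis=0)   # parameter-free mean-pool context
    return foreground, background, keep_idx

# With ~1% target coverage, only 10 of 1,000 tokens take the expensive path.
tokens = np.random.default_rng(0).standard_normal((1000, 64))
fg, bg, idx = dynamic_token_routing(tokens, keep_ratio=0.01)
```

The quadratic (or even linear) cost of the backbone then applies only to the selected tokens, which is what makes sequences on the order of 100,000 tokens tractable.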

Keyan Chen, Chenyang Liu, Bowen Chen, Wenyuan Li, Zhengxia Zou, Shijian Lu, Zhenwei Shi · 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Change Detection | LEVIR-CD (test) | F1 Score | 92.32 | 485 |
| Change Detection | WHU-CD (test) | IoU | 89.85 | 372 |
| Road Extraction | Massachusetts | mIoU | 67.2 | 41 |
| Change Detection | OSCD (test) | F1 Score | 60.25 | 31 |
| Object Detection | LEVIR-Ship (test) | AP50 | 84.1 | 31 |
| Building Extraction | WHU dataset | F1 Score | 95.58 | 28 |
| Scene Classification | UC Merced | Precision | 99.12 | 22 |
| Scene Classification | AID | Precision | 96.4 | 22 |
| Instance Segmentation | NWPU VHR-10 | APmask | 67.8 | 18 |
| Instance Segmentation | SSDD | APmask | 71 | 18 |

Showing 10 of 16 rows
