
DynamicVis: Dynamic Visual Perception for Efficient Remote Sensing Foundation Models

About

The advancement of remote sensing (RS) technology has enabled high-resolution Earth observation; however, interpreting these images with modern vision foundation models (VFMs) remains a significant challenge. Unlike object-centric natural images, RS imagery is fundamentally characterized by extreme target sparsity and massive spatial redundancy. Key objects of interest (e.g., ships, vehicles) often occupy less than 1% of the spatial extent, surrounded by vast, target-free backgrounds. Existing VFMs predominantly rely on uniform dense processing (e.g., ViTs) and pixel-reconstruction pre-training paradigms (e.g., MAE). These approaches inherently waste substantial computational capacity on modeling redundant backgrounds and inadvertently dilute the feature representations of small, sparse targets. To bridge this structural misalignment, we propose DynamicVis, a visual foundation model explicitly tailored to the sparse nature of RS imagery. Architecturally, DynamicVis introduces a Dynamic Region-Aware SSM that bypasses uniform computation. It adaptively routes and incrementally models only task-relevant, high-salience tokens while employing parameter-free integration of background context, drastically reducing the complexity of processing ultra-long 2D token sequences (~100,000). Crucially, to equip the network with robust spatial-selection capabilities, we propose a novel Region-Level Meta-Embedding Multi-Instance Learning (MIL) pre-training paradigm. Trained on a million-scale dataset, this paradigm explicitly disentangles sparse foreground instances from dense backgrounds in the latent semantic space, overcoming the semantic ambiguity of conventional pixel-reconstruction methods. Extensive evaluations across nine diverse downstream tasks reveal that DynamicVis exhibits exceptional efficacy, particularly dominating in sparse-target and instance-level perception tasks (e.g., small object detection and change detection).
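The core routing idea — send only a small fraction of high-salience tokens through the expensive sequence model and summarize the rest with a parameter-free operation — can be sketched as follows. This is a simplified illustration, not the paper's actual method: the salience score here (token L2 norm) and the `keep_ratio` parameter are assumptions for demonstration, whereas DynamicVis learns its region-aware scoring.

```python
import numpy as np

def dynamic_token_routing(tokens, keep_ratio=0.1):
    """Hypothetical sketch of salience-based token selection:
    route the top-k most salient tokens to the heavy backbone
    and collapse the rest into one parameter-free context vector.
    """
    n, d = tokens.shape
    k = max(1, int(n * keep_ratio))
    # Salience proxy: L2 norm of each token embedding (an assumption;
    # the abstract does not specify the scoring function).
    salience = np.linalg.norm(tokens, axis=1)
    keep_idx = np.argsort(salience)[-k:]      # indices of the top-k tokens
    mask = np.zeros(n, dtype=bool)
    mask[keep_idx] = True
    foreground = tokens[mask]                 # processed by the SSM backbone
    background = tokens[~mask].mean(axis=0)   # parameter-free mean-pool context
    return foreground, background, keep_idx

# With ~1% target coverage, only 10 of 1,000 tokens take the expensive path.
tokens = np.random.default_rng(0).standard_normal((1000, 64))
fg, bg, idx = dynamic_token_routing(tokens, keep_ratio=0.01)
```

The quadratic (or even linear) cost of the backbone then applies only to the selected tokens, which is what makes sequences on the order of 100,000 tokens tractable.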

Keyan Chen, Chenyang Liu, Bowen Chen, Wenyuan Li, Zhengxia Zou, Shijian Lu, Zhenwei Shi · 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Change Detection | LEVIR-CD (test) | F1 Score | 92.32 | 485 |
| Change Detection | WHU-CD (test) | IoU | 89.85 | 372 |
| Road Extraction | Massachusetts | mIoU | 67.2 | 41 |
| Change Detection | OSCD (test) | F1 Score | 60.25 | 31 |
| Object Detection | LEVIR-Ship (test) | AP50 | 84.1 | 31 |
| Building Extraction | WHU dataset | F1 Score | 95.58 | 28 |
| Scene Classification | UC Merced | Precision | 99.12 | 22 |
| Scene Classification | AID | Precision | 96.4 | 22 |
| Instance Segmentation | NWPU VHR-10 | APmask | 67.8 | 18 |
| Instance Segmentation | SSDD | APmask | 71 | 18 |

Showing 10 of 16 rows
