OverLoCK: An Overview-first-Look-Closely-next ConvNet with Context-Mixing Dynamic Kernels

About

Top-down attention plays a crucial role in the human vision system, wherein the brain initially obtains a rough overview of a scene to discover salient cues (i.e., overview first), followed by a more careful finer-grained examination (i.e., look closely next). However, modern ConvNets remain confined to a pyramid structure that successively downsamples the feature map for receptive field expansion, neglecting this crucial biomimetic principle. We present OverLoCK, the first pure ConvNet backbone architecture that explicitly incorporates a top-down attention mechanism. Unlike pyramid backbone networks, our design features a branched architecture with three synergistic sub-networks: 1) a Base-Net that encodes low/mid-level features; 2) a lightweight Overview-Net that generates dynamic top-down attention through coarse global context modeling (i.e., overview first); and 3) a robust Focus-Net that performs finer-grained perception guided by top-down attention (i.e., look closely next). To fully unleash the power of top-down attention, we further propose a novel context-mixing dynamic convolution (ContMix) that effectively models long-range dependencies while preserving inherent local inductive biases even when the input resolution increases, addressing critical limitations in existing convolutions. Our OverLoCK exhibits a notable performance improvement over existing methods. For instance, OverLoCK-T achieves a Top-1 accuracy of 84.2%, significantly surpassing ConvNeXt-B while using only around one-third of the FLOPs/parameters. On object detection, our OverLoCK-S clearly surpasses MogaNet-B by 1% in AP^b. On semantic segmentation, our OverLoCK-T remarkably improves UniRepLKNet-T by 1.7% in mIoU. Code is publicly available at https://github.com/LMMMEng/OverLoCK.
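
The three-branch data flow described above can be illustrated with a toy sketch in plain Python. This is not the authors' implementation (that is in the linked repository). Every function name here (`base_net`, `overview_net`, `focus_net`, `toy_overlock`) is hypothetical. Average pooling and a simple multiplicative gate stand in for the real convolutional blocks and for ContMix, and inputs are assumed to be non-negative 2D grids with even side lengths.

```python
# Toy illustration only: plain-Python stand-ins, NOT the OverLoCK layers.
# All names below are hypothetical and chosen for this sketch.

def avg_pool2(x):
    """Downsample a 2D grid by averaging non-overlapping 2x2 blocks."""
    h, w = len(x) // 2, len(x[0]) // 2
    return [[(x[2 * i][2 * j] + x[2 * i][2 * j + 1]
              + x[2 * i + 1][2 * j] + x[2 * i + 1][2 * j + 1]) / 4.0
             for j in range(w)] for i in range(h)]

def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of a 2D grid by an integer factor."""
    return [[x[i // factor][j // factor] for j in range(len(x[0]) * factor)]
            for i in range(len(x) * factor)]

def base_net(image):
    # Stand-in for low/mid-level feature encoding: one 2x downsampling step.
    return avg_pool2(image)

def overview_net(base_feat):
    # Stand-in for coarse global context ("overview first"):
    # downsample further, then scale so the largest value becomes 1.
    coarse = avg_pool2(base_feat)
    peak = max(max(row) for row in coarse) or 1.0  # avoid division by zero
    return [[v / peak for v in row] for row in coarse]

def focus_net(base_feat, context):
    # Stand-in for top-down guidance ("look closely next"):
    # upsample the coarse context and use it to gate the finer features.
    gate = upsample_nearest(context, 2)
    return [[f * g for f, g in zip(f_row, g_row)]
            for f_row, g_row in zip(base_feat, gate)]

def toy_overlock(image):
    base = base_net(image)           # shared low/mid-level features
    context = overview_net(base)     # coarse top-down signal
    return focus_net(base, context)  # fine features modulated by context

# Usage: an 8x8 image with a bright 4x4 patch (4.0) on a dimmer background (2.0).
image = [[4.0 if (i < 4 and j < 4) else 2.0 for j in range(8)] for i in range(8)]
out = toy_overlock(image)
```

In this toy run the bright region passes through unchanged while the background is attenuated, which mirrors only the high-level idea of coarse context steering finer-grained processing.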

Meng Lou, Yizhou Yu • 2025

Related benchmarks

Task	Dataset	Metric	Result
Semantic segmentation	ADE20K (val)	mIoU	51.3	3089
Object Detection	COCO 2017 (val)	--	--	2930
Instance Segmentation	COCO 2017 (val)	APm	0.44	1304
Image Classification	ImageNet-1k (val)	Top-1 Accuracy	85.1	871
Object Detection	COCO	AP (Box)	53.9	186
Pathology Image Classification	BreakHis (test)	Top-1 Accuracy	97.98	46
Medical Image Classification	BUSI (test)	Accuracy	92.97	23
Fish Feeding Intensity Quantification	Fish Feeding Intensity Dataset	Accuracy	94.51	9
Medical Imaging Classification	Shenzhen public (test)	Accuracy	89.39	9
Medical Imaging Classification	Covid public (test)	Accuracy	97.89	9

Showing 10 of 12 rows
