
Improvise, Adapt, Overcome -- Telescopic Adapters for Efficient Fine-tuning of Vision Language Models in Medical Imaging

About

Adapting Vision Language Segmentation Models (VLSMs) to medical imaging domains requires significant computational overhead when using conventional fine-tuning approaches. Existing Parameter-Efficient Fine-Tuning (PEFT) methods apply uniform adapter dimensions across all transformer layers, leading to suboptimal parameter allocation and reduced adaptation efficiency. We introduce Telescopic Adapters, a novel PEFT framework that employs depth-aware scaling to progressively increase adapter capacity from shallow to deep transformer layers. Our method integrates lightweight bottleneck modules within CLIPSeg's vision and text encoders, with adapter dimensions dynamically scaled based on layer depth and semantic relevance. Using only 613k trainable parameters (244x fewer than end-to-end fine-tuning), Telescopic Adapters achieve superior performance across five diverse medical datasets spanning polyp segmentation, skin lesion detection, and breast ultrasound imaging. Comprehensive ablation studies demonstrate that deeper layers require substantially more adaptation capacity than shallow layers, validating our telescopic scaling hypothesis. Our approach establishes a new paradigm for efficient medical VLSM fine-tuning, enabling deployment in resource-constrained clinical environments while maintaining competitive segmentation accuracy.
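The core idea, depth-aware bottleneck sizing, can be sketched in a few lines. This is a hypothetical illustration, not the authors' code: the linear growth schedule, the base/top bottleneck widths, and the hidden size of 512 are all assumptions chosen for the example; the abstract states only that adapter capacity grows with layer depth and that the full budget is about 613k trainable parameters.

```python
# Hypothetical sketch of "telescopic" depth-aware adapter sizing.
# Assumptions (not from the paper): a linear schedule from a base
# width of 4 to a top width of 64 over 12 layers, hidden size 512.

def telescopic_dims(num_layers=12, base=4, top=64):
    """Bottleneck width per layer, growing linearly with depth."""
    step = (top - base) / (num_layers - 1)
    return [round(base + i * step) for i in range(num_layers)]

def adapter_param_count(hidden=512, dims=None):
    """Total parameters of the down/up projections (with biases)
    across all per-layer bottleneck adapters."""
    dims = dims if dims is not None else telescopic_dims()
    total = 0
    for r in dims:
        down = hidden * r + r        # project hidden -> r
        up = r * hidden + hidden     # project r -> hidden
        total += down + up
    return total

print(telescopic_dims())       # shallow layers get narrow bottlenecks
print(adapter_param_count())   # parameter budget under these assumptions
```

Under this sketch, shallow layers receive only a few bottleneck dimensions while deep layers receive an order of magnitude more, mirroring the ablation finding that deeper layers need substantially more adaptation capacity.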

Ujjwal Mishra, Vinita Shukla, Praful Hambarde, Amit Shukla • 2025

Related benchmarks

Task                       | Dataset           | Metric           | Result | Rank
Medical Image Segmentation | BUSI (test)       | Dice             | 65.9   | 121
Binary Segmentation        | Kvasir-SEG (test) | DSC              | 0.8979 | 67
Image Segmentation         | ISIC 2016 (test)  | Dice Coefficient | 92.18  | 40
Semantic Segmentation      | BKAI (test)       | DSC              | 88.38  | 13
Semantic Segmentation      | ClinicDB (test)   | DSC (%)          | 91.67  | 13
