HAC: Parameter-Efficient Hyperbolic Adaptation of CLIP for Zero-Shot VQA

About

Recent advances in representation learning have shown that hyperbolic geometry can offer a more expressive alternative to the Euclidean embeddings used in CLIP models, capturing hierarchical structures and leading to better-organized representations. However, current hyperbolic CLIP variants are trained entirely from scratch, which is computationally expensive and resource-intensive. In this work, we propose HAC (Hyperbolic Adaptation of CLIP), a parameter-efficient framework that enables pretrained CLIP models to transition into hyperbolic space via lightweight fine-tuning. We apply HAC to Visual Question Answering (VQA), where models must interpret visual elements and align them with textual queries. Notably, HAC's training is performed on a dataset with no overlap with any VQA benchmark, resulting in a strict zero-shot evaluation paradigm that underscores HAC's task-agnostic adaptability. We evaluate HAC across a diverse suite of VQA benchmarks spanning General, Reasoning, and OCR categories. Both HAC-S (small) and HAC-B (medium) consistently surpass Euclidean baselines and prior hyperbolic approaches, with HAC-B delivering up to a +1.9 point average improvement over CLIP-B on reasoning-intensive tasks. Our code is available at https://github.com/fdibiton/HAC
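To make "transition into hyperbolic space" concrete, the following is a minimal, generic sketch of lifting a Euclidean embedding onto the Poincaré ball with the exponential map at the origin, then measuring geodesic distance there. It is not taken from the HAC code. The abstract does not say which hyperbolic model, curvature, or adapter design HAC uses, so the Poincaré ball, the curvature parameter `c`, and the function names `expmap0` and `poincare_distance` are all assumptions made for illustration.

```python
import math


def expmap0(v, c=1.0):
    """Exponential map at the origin of the Poincare ball (curvature -c).

    Illustrative only: maps a Euclidean vector v to a point strictly
    inside the ball of radius 1/sqrt(c).
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    sqrt_c = math.sqrt(c)
    scale = math.tanh(sqrt_c * norm) / (sqrt_c * norm)
    return [scale * x for x in v]


def poincare_distance(x, y, c=1.0):
    """Geodesic distance between two points inside the Poincare ball."""
    def sq_norm(u):
        return sum(a * a for a in u)

    diff = sq_norm([a - b for a, b in zip(x, y)])
    denom = (1.0 - c * sq_norm(x)) * (1.0 - c * sq_norm(y))
    return math.acosh(1.0 + 2.0 * c * diff / denom) / math.sqrt(c)


# Hypothetical 2-D stand-ins for an image and a text embedding.
image_point = expmap0([0.3, 0.4])
text_point = expmap0([0.1, -0.2])
image_text_distance = poincare_distance(image_point, text_point)
```

Under this convention, the distance from the origin to `expmap0(v)` is twice the Euclidean norm of `v`, so embeddings with larger norm sit nearer the boundary of the ball. A practical implementation would also clip points away from the boundary for numerical stability, which this sketch omits.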

Francesco Dibitonto, Cigdem Beyan, Vittorio Murino • 2026

Related benchmarks

Task	Dataset	Result
Visual Question Answering	ScienceQA	Accuracy 40.4	446
Visual Question Answering	AI2D	Accuracy 26.1	317
Visual Question Answering	RealworldQA	Accuracy 38.6	259
Visual Question Answering	A-OKVQA	Accuracy 49.8	228
Visual Question Answering	MMStar	Accuracy 31.4	100
Visual Question Answering	SEED-Bench	Accuracy 45.6	22

