TPCL: Task Progressive Curriculum Learning for Robust Visual Question Answering

About

Visual Question Answering (VQA) systems are notoriously brittle under distribution shifts and data scarcity. While previous solutions, such as ensemble methods and data augmentation, can improve performance in isolation, they fail to generalise well across in-distribution (IID), out-of-distribution (OOD), and low-data settings simultaneously. We argue that this limitation stems from the suboptimal training strategies employed. Specifically, treating all training samples uniformly, without accounting for question difficulty or semantic structure, leaves the models vulnerable to dataset biases, so they struggle to generalise beyond the training distribution. To address this issue, we introduce Task-Progressive Curriculum Learning (TPCL), a simple, model-agnostic framework that progressively trains VQA models using a curriculum built by jointly considering question type and difficulty. TPCL first groups questions by semantic type (e.g., yes/no, counting) and then orders them using a novel Optimal Transport-based difficulty measure. Without relying on data augmentation or explicit debiasing, TPCL improves generalisation across IID, OOD, and low-data regimes and achieves state-of-the-art performance on VQA-CP v2, VQA-CP v1, and VQA v2. It outperforms the most competitive robust VQA baselines by over 5% and 7% on VQA-CP v2 and v1, respectively, and boosts backbone performance by up to 28.5%.
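The two-step recipe in the abstract (group by question type, then order by difficulty) can be illustrated with a minimal Python sketch. The abstract does not specify the Optimal Transport-based measure, so everything here is an assumption for illustration only: the `qtype` field, the function names, the easy-to-hard cumulative schedule, and the use of a 1-D Wasserstein distance between per-type loss samples as a stand-in difficulty score. It is not the authors' implementation.

```python
from collections import defaultdict


def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two equal-size 1-D empirical samples.

    For equal-size samples with uniform weights, the optimal transport plan
    matches sorted values, so W1 is the mean absolute difference of the
    sorted samples.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("samples must be non-empty and of equal size")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def group_by_type(samples):
    """Group samples by their question type (hypothetical 'qtype' field)."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample["qtype"]].append(sample)
    return dict(groups)


def build_curriculum(samples, difficulty):
    """Return (qtype, group) pairs ordered by ascending difficulty score.

    `difficulty` maps each question type to a score. Treating a lower score
    as "easier" is an assumption of this sketch.
    """
    groups = group_by_type(samples)
    order = sorted(groups, key=lambda qtype: difficulty[qtype])
    return [(qtype, groups[qtype]) for qtype in order]


def progressive_stages(curriculum):
    """Yield (newly added qtype, cumulative training pool) for each stage.

    The cumulative schedule is an assumption; the abstract only states that
    training is progressive.
    """
    pool = []
    for qtype, group in curriculum:
        pool = pool + group
        yield qtype, list(pool)
```

One hypothetical way to obtain the scores is to compare per-type loss samples from two training checkpoints, for example `difficulty = {t: wasserstein_1d(early_losses[t], late_losses[t]) for t in early_losses}`, and then train stage by stage over `progressive_stages(build_curriculum(samples, difficulty))`.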

Ahmed Akl, Abdelwahed Khamis, Zhe Wang, Ali Cheraghian, Sara Khalifa, Kewen Wang • 2024

Related benchmarks

Task	Dataset	Metric	Result	
Visual Question Answering	VQA v2 (val)	Accuracy	78.42	144
Visual Question Answering	VQA-CP v2 (test)	Overall Accuracy	77.23	128
Visual Question Answering	VQA-CP v1 (test)	Accuracy (Overall)	76.78	33
Visual Question Answering	VQA-CP v2	Overall Accuracy	77.23	16
Visual Question Answering	VQA v2	Overall Accuracy	78.42	15
