A Stepwise Distillation Learning Strategy for Non-differentiable Visual Programming Frameworks on Visual Reasoning Tasks

About

Recently, Visual Programming (VProg) has emerged as a significant framework for visual reasoning (VR) tasks due to its interpretability and cross-task generality. However, even with invoking powerful pre-trained Vision-Language models (VLMs) as visual sub-modules, the performance of VProg on specific VR tasks is markedly inferior compared to well-trained task-specific networks. Although invoking task-specific models can further enhance the performance of VProg on specific VR tasks, it greatly diminishes the cross-task generalization ability of VProg. Besides, the non-differentiable nature of VProg prevents direct fine-tuning on specific VR tasks for further performance improvement. Attempt to address these issues, we propose SDVP, a Stepwise Distillation learning strategy for non-differentiable VPorg across various VR tasks. Specifically, our SDVP stepwise distills the capabilities of existing, well-trained small task-specific models for decomposed visual sub-tasks in VProg into the much larger VLMs invoked by corresponding visual sub-modules. Besides, distilling the knowledge of little-size task-specific models into pre-trained larger VLMs rather than replacing them helps keep the cross-task abilities of VProgs. Extensive and comprehensive experimental results on different VProg frameworks demonstrate that our SDVP obtains significant performance gains on specific VR benchmarks, i.e., GQA (+2.4\%) and NLVRv2 (+6.2\%) for VisProg and GQA (+6.5\%) and NLVRv2 (+4.0\%) for ViperGPT, and also maintains a promising performance for VProg on unseen and previous VR tasks.

Wentao Wan, Nan Kang, Zeqing Wang, Zhuojie Yang, Liang Lin, Keze Wang• 2023

Related benchmarks

Task	Dataset	Result
Visual Question Answering	GQA	Accuracy57.7	524
Visual Reasoning	NLVR2 (test)	Accuracy75.2	46
Visual Reasoning	GQA balanced (test-dev)	Accuracy50.4	6
Visual Question Answering	Open Images cross-task (test)	Accuracy42.5	5

Showing 4 of 4 rows

Other info

Follow for update

@wizwand_team Discord