
XDBERT: Distilling Visual Information to BERT from Cross-Modal Systems to Improve Language Understanding

About

Transformer-based models are widely used in natural language understanding (NLU) tasks, and multimodal transformers have been effective in visual-language tasks. This study explores distilling visual information from pretrained multimodal transformers to pretrained language encoders. Our framework is inspired by the success of cross-modal encoders in visual-language tasks, while we alter the learning objective to cater to the language-heavy characteristics of NLU. After a small number of extra adapting steps and finetuning, the proposed XDBERT (cross-modal distilled BERT) outperforms pretrained BERT on the general language understanding evaluation (GLUE) benchmark, situations with adversarial generations (SWAG), and readability benchmarks. We analyze the performance of XDBERT on GLUE to show that the improvement is likely visually grounded.
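The distillation idea described above can be sketched roughly as follows. This is an illustrative assumption, not the paper's exact objective: it distills by matching a learned projection of the student's (text-only BERT-like) hidden states to a frozen multimodal teacher's text-side hidden states with an MSE loss. The tensor sizes, the linear alignment head, and the loss choice are all hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Assumed sizes; in practice these come from the student and teacher encoders.
batch, seq_len, d_student, d_teacher = 2, 8, 768, 512

# Stand-ins for encoder outputs: in a real setup, student_hidden would be
# BERT's hidden states and teacher_hidden the (frozen) cross-modal
# transformer's hidden states for the same text input.
student_hidden = torch.randn(batch, seq_len, d_student, requires_grad=True)
teacher_hidden = torch.randn(batch, seq_len, d_teacher)  # no grad: frozen teacher

# Hypothetical alignment head mapping student space into teacher space.
proj = nn.Linear(d_student, d_teacher)

# Feature-distillation loss: push projected student states toward the teacher's.
distill_loss = nn.functional.mse_loss(proj(student_hidden), teacher_hidden)
distill_loss.backward()  # gradients flow into the student (and the head) only
```

During the adapting steps this loss would be minimized alongside (or before) the usual language-modeling or task objective; afterwards the student is finetuned on the downstream NLU task as usual.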

Chan-Jan Hsu, Hung-yi Lee, Yu Tsao · 2022

Related benchmarks

Task | Dataset | Metric | Result | Rank
--- | --- | --- | --- | ---
Natural Language Understanding | GLUE (dev) | SST-2 Accuracy | 97.36 | 504
Natural Language Understanding | GLUE (test) | SST-2 Accuracy | 97.36 | 416
Commonsense Reasoning | SWAG (test) | Accuracy | 0.9283 | 13
Natural Language Understanding | SWAG (dev) | Accuracy | 92.59 | 6
Readability Assessment | READ (test) | RMSE | 0.635 | 2
Natural Language Understanding | READ (test) | Accuracy | 56.5 | 1
