HyperPELT: Unified Parameter-Efficient Language Model Tuning for Both Language and Vision-and-Language Tasks
About
The workflow of pretraining and fine-tuning has emerged as a popular paradigm for solving various NLP and V&L (Vision-and-Language) downstream tasks. With the capacity of pretrained models growing rapidly, parameter-efficient fine-tuning has become increasingly important for quick transfer learning and deployment. In this paper, we design a novel unified parameter-efficient transfer learning framework that works effectively on both pure language and V&L tasks. In particular, we use a shared hypernetwork that takes trainable hyper-embeddings as input and outputs weights for fine-tuning different small modules in a pretrained language model, such as the parameters inserted into multi-head attention blocks (i.e., prefix-tuning) and feed-forward blocks (i.e., adapter-tuning). We define a set of embeddings (e.g., layer, block, task, and visual embeddings) as the key components for computing hyper-embeddings, which thus can support both pure language and V&L tasks. Our proposed framework adds fewer trainable parameters in multi-task learning while achieving superior performance and transfer ability compared to state-of-the-art methods. Empirical results on the GLUE benchmark and multiple V&L tasks confirm the effectiveness of our framework on both textual and visual modalities.
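The core mechanism above — composing layer, block, task, and (optionally) visual embeddings into a hyper-embedding, then letting a shared hypernetwork generate the weights of a small adapter module — can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: all dimensions, the concatenation-plus-projection used to form the hyper-embedding, and the single linear layer used as the hypernetwork are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen for illustration only.
d_embed = 8       # size of each source embedding (layer / block / task / visual)
d_model = 16      # hidden size of the pretrained language model
d_bottleneck = 4  # adapter bottleneck size

# Trainable source embeddings that are combined into a hyper-embedding.
layer_emb = rng.standard_normal(d_embed)
block_emb = rng.standard_normal(d_embed)    # e.g. attention vs. feed-forward block
task_emb = rng.standard_normal(d_embed)
visual_emb = rng.standard_normal(d_embed)   # could be zeroed for pure language tasks

# Hyper-embedding: here, a projection of the concatenated source embeddings
# (the actual composition in the paper may differ).
W_h = rng.standard_normal((4 * d_embed, d_embed))
hyper_emb = np.concatenate([layer_emb, block_emb, task_emb, visual_emb]) @ W_h

# Shared hypernetwork: a single linear map whose flat output is reshaped into
# the down- and up-projection weights of one adapter module.
n_params = d_model * d_bottleneck * 2
W_hyper = rng.standard_normal((d_embed, n_params))
flat = hyper_emb @ W_hyper
W_down = flat[: d_model * d_bottleneck].reshape(d_model, d_bottleneck)
W_up = flat[d_model * d_bottleneck :].reshape(d_bottleneck, d_model)

def adapter(x):
    """Adapter forward pass with a ReLU bottleneck and a residual connection."""
    return x + np.maximum(x @ W_down, 0.0) @ W_up

x = rng.standard_normal(d_model)
print(adapter(x).shape)  # (16,)
```

Only the hypernetwork and the embeddings are trainable here; the backbone weights that `adapter` wraps stay frozen, which is where the parameter efficiency comes from — one shared hypernetwork serves every layer, block, and task instead of a separate adapter per position.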
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Sentiment Analysis | IMDB (test) | Accuracy | 90.5 | 248 |
| Question Classification | TREC (test) | Accuracy | 97.2 | 124 |
| Visual Question Answering | OKVQA (val) | VQA Score | 35.86 | 101 |
| Question Answering | BoolQ (test) | Accuracy | 75.7 | 46 |
| Natural Language Inference | CB SuperGLUE (test) | Accuracy | 91.43 | 33 |
| Visual Entailment | SNLI-VE (test-p) | Accuracy | 65.67 | 24 |
| Paraphrase Detection | PAWS original (test) | Accuracy | 91.79 | 23 |