Composable Sparse Fine-Tuning for Cross-Lingual Transfer
About
Fine-tuning the entire set of parameters of a large pretrained model has become the mainstream approach for transfer learning. To increase its efficiency and prevent catastrophic forgetting and interference, techniques like adapters and sparse fine-tuning have been developed. Adapters are modular, as they can be combined to adapt a model towards different facets of knowledge (e.g., dedicated language and/or task adapters). Sparse fine-tuning is expressive, as it controls the behavior of all model components. In this work, we introduce a new fine-tuning method with both these desirable properties. In particular, we learn sparse, real-valued masks based on a simple variant of the Lottery Ticket Hypothesis. Task-specific masks are obtained from annotated data in a source language, and language-specific masks from masked language modeling in a target language. Both these masks can then be composed with the pretrained model. Unlike adapter-based fine-tuning, this method neither increases the number of parameters at inference time nor alters the original model architecture. Most importantly, it outperforms adapters in zero-shot cross-lingual transfer by a large margin in a series of multilingual benchmarks, including Universal Dependencies, MasakhaNER, and AmericasNLI. Based on an in-depth analysis, we additionally find that sparsity is crucial to prevent both 1) interference between the fine-tunings to be composed and 2) overfitting. We release the code and models at https://github.com/cambridgeltl/composable-sft.
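The core idea above — storing each fine-tuning as a sparse set of parameter differences and summing several of them onto the pretrained weights — can be sketched in a few lines. This is a toy illustration, not the repository's actual API: the function names (`make_sft`, `apply_sfts`) are invented here, parameters are scalars instead of tensors, and the sparse mask is chosen by simple top-k difference magnitude rather than the paper's Lottery-Ticket-style rewinding procedure.

```python
def make_sft(pretrained, finetuned, k):
    """Build a sparse fine-tuning (SFT) as the k largest-magnitude
    parameter differences phi = theta_finetuned - theta_pretrained."""
    diffs = {name: finetuned[name] - pretrained[name] for name in pretrained}
    top = sorted(diffs, key=lambda n: abs(diffs[n]), reverse=True)[:k]
    return {name: diffs[name] for name in top}

def apply_sfts(pretrained, *sfts):
    """Compose one or more SFTs with the pretrained model by summing
    their sparse difference vectors onto the original weights."""
    theta = dict(pretrained)
    for sft in sfts:
        for name, delta in sft.items():
            theta[name] += delta
    return theta

# Toy "model" with four scalar parameters.
pretrained = {"w1": 0.5, "w2": -0.2, "w3": 1.0, "w4": 0.0}
task_ft    = {"w1": 0.9, "w2": -0.2, "w3": 1.0, "w4": 0.1}  # after task fine-tuning
lang_ft    = {"w1": 0.5, "w2": 0.3,  "w3": 1.0, "w4": 0.0}  # after target-language MLM

task_sft = make_sft(pretrained, task_ft, k=1)  # keeps only the w1 shift
lang_sft = make_sft(pretrained, lang_ft, k=1)  # keeps only the w2 shift

# Zero-shot transfer: source-language task SFT + target-language SFT.
composed = apply_sfts(pretrained, task_sft, lang_sft)
```

Because each SFT touches only a small, largely disjoint subset of parameters, summing them causes little interference — the sparsity the abstract highlights — and, unlike adapters, the composed model has exactly the original architecture and parameter count.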
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Sentiment Analysis | MultiMM CN (test) | F1 Score | 69.02 | 24 |
| Sentiment Analysis | MultiMM EN (test) | F1 Score | 69.93 | 24 |
| Metaphor Detection | MultiMM EN (test) | F1 Score | 68.15 | 24 |
| Metaphor Detection | MultiMM CN (test) | F1 Score | 66.41 | 24 |
| Named Entity Recognition | MasakhaNER (test) | F1 Score | 71.7 | 19 |
| Named Entity Recognition | PAN-X | Macro Avg Score | 0.825 | 16 |
| Dependency Parsing | Universal Dependencies 2.7 (test) | AR DP Score | 70.8 | 14 |
| Question Answering | XQuAD | F1 (ar) | 73 | 12 |
| Named Entity Recognition | NER Average over all languages (test) | F1 Score | 71.7 | 9 |
| Natural Language Inference | AmericasNLI (test) | Accuracy (aym) | 58.1 | 9 |