
Tuning Language Models by Proxy

About

Despite the general capabilities of large pretrained language models, they consistently benefit from further adaptation to better achieve desired behaviors. However, tuning these models has become increasingly resource-intensive, or impossible when model weights are private. We introduce proxy-tuning, a lightweight decoding-time algorithm that operates on top of a black-box LM to achieve the same end as direct tuning, but by accessing only its predictions over the output vocabulary, not its parameters. Our method tunes a smaller LM, then applies the difference between the predictions of the small tuned and untuned LMs to shift the original predictions of the larger untuned model in the direction of tuning, while retaining the benefits of larger-scale pretraining. In experiments, when we apply proxy-tuning to Llama2-70B using proxies of only 7B size, we close 88% of the gap between Llama2-70B and its truly-tuned chat version when evaluated across knowledge, reasoning, and safety benchmarks. We then demonstrate the generality of proxy-tuning by applying it to domain adaptation on code, and to task-specific finetuning on question answering and math problems. Finally, we show how to proxy-tune a truly black-box LM, GPT-3.5, for temporal adaptation, increasing its knowledge of recent events. Our work demonstrates the promise of using small tuned LMs to efficiently customize large, potentially proprietary LMs through decoding-time guidance.
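The core of the method described above is simple logit arithmetic at each decoding step: the large base model's next-token logits are shifted by the difference between a small tuned ("expert") and small untuned ("anti-expert") model. The sketch below illustrates that arithmetic with NumPy; the function names and the sampling wrapper are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def proxy_tuned_logits(base_logits, expert_logits, anti_expert_logits):
    """Shift the base model's next-token logits by the tuning delta:
    s = s_base + (s_expert - s_anti_expert), per the abstract's description.
    All three arrays must share the same output vocabulary."""
    return base_logits + (expert_logits - anti_expert_logits)

def sample_next_token(base_logits, expert_logits, anti_expert_logits, rng=None):
    """Softmax over the shifted logits, then sample one token id.
    A hypothetical helper for illustration only."""
    rng = rng if rng is not None else np.random.default_rng(0)
    logits = proxy_tuned_logits(base_logits, expert_logits, anti_expert_logits)
    probs = np.exp(logits - logits.max())  # subtract max for numerical stability
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```

In practice this shift is applied at every decoding step, so the large model's outputs move in the direction of tuning while its pretraining knowledge still dominates the distribution.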

Alisa Liu, Xiaochuang Han, Yizhong Wang, Yulia Tsvetkov, Yejin Choi, Noah A. Smith • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | GSM8K | Accuracy | 52.5 | 212 |
| Instruction Following | AlpacaEval | Win Rate | 72.3 | 125 |
| Instruction Following | AlpacaEval, 805 instructions (test) | Win Rate | 10.47 | 14 |
| Biomedical Question Answering | BioASQ | Factoid Acc | 29 | 11 |
| Classification | MTS-Specialty, 200 random samples (test) | Macro F1 Score | 9.5 | 6 |
| Text Generation | MTS-Procedure, 100 samples (test) | MedHelm LLM-Jury Score | 3.466 | 6 |
| Text Generation | MIMIC-RRS, 100 samples (test) | MedHelm LLM-Jury Score | 3.898 | 6 |
| Text Generation | MIMIC-BHC, 100 samples (test) | MedHelm LLM-Jury Score | 3.456 | 6 |
| Classification | MedNLI, 200 random samples (test) | Macro F1 Score | 57.5 | 6 |
| Classification | CLIP, 200 random samples (test) | Macro F1 Score | 0.143 | 6 |
