TokenSeek: Memory Efficient Fine Tuning via Instance-Aware Token Ditching
About
Fine-tuning has been regarded as the de facto approach for adapting large language models (LLMs) to downstream tasks, but the high training memory consumption inherited from LLMs makes the process inefficient. Among existing memory-efficient approaches, activation-related optimization has proven particularly effective, as activations consistently dominate overall memory consumption. Although prior work offers various activation-optimization strategies, their data-agnostic nature ultimately results in ineffective and unstable fine-tuning. In this paper, we propose TokenSeek, a universal plug-in solution for various transformer-based models that performs instance-aware token seeking and ditching, achieving significant fine-tuning memory savings (e.g., requiring only 14.8% of the memory on Llama3.2 1B) with on-par or even better performance. Furthermore, our interpretable token-seeking process reveals the underlying reasons for its effectiveness, offering valuable insights for future research on token efficiency. Homepage: https://runjia.tech/iclr_tokenseek/
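The core idea of instance-aware token ditching can be sketched as follows: score each token per instance, keep only the highest-scoring ones, and let downstream layers (and their cached activations) operate on the reduced sequence. This is a minimal illustrative sketch, not the paper's actual method; the L2-norm scoring, `keep_ratio` parameter, and function name are assumptions for demonstration.

```python
import numpy as np

def ditch_tokens(activations: np.ndarray, keep_ratio: float = 0.5):
    """Keep the top-scoring tokens of each instance and drop the rest.

    activations: (batch, seq_len, hidden) array of token activations.
    keep_ratio:  fraction of tokens retained per instance.

    Scoring tokens by their activation L2 norm is a hypothetical proxy
    for importance; TokenSeek's actual seeking criterion differs.
    """
    batch, seq_len, hidden = activations.shape
    k = max(1, int(seq_len * keep_ratio))

    # Per-instance importance scores: one scalar per token.
    scores = np.linalg.norm(activations, axis=-1)        # (batch, seq_len)

    # Select each instance's top-k tokens, then restore original order
    # so positional structure is preserved for downstream layers.
    keep = np.argsort(-scores, axis=1)[:, :k]            # (batch, k)
    keep = np.sort(keep, axis=1)

    # Gather the kept tokens; only these activations need to be stored.
    pruned = np.take_along_axis(activations, keep[..., None], axis=1)
    return pruned, keep

# With keep_ratio=0.5, the activation tensor shrinks proportionally,
# which is where the memory saving during fine-tuning would come from.
x = np.random.rand(2, 8, 16)
pruned, kept_idx = ditch_tokens(x, keep_ratio=0.5)
```

Because the selection is computed per instance from the activations themselves, different inputs keep different tokens, which is what distinguishes this from data-agnostic activation compression.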
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Question Answering | ARC (25-shot), MMLU (5-shot), HellaSwag (10-shot), TruthfulQA (0-shot), and WinoGrande (0-shot) (test) | ARC Accuracy: 53.24 | 32 |
| Few-shot Language Evaluation | ARC, HellaSwag, MMLU, TruthfulQA, WinoGrande (few-shot, Llama2-7B) | ARC Accuracy: 53.5 | 6 |
| Language Understanding and Code Generation | Llama 3.2 1B Evaluation Suite (ARC, HellaSwag, MMLU, TruthfulQA, WinoGrande, HumanEval) | ARC: 39.08 | 6 |
| Language Modeling Evaluation | ARC, HellaSwag, MMLU, TruthfulQA, WinoGrande | ARC Accuracy: 34.56 | 4 |
| German-English Translation | Aharoni & Goldberg (2020) Medical, Law, IT, and Subtitles (test) | BLEU: 41.63 | 3 |