
Adaptive Layer-skipping in Pre-trained LLMs

About

Various layer-skipping methods have been proposed to accelerate token generation in large language models (LLMs). However, limited attention has been paid to a fundamental question: How do computational demands vary across the generation of different tokens? In this work, we introduce FlexiDepth, a method that dynamically adjusts the number of Transformer layers used in text generation. By incorporating a plug-in router and adapter, FlexiDepth enables adaptive computation in LLMs without modifying their original parameters. Applied to Llama-3-8B, it skips 8 out of 32 layers while maintaining full benchmark performance. Our experiments reveal that computational demands in LLMs vary significantly by token type. Specifically, generating repetitive tokens or fixed phrases requires fewer layers, whereas producing tokens involving computation or high uncertainty requires more layers. Despite the computational savings, FlexiDepth does not yet achieve wall-clock speedup due to varied skipping patterns and I/O overhead. To inspire future work and advance research on practical speedup, we have open-sourced FlexiDepth and a dataset documenting its layer allocation patterns.
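The core idea described above (a per-token router that decides whether a hidden state passes through a full Transformer layer or a lightweight skip-path adapter) can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the function names, the sigmoid-threshold gate, and the stand-in layer/adapter functions are all assumptions made for demonstration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def flexi_layer(h, layer_fn, adapter_fn, router_w, threshold=0.5):
    """Adaptive layer skipping, sketched per token.

    A router scores each token's hidden state; tokens above the
    threshold run through the full layer, the rest take a cheap
    adapter path. Shapes and names are illustrative assumptions.

    h: (seq_len, dim) hidden states for the current layer.
    """
    gate = sigmoid(h @ router_w)       # (seq_len,) routing score per token
    use_layer = gate >= threshold      # boolean mask: full layer vs. skip
    out = np.empty_like(h)
    if use_layer.any():
        out[use_layer] = layer_fn(h[use_layer])       # expensive path
    if (~use_layer).any():
        out[~use_layer] = adapter_fn(h[~use_layer])   # cheap skip path
    return out, use_layer

# Toy usage: 4 tokens of dimension 3, with stand-in functions
# in place of a real Transformer layer and adapter.
rng = np.random.default_rng(0)
h = rng.standard_normal((4, 3))
router_w = rng.standard_normal(3)
full_layer = lambda x: x * 2.0     # stand-in for the Transformer layer
adapter = lambda x: x + 0.1        # stand-in for the skip-path adapter
out, mask = flexi_layer(h, full_layer, adapter, router_w)
```

Because the gate is computed per token, different tokens in the same sequence can take different paths, which matches the paper's observation that easy tokens (repetition, fixed phrases) need fewer layers than hard ones.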

Xuan Luo, Weizhi Wang, Xifeng Yan • 2025

Related benchmarks

Task: Chinese Language Understanding
Dataset: CMMLU (test)
Result: CMMLU Score 0.562
Rank: 13
