
Adaptive Layer-skipping in Pre-trained LLMs

About

Various layer-skipping methods have been proposed to accelerate token generation in large language models (LLMs). However, limited attention has been paid to a fundamental question: How do computational demands vary across the generation of different tokens? In this work, we introduce FlexiDepth, a method that dynamically adjusts the number of Transformer layers used in text generation. By incorporating a plug-in router and adapter, FlexiDepth enables adaptive computation in LLMs without modifying their original parameters. Applied to Llama-3-8B, it skips 8 out of 32 layers while maintaining full benchmark performance. Our experiments reveal that computational demands in LLMs vary significantly by token type. Specifically, generating repetitive tokens or fixed phrases requires fewer layers, whereas producing tokens involving computation or high uncertainty requires more layers. Despite the computational savings, FlexiDepth does not yet achieve wall-clock speedup due to varied skipping patterns and I/O overhead. To inspire future work and advance research on practical speedup, we have open-sourced FlexiDepth and a dataset documenting its layer allocation patterns.
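The core idea described above (a per-token router that decides whether a hidden state passes through a full Transformer layer or a lightweight skip-path adapter) can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the function names, the sigmoid-threshold gate, and the stand-in layer/adapter functions are all assumptions made for demonstration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def flexi_layer(h, layer_fn, adapter_fn, router_w, threshold=0.5):
    """Adaptive layer skipping, sketched per token.

    A router scores each token's hidden state; tokens above the
    threshold run through the full layer, the rest take a cheap
    adapter path. Shapes and names are illustrative assumptions.

    h: (seq_len, dim) hidden states for the current layer.
    """
    gate = sigmoid(h @ router_w)       # (seq_len,) routing score per token
    use_layer = gate >= threshold      # boolean mask: full layer vs. skip
    out = np.empty_like(h)
    if use_layer.any():
        out[use_layer] = layer_fn(h[use_layer])       # expensive path
    if (~use_layer).any():
        out[~use_layer] = adapter_fn(h[~use_layer])   # cheap skip path
    return out, use_layer

# Toy usage: 4 tokens of dimension 3, with stand-in functions
# in place of a real Transformer layer and adapter.
rng = np.random.default_rng(0)
h = rng.standard_normal((4, 3))
router_w = rng.standard_normal(3)
full_layer = lambda x: x * 2.0     # stand-in for the Transformer layer
adapter = lambda x: x + 0.1        # stand-in for the skip-path adapter
out, mask = flexi_layer(h, full_layer, adapter, router_w)
```

Because the gate is computed per token, different tokens in the same sequence can take different paths, which matches the paper's observation that easy tokens (repetition, fixed phrases) need fewer layers than hard ones.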

Xuan Luo, Weizhi Wang, Xifeng Yan • 2025

Related benchmarks

Task: Chinese Language Understanding
Dataset: CMMLU (test)
Result: CMMLU Score 0.562
Rank: 13
