
Thanos: A Block-wise Pruning Algorithm for Efficient Large Language Model Compression

About

This paper presents Thanos, a novel weight-pruning algorithm designed to reduce the memory footprint and enhance the computational efficiency of large language models (LLMs) by removing redundant weights while maintaining accuracy. Thanos introduces a block-wise pruning strategy with adaptive masks that dynamically adjust to weight importance, enabling flexible sparsity patterns and structured formats, such as $n:m$ sparsity, optimized for hardware acceleration. Experimental evaluations demonstrate that Thanos achieves state-of-the-art performance in structured pruning and outperforms existing methods in unstructured pruning. By providing an efficient and adaptable approach to model compression, Thanos offers a practical solution for deploying large models in resource-constrained environments.
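To make the $n:m$ format mentioned above concrete, here is a minimal sketch of building an $n:m$ mask for a single weight row. It is not the Thanos algorithm: it picks weights by plain magnitude as a stand-in criterion and does no block-wise weight update. It also assumes the convention that exactly n of every m consecutive weights are zeroed. The function names nm_magnitude_mask and apply_mask are hypothetical, chosen only for this illustration.

```python
def nm_magnitude_mask(weights, n, m):
    """Build a 0/1 mask that zeroes the n smallest-magnitude weights
    in every group of m consecutive weights (assumed n:m convention)."""
    if len(weights) % m != 0:
        raise ValueError("row length must be a multiple of m")
    mask = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Positions of the n smallest-magnitude entries in this group.
        drop = set(sorted(range(m), key=lambda i: abs(group[i]))[:n])
        mask.extend(0 if i in drop else 1 for i in range(m))
    return mask


def apply_mask(weights, mask):
    """Zero out the weights whose mask entry is 0."""
    return [w * k for w, k in zip(weights, mask)]


row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
mask = nm_magnitude_mask(row, n=2, m=4)   # [1, 0, 0, 1, 0, 1, 1, 0]
pruned = apply_mask(row, mask)
```

Every group of four keeps exactly two nonzero weights, which is the regular pattern that hardware with 2:4 sparsity support can exploit. A method such as Thanos would replace the magnitude criterion with its own importance measure and adjust the surviving weights.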

Ivan Ilin, Peter Richtarik • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
Language Modeling | WikiText2 | Perplexity | 8.8 | 3785
Zero-shot Accuracy | 6-task zero-shot (MMLU, PIQA, ARC-E, ARC-C, Winogrande, OBQA) | Avg. Accuracy (Zero-Shot) | 56.82 | 59
Zero-shot Classification | Eight downstream tasks zero-shot | Accuracy (Zero-shot) | 35.71 | 30
Zero-shot Evaluation | Eight tasks zero-shot | Accuracy (Zero-shot) | 49.33 | 29
