Malware Classification Leveraging NLP & Machine Learning for Enhanced Accuracy

About

This paper investigates the application of natural language processing (NLP)-based n-gram analysis and machine learning techniques to malware classification. We explore how NLP can be used to extract and analyze textual features from malware samples through n-grams, that is, contiguous sequences of strings or API calls. This approach captures distinctive linguistic patterns across malware families and benign samples, enabling finer-grained classification. We examine n-gram size selection, feature representation, and classification algorithms. Evaluating the proposed method on real-world malware samples, we observe significantly improved accuracy compared with traditional methods. With the n-gram approach, we achieve an accuracy of 99.02% across various machine learning algorithms, using a hybrid feature selection technique to address high dimensionality. This technique reduces the feature set to only 1.6% of the original features.
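To make the n-gram idea concrete, the following is a minimal sketch of extracting n-grams from API call traces and turning them into count vectors. The trace contents, the function names, and the document-frequency filter are illustrative assumptions. In particular, the filter is a simple stand-in for dimensionality reduction and is not the hybrid feature selection technique described in the abstract, whose details are not given here.

```python
from __future__ import annotations

from collections import Counter
from typing import Sequence


def extract_ngrams(tokens: Sequence[str], n: int) -> list[tuple[str, ...]]:
    """Return every contiguous n-gram of a token sequence, in order."""
    if n <= 0:
        raise ValueError("n must be positive")
    # A sequence shorter than n yields an empty list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def select_by_document_frequency(
    samples: Sequence[Sequence[str]], n: int, min_df: int
) -> list[tuple[str, ...]]:
    """Keep n-grams that occur in at least min_df samples.

    Assumption: a plain frequency filter used only for illustration,
    not the paper's hybrid feature selection.
    """
    doc_freq: Counter = Counter()
    for tokens in samples:
        # Use a set so each sample counts an n-gram at most once.
        doc_freq.update(set(extract_ngrams(tokens, n)))
    return sorted(gram for gram, count in doc_freq.items() if count >= min_df)


def vectorize(
    tokens: Sequence[str], n: int, vocabulary: Sequence[tuple[str, ...]]
) -> list[int]:
    """Map a token sequence to n-gram counts over a fixed vocabulary."""
    counts = Counter(extract_ngrams(tokens, n))
    return [counts[gram] for gram in vocabulary]


# Hypothetical API call traces (made up for this example).
trace_a = ["CreateFileW", "WriteFile", "CloseHandle", "CreateFileW", "WriteFile"]
trace_b = ["RegOpenKeyExW", "CreateFileW", "WriteFile", "CloseHandle"]

vocabulary = select_by_document_frequency([trace_a, trace_b], n=2, min_df=2)
features_a = vectorize(trace_a, 2, vocabulary)
features_b = vectorize(trace_b, 2, vocabulary)
```

The resulting count vectors could then be passed to any standard classifier. Choosing n trades off specificity against dimensionality: larger n captures longer behavioral patterns but makes the vocabulary grow quickly, which is why some form of feature selection is needed.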

Bishwajit Prasad Gond, Rajneekant, Pushkar Kishore, Durga Prasad Mohapatra • 2025

Related benchmarks

Task	Dataset	Result	Rank
Malware Classification	VirusShare	Accuracy: 99.02%	4
