FOCA: Multimodal Malware Classification via Hyperbolic Cross-Attention
About
In this work, we introduce FOCA, a novel multimodal framework for malware classification that jointly leverages audio and visual modalities. Unlike conventional Euclidean-based fusion methods, FOCA is the first to exploit the intrinsic hierarchical relationships between audio and visual representations within hyperbolic space. To achieve this, raw binaries are transformed into both audio and visual representations, which are then processed through three key components: (i) a hyperbolic projection module that maps Euclidean embeddings into the Poincare ball, (ii) a hyperbolic cross-attention mechanism that aligns multimodal dependencies under curvature-aware constraints, and (iii) a Mobius addition-based fusion layer. Comprehensive experiments on two benchmark datasets-Mal-Net and CICMalDroid2020- show that FOCA consistently outperforms unimodal models, surpasses most Euclidean multimodal baselines, and achieves state-of-the-art performance over existing works.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Malware Classification | Mal-Net | Accuracy82.84 | 38 | |
| Malware Classification | CICMalDroid 2020 | Accuracy0.991 | 38 |