PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer
About
This paper introduces the Polynomial Mixer (PoM), a novel token mixing mechanism with linear complexity that serves as a drop-in replacement for self-attention. PoM aggregates input tokens into a compact representation through a learned polynomial function, from which each token retrieves contextual information. We prove that PoM satisfies the contextual mapping property, ensuring that transformers equipped with PoM remain universal sequence-to-sequence approximators. We replace standard self-attention with PoM across five diverse domains: text generation, handwritten text recognition, image generation, 3D modeling, and Earth observation. PoM matches the performance of attention-based models while drastically reducing computational cost when working with long sequences. The code is available at https://github.com/davidpicard/pom.
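The mixing idea described above — summarizing all tokens into one compact polynomial state and letting each token read from it — can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's exact formulation: the element-wise monomial aggregation, the sigmoid gate, and the projections `Wq`/`Wo` are assumptions made for clarity; consult the linked repository for the actual PoM definition.

```python
import numpy as np

def polynomial_mixer(X, Wq, Wo, degree=2):
    """Hypothetical PoM-style mixing with linear complexity in sequence length.

    X: (n, d) token matrix. The global state is a sum of element-wise
    monomials of the tokens (cost O(n * d * degree)), so no n x n
    attention matrix is ever formed.
    """
    # Aggregate all tokens into one shared state: one d-vector per degree.
    state = np.concatenate([(X ** p).sum(axis=0) for p in range(1, degree + 1)])
    # Each token retrieves context by gating the shared state with its
    # own projected query (sigmoid gate, an assumption of this sketch).
    gate = 1.0 / (1.0 + np.exp(-(X @ Wq)))   # (n, degree * d)
    return (gate * state) @ Wo               # (n, d) mixed tokens

rng = np.random.default_rng(0)
n, d, degree = 6, 4, 2
X = rng.standard_normal((n, d))
Wq = rng.standard_normal((d, degree * d)) * 0.1
Wo = rng.standard_normal((degree * d, d)) * 0.1
Y = polynomial_mixer(X, Wq, Wo, degree)
```

Because the state has a fixed size independent of `n`, doubling the sequence length only doubles the cost, in contrast with the quadratic growth of self-attention.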
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag | -- | -- | 1891 |
| Class-conditional Image Generation | ImageNet 256x256 (train) | -- | -- | 345 |
| Language Modeling | FineWeb (val) | Validation Loss | 3.31 | 159 |
| Commonsense Reasoning | WinoGrande | Accuracy | 51.9 | 78 |
| Multitask Language Understanding | MMLU | Accuracy | 25.6 | 34 |
| 3D Semantic Segmentation | ScanNet | mIoU | 76.8 | 27 |
| Question Answering | ARC-E | Normalized Accuracy | 29 | 19 |
| 3D Point Cloud Segmentation | SemanticKITTI | mIoU | 67.5 | 3 |
| Optical Character Recognition | Ludovico Antonio Muratori (LAM) (single-line) | CER | 2.8 | 3 |
| Optical Character Recognition | Ludovico Antonio Muratori (LAM) (multi-line) | CER | 3.3 | 3 |