AdaHOP: Fast and Accurate Low-Precision Training via Outlier-Pattern-Aware Rotation
About
Hadamard transforms have become a key tool for stabilizing low-precision training, but existing methods apply them uniformly across tensors and computation paths. We show that this one-size-fits-all strategy is inherently limited: Hadamard smoothing reduces quantization error only when its direction is properly aligned with the operand's outlier structure. Through a systematic study of weights, activations, and gradients in LLM training, we identify three stable outlier patterns (Row-wise, Column-wise, and None) and show that each pair of outlier patterns in a matrix multiplication requires a distinct transform or outlier-handling strategy. We propose AdaHOP (Adaptive Hadamard transform with Outlier-Pattern-aware strategy), which applies the Inner Hadamard Transform (IHT) when inner-dimension mixing properly suppresses the operands' outliers, and selectively applies Outlier Extraction (OE), which moves dominant outlier rows or columns into a high-precision path, when it does not. With fused, hardware-aware Triton kernels, AdaHOP enables training from scratch at MXFP4 precision with BF16-level quality, while achieving up to 3.6X memory compression and 1.46X end-to-end training speedup over BF16.
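The two mechanisms named in the abstract can be illustrated with a small, pure-Python toy. This is a sketch under stated assumptions, not the paper's Triton kernels: the matrices, the threshold, and the helper names (`hadamard`, `split_outlier_rows`, and so on) are hypothetical, no MXFP4 quantization is simulated, and the largest absolute value is used only as a rough proxy for how hard a tensor is to quantize. It checks that inserting an orthonormal Hadamard matrix along the inner dimension leaves the product unchanged, that such mixing shrinks an outlier confined to one inner-dimension channel, that it can enlarge the peak in an extreme case where a whole row is large, and that splitting outlier rows into a separate path is exact by linearity.

```python
import math

def hadamard(n):
    """Orthonormal Sylvester Hadamard matrix; n must be a power of two."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    scale = 1.0 / math.sqrt(n)
    return [[v * scale for v in row] for row in h]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def max_abs(a):
    return max(abs(v) for row in a for v in row)

def close(a, b, tol=1e-9):
    return all(abs(x - y) <= tol for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def split_outlier_rows(a, threshold):
    """Hypothetical outlier split: rows whose peak exceeds threshold go to
    `out`; everything else stays in `rest`. out + rest == a."""
    flags = [max(abs(v) for v in row) > threshold for row in a]
    zeros = [0.0] * len(a[0])
    out = [row[:] if f else zeros[:] for row, f in zip(a, flags)]
    rest = [zeros[:] if f else row[:] for row, f in zip(a, flags)]
    return out, rest

M, K, N = 4, 8, 4  # toy shapes: X is M x K, W is K x N, K is the inner dimension
H = hadamard(K)
W = [[((k + 2 * j) % 7 - 3) * 0.1 for j in range(N)] for k in range(K)]

def small_matrix():
    return [[((i * K + j) % 5 - 2) * 0.1 for j in range(K)] for i in range(M)]

# Case 1: one inner-dimension channel (column 3) is an outlier in every row.
X = small_matrix()
for row in X:
    row[3] = 100.0
XH = matmul(X, H)               # mix along the inner dimension
HtW = matmul(transpose(H), W)   # matching transform on the other operand
iht_exact = close(matmul(XH, HtW), matmul(X, W))  # (X H)(H^T W) == X W
print("inner transform preserves product:", iht_exact)
print("peak before / after mixing:", max_abs(X), max_abs(XH))

# Case 2 (extreme toy case): one whole row is uniformly large, so
# inner-dimension mixing concentrates it instead of spreading it.
R = small_matrix()
R[1] = [50.0] * K
RH = matmul(R, H)
print("row case peak before / after mixing:", max_abs(R), max_abs(RH))

# Outlier split: route the large row through a separate path.
out, rest = split_outlier_rows(R, threshold=10.0)
oe_exact = close(add(matmul(out, W), matmul(rest, W)), matmul(R, W))
print("split preserves product:", oe_exact, "| remaining peak:", max_abs(rest))
```

In a real low-precision pipeline the `rest` product would run through the quantized path and the `out` product through a higher-precision one; here both are computed in floating point, so the example only demonstrates the algebra and the change in dynamic range.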
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag | Accuracy | 64.98 | 711 |
| Physical Commonsense Reasoning | PIQA | Accuracy | 76.28 | 696 |
| Question Answering | ARC-E | Accuracy | 55.89 | 523 |
| Language Modeling | LAMBADA | Accuracy | 48.57 | 103 |
| Model Training | Llama3.1-8B (train) | Memory (GB) | 20.94 | 7 |