Learning Theory of Transformers: Local-to-Global Approximation via Softmax Partition of Unity
About
This paper investigates the learning theory of Transformer networks for regression tasks on the compact Euclidean domain $[0,1]^d$ and $d$-dimensional compact Riemannian manifolds. We propose a novel constructive approximation framework for Transformers that builds local approximations of the target function and aggregates them into a global approximation via softmax partition of unity. This approach leverages the attention mechanism to achieve spatial localization through affine transformations of the input. The softmax activation plays a crucial role in aggregating local approximations to a global output. From an approximation perspective, we prove that a dense Transformer equipped with only two encoder blocks and standard single-hidden-layer point-wise feed-forward networks can achieve a uniform $\varepsilon$-approximation error for $\alpha$-H\"older continuous functions with $\alpha \in (0,1]$ using $\mathcal{O}(\varepsilon^{-d/\alpha})$ total parameters. Building upon this approximation guarantee, we establish a near minimax-optimal generalization error bound of order $\mathcal{O}\big(n^{-\frac{2\alpha}{2\alpha+d}} \log n\big)$ for the empirical risk minimizer, where $n$ is the training data size. The Transformer architecture studied in this paper is dense, shallow and wide, and employs softmax activation and sinusoidal positional encodings, closely reflecting practical implementations.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Function Approximation | Functions on low-dimensional manifolds | Depth2 | 6 |