Learning Theory of Transformers: Local-to-Global Approximation via Softmax Partition of Unity

About

This paper investigates the learning theory of Transformer networks for regression tasks on the compact Euclidean domain $[0,1]^d$ and $d$-dimensional compact Riemannian manifolds. We propose a novel constructive approximation framework for Transformers that builds local approximations of the target function and aggregates them into a global approximation via softmax partition of unity. This approach leverages the attention mechanism to achieve spatial localization through affine transformations of the input. The softmax activation plays a crucial role in aggregating local approximations to a global output. From an approximation perspective, we prove that a dense Transformer equipped with only two encoder blocks and standard single-hidden-layer point-wise feed-forward networks can achieve a uniform $\varepsilon$-approximation error for $\alpha$-H\"older continuous functions with $\alpha \in (0,1]$ using $\mathcal{O}(\varepsilon^{-d/\alpha})$ total parameters. Building upon this approximation guarantee, we establish a near minimax-optimal generalization error bound of order $\mathcal{O}\big(n^{-\frac{2\alpha}{2\alpha+d}} \log n\big)$ for the empirical risk minimizer, where $n$ is the training data size. The Transformer architecture studied in this paper is dense, shallow and wide, and employs softmax activation and sinusoidal positional encodings, closely reflecting practical implementations.

Zhongjie Shi, Wenjing Liao• 2026

Related benchmarks

Task	Dataset	Result	Rank
Function Approximation	Functions on low-dimensional manifolds	Depth2		6

Showing 1 of 1 rows

Other info

Follow for update

@wizwand_team Discord