Free Energy Mixer

About

Standard attention stores keys and values losslessly but reads them through a per-head convex average, which blocks channel-wise selection. We propose the Free Energy Mixer (FEM): a free-energy (log-sum-exp) read that applies a value-driven, per-channel log-linear tilt to a fast prior over indices (e.g., the query/key distribution in standard attention). Unlike methods that try to improve and enrich the $(q,k)$ scoring distribution, FEM treats that distribution as a prior and yields a value-aware posterior read. The read moves smoothly from averaging to per-channel selection as the learnable inverse temperature increases, while preserving parallelism and the original asymptotic complexity ($O(T^2)$ for softmax; $O(T)$ for linearizable variants). We instantiate a two-level gated FEM that is plug-and-play with standard and linear attention, linear RNNs, and SSMs. It consistently outperforms strong baselines on NLP, vision, and time-series tasks at matched parameter budgets.
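The contrast between a convex-average read and a free-energy read can be illustrated with a small sketch. It assumes the simplest form consistent with the abstract, a per-channel read of the form (1/beta) * log sum_j prior[j] * exp(beta * values[j][c]), with a single scalar inverse temperature. The names `free_energy_read`, `convex_average`, and `beta` are illustrative and not taken from the paper, and the two-level gating is not reproduced here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def convex_average(prior, values):
    """Standard attention-style read: one set of weights shared by all channels."""
    num_channels = len(values[0])
    return [sum(p * row[c] for p, row in zip(prior, values))
            for c in range(num_channels)]


def free_energy_read(prior, values, beta):
    """Per-channel log-sum-exp read (assumed form, scalar beta > 0).

    prior  : T non-negative weights summing to 1 (e.g. softmax of q.k scores)
    values : T rows, each holding D channel values
    """
    num_channels = len(values[0])
    out = []
    for c in range(num_channels):
        # Tilt the log-prior by beta * value, separately for each channel.
        logits = [math.log(p) + beta * row[c]
                  for p, row in zip(prior, values) if p > 0.0]
        m = max(logits)  # subtract the max for numerical stability
        lse = m + math.log(sum(math.exp(l - m) for l in logits))
        out.append(lse / beta)
    return out


# Toy example: uniform prior over three indices, two value channels.
prior = softmax([0.0, 0.0, 0.0])
values = [[1.0, -2.0], [0.0, 3.0], [-1.0, 0.5]]

avg = convex_average(prior, values)
low = free_energy_read(prior, values, beta=1e-3)   # stays near the average
high = free_energy_read(prior, values, beta=1e3)   # approaches each channel's max
```

In this sketch a small `beta` keeps the read close to the convex average, while a large `beta` lets each channel pick its own best index within the prior's support, which is the averaging-to-selection behavior the abstract describes.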

Jiecheng Lu, Shihao Yang • 2026

Related benchmarks

Task	Dataset	Result
Image Classification	ImageNet-1K	Top-1 Acc 80.45	1239
Time Series Forecasting	ETTh1	MSE 0.414	836
Time Series Forecasting	ETTh2	MSE 0.339	796
Time Series Forecasting	ETTm2	MSE 0.241	536
Synthetic in-context reasoning	MAD synthetic (test)	Compression Score 55.5	29
Time Series Forecasting	Weather	MSE 0.218	25
Commonsense Reasoning and Knowledge Question Answering	General Ability Suite (ARC, HellaSwag, PIQA, BoolQ, WinoGrande, COPA, OBQA, SciQ) various (test)	ARC-C Accuracy 36.4	19
Comparative Ranking	Unified Evaluation v1 (aggregate)	Average Rank 1.81	19
Unified Multi-task Language Understanding and Instruction Following	Open LLM Leaderboard v1 (test)	MMLU-P Accuracy 11.5	19
Time Series Forecasting	solar	MSE 0.186	9

