
Free Energy Mixer

About

Standard attention stores keys/values losslessly but reads them via a per-head convex average, which blocks channel-wise selection. We propose the Free Energy Mixer (FEM): a free-energy (log-sum-exp) read that applies a value-driven, per-channel log-linear tilt to a fast prior over indices (e.g., the query/key distribution in standard attention). Unlike methods that attempt to improve and enrich the $(q,k)$ scoring distribution, FEM treats it as a prior and yields a value-aware posterior read. The read moves smoothly from averaging to per-channel selection as the learnable inverse temperature increases, while preserving parallelism and the original asymptotic complexity ($O(T^2)$ for softmax; $O(T)$ for linearizable variants). We instantiate a two-level gated FEM that is plug-and-play with standard and linear attention, linear RNNs, and SSMs. It consistently outperforms strong baselines on NLP, vision, and time-series tasks at matched parameter budgets.
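The read described above can be illustrated with a minimal sketch. It assumes the per-channel read takes the textbook free-energy form $\frac{1}{\beta}\log\sum_t p_t\, e^{\beta v_{t,c}}$, where $p$ is the prior over indices and $\beta$ is the inverse temperature. That exact form, the function names, and the use of a fixed scalar `beta` are assumptions made for illustration; the two-level gated FEM and its linear-time variants are not reproduced here.

```python
import math

def softmax(scores):
    """Numerically stable softmax; stands in for the fast (q, k) prior."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def convex_average(prior, values):
    """Standard attention read: one shared weighting for every channel."""
    num_channels = len(values[0])
    return [sum(p * v[c] for p, v in zip(prior, values))
            for c in range(num_channels)]

def free_energy_read(prior, values, beta):
    """Hypothetical per-channel read: (1/beta) * log sum_t prior[t] * exp(beta * values[t][c]).

    prior:  list of T probabilities summing to 1
    values: T x D nested list
    beta:   positive inverse temperature (a fixed scalar in this sketch)
    """
    num_channels = len(values[0])
    out = []
    for c in range(num_channels):
        # Work in log space and subtract the max for stability.
        # Indices with zero prior mass contribute nothing and are skipped.
        terms = [math.log(p) + beta * v[c]
                 for p, v in zip(prior, values) if p > 0.0]
        m = max(terms)
        log_sum = m + math.log(sum(math.exp(t - m) for t in terms))
        out.append(log_sum / beta)
    return out

# Toy example: three positions, two channels, uniform prior.
prior = softmax([0.0, 0.0, 0.0])
values = [[1.0, 0.0], [0.0, 2.0], [0.5, -1.0]]

avg = convex_average(prior, values)
near_average = free_energy_read(prior, values, beta=1e-4)   # close to avg
near_select = free_energy_read(prior, values, beta=200.0)   # close to per-channel max
```

With a small `beta` the read stays close to the convex average, and with a large `beta` each channel approaches the largest value on the prior's support, so different channels can pick different positions. This is the averaging-to-selection behavior the abstract describes, under the assumed form.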

Jiecheng Lu, Shihao Yang • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Image Classification | ImageNet-1K | Top-1 Acc 80.45 | 1239 |
| Time Series Forecasting | ETTh1 | MSE 0.414 | 836 |
| Time Series Forecasting | ETTh2 | MSE 0.339 | 796 |
| Time Series Forecasting | ETTm2 | MSE 0.241 | 536 |
| Synthetic in-context reasoning | MAD synthetic (test) | Compression Score 55.5 | 29 |
| Time Series Forecasting | Weather | MSE 0.218 | 25 |
| Commonsense Reasoning and Knowledge Question Answering | General Ability Suite (ARC, HellaSwag, PIQA, BoolQ, WinoGrande, COPA, OBQA, SciQ) various (test) | ARC-C Accuracy 36.4 | 19 |
| Comparative Ranking | Unified Evaluation v1 (aggregate) | Average Rank 1.81 | 19 |
| Unified Multi-task Language Understanding and Instruction Following | Open LLM Leaderboard v1 (test) | MMLU-P Accuracy 11.5 | 19 |
| Time Series Forecasting | solar | MSE 0.186 | 9 |
Showing 10 of 11 rows
