Structured Recurrent Mixers for Massively Parallelized Sequence Generation

About

Over the last two decades, language modeling has shifted from predominantly recurrent architectures, which process tokens sequentially during both training and inference, to non-recurrent models that process sequence elements in parallel during training. This shift yields greater training efficiency and stability at the expense of lower inference throughput. Here we introduce the Structured Recurrent Mixer (SRM), an architecture that allows for algebraic conversion between a sequence-parallel representation at train time and a recurrent representation at inference, notably without the need for specialized kernels or device-specific memory management. We show experimentally that this dual representation allows for greater training efficiency, higher input information capacity, and larger inference throughput and concurrency when compared to other linear-complexity models. We postulate that recurrent models are poorly suited to extended sequence length scaling for the information-rich inputs typical of language, but are well suited to scaling in the sample (batch) dimension due to their constant memory per sample. We provide Mojo/MAX inference implementations of SRMs exhibiting 12x the throughput and 170x the concurrency of similarly powerful Transformers served with vLLM. Similar increases are characteristic of PyTorch implementations, where they result in a 30% increase in compute-constant GSM8K Pass@k. We conclude by demonstrating that SRMs are effective reinforcement learning training candidates.
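To make the idea of a dual representation concrete, here is a minimal sketch in plain Python. It is not the SRM's mixing structure, which the abstract does not specify. It assumes a toy causal mixer with a single scalar decay, and the names `parallel_mix`, `recurrent_mix`, and `decay` are hypothetical. The train-time form applies one lower-triangular mixing matrix to every position at once. The inference-time form computes the same outputs with a fixed-size state, which is where the constant memory per sample comes from.

```python
def parallel_mix(xs, decay):
    """Sequence-parallel form: one causal mixing matrix applied to all positions."""
    n = len(xs)
    dim = len(xs[0])
    # Toy assumption: mix[t][s] = decay**(t - s) for s <= t, and 0 above the diagonal.
    mix = [[decay ** (t - s) if s <= t else 0.0 for s in range(n)] for t in range(n)]
    return [
        [sum(mix[t][s] * xs[s][d] for s in range(n)) for d in range(dim)]
        for t in range(n)
    ]


def recurrent_mix(xs, decay):
    """Recurrent form: a fixed-size state updated once per token."""
    state = [0.0] * len(xs[0])
    ys = []
    for x in xs:
        # h_t = decay * h_{t-1} + x_t; memory use does not grow with sequence length.
        state = [decay * h + xi for h, xi in zip(state, x)]
        ys.append(list(state))
    return ys


# Four tokens with two features each (made-up values for illustration).
xs = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0], [-2.0, 4.0]]
par = parallel_mix(xs, 0.5)
rec = recurrent_mix(xs, 0.5)

# Largest elementwise gap between the two forms; they should agree.
max_diff = max(abs(p - r) for pr, rr in zip(par, rec) for p, r in zip(pr, rr))
print(max_diff)
```

In this toy the two forms agree because unrolling the recurrence gives exactly the rows of the triangular matrix. The parallel form needs memory that grows with sequence length, while the recurrent form keeps only one state vector per sample, which is the property the abstract ties to scaling in the batch dimension.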

Benjamin L. Badger • 2026

Related benchmarks

Task	Dataset	Metric	Result
Instruction Following	IFEval	--	--	854
Language Modeling	WikiText	--	--	740
Mathematical Reasoning	GSM8K	Accuracy	1.44	388
Science Question Answering	ARC Easy	Accuracy	50	162
Long-context evaluation	LongBench	Average Score	4.09	96
General Language Understanding	GLUE	Accuracy	0.4577	75
Word Prediction	Lambada OpenAI	Accuracy	3.05	29
Pronoun Resolution	XWinograd	Accuracy	53.41	19
Question Answering	SQuAD	Accuracy	1.47	15
Inference Efficiency	1x V100 (16GB) (synthetic)	Throughput (tokens/s)	2.83e+4	8

Showing 10 of 14 rows
