
Structured Recurrent Mixers for Massively Parallelized Sequence Generation

About

Over the last two decades, language modeling has shifted from predominantly recurrent architectures, which process tokens sequentially during both training and inference, to non-recurrent models that process sequence elements in parallel during training. This shift brings greater training efficiency and stability at the expense of lower inference throughput. Here we introduce the Structured Recurrent Mixer (SRM), an architecture that allows for algebraic conversion between a sequence-parallel representation at train time and a recurrent representation at inference, notably without the need for specialized kernels or device-specific memory management. We show experimentally that this dual representation allows for greater training efficiency, higher input information capacity, and larger inference throughput and concurrency when compared to other linear-complexity models. We postulate that recurrent models are poorly suited to extended sequence length scaling for information-rich inputs typical of language, but are well suited to scaling in the sample (batch) dimension due to their constant memory per sample. We provide Mojo/MAX inference implementations of SRMs exhibiting 12x the throughput and 170x the concurrency of similarly powerful Transformers served with vLLM, increases that are characteristic of PyTorch implementations and result in a 30% increase in compute-constant GSM8K Pass@k. We conclude by demonstrating that SRMs are effective reinforcement learning training candidates.

Benjamin L. Badger • 2026
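The abstract's central idea, one computation with both a sequence-parallel form and a recurrent form, can be illustrated with a toy example. The sketch below is not the SRM layer itself, whose exact structure this page does not give. It assumes a generic decayed linear recurrence as a stand-in, and the names `parallel_mix`, `recurrent_mix`, and `decay` are hypothetical. It shows only that a causal mixing matrix applied over all positions at once gives the same outputs as a step-by-step update that keeps a fixed-size state per sample.

```python
# Toy illustration only: a decayed linear recurrence standing in for a
# layer with dual parallel/recurrent forms. Not the SRM from the paper.

def parallel_mix(xs, decay):
    """Train-time view: all positions are mixed at once with a
    lower-triangular (causal) matrix, weights[t][s] = decay**(t - s)."""
    n, dim = len(xs), len(xs[0])
    weights = [[decay ** (t - s) if s <= t else 0.0 for s in range(n)]
               for t in range(n)]
    return [[sum(weights[t][s] * xs[s][d] for s in range(n))
             for d in range(dim)]
            for t in range(n)]


def recurrent_mix(xs, decay):
    """Inference-time view: one token at a time, carrying a state whose
    size does not depend on how many tokens have been seen."""
    state = [0.0] * len(xs[0])
    outputs = []
    for x in xs:
        # h_t = decay * h_{t-1} + x_t, applied elementwise
        state = [decay * h + v for h, v in zip(state, x)]
        outputs.append(list(state))
    return outputs


def max_abs_diff(a, b):
    """Largest elementwise difference between two nested lists."""
    return max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


xs = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0], [-2.0, 4.0]]
par = parallel_mix(xs, 0.5)
rec = recurrent_mix(xs, 0.5)
assert max_abs_diff(par, rec) < 1e-12  # both forms agree
```

In this toy setting the recurrent form stores only `state`, one vector per sample regardless of sequence length, which is the property the abstract links to scaling in the batch dimension.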

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Instruction Following | IFEval | -- | 836 |
| Language Modeling | WikiText | -- | 740 |
| Mathematical Reasoning | GSM8K | Accuracy: 1.44 | 388 |
| Science Question Answering | ARC Easy | Accuracy: 50 | 162 |
| Long-context evaluation | LongBench | Average Score: 4.09 | 90 |
| General Language Understanding | GLUE | Accuracy: 0.4577 | 75 |
| Word Prediction | Lambada OpenAI | Accuracy: 3.05 | 29 |
| Pronoun Resolution | XWinograd | Accuracy: 53.41 | 19 |
| Question Answering | SQuAD | Accuracy: 1.47 | 15 |
| Inference Efficiency | 1x V100 (16GB) (synthetic) | Throughput (tokens/s): 2.83e+4 | 8 |
Showing 10 of 14 rows
