Caracal: Causal Architecture via Spectral Mixing

About

The scalability of Large Language Models to long sequences is hindered by the quadratic cost of attention and the limitations of positional encodings. To address both, we introduce Caracal, a novel architecture that replaces attention with a parameter-efficient, O(L log(L)) Multi-Head Fourier (MHF) module. Our contributions are threefold: (1) We leverage the Fast Fourier Transform (FFT) for sequence mixing, inherently addressing both bottlenecks mentioned above. (2) We apply a frequency-domain causal masking technique that enforces autoregressive behavior via asymmetric padding and truncation, overcoming a critical barrier for Fourier-based generative models. (3) Unlike efficient models that rely on hardware-specific implementations (e.g., Mamba), we use standard library operators. This ensures robust portability and eliminates common deployment barriers. Evaluations demonstrate that Caracal performs competitively with Transformer and SSM baselines, offering a scalable and simple pathway for efficient long-sequence modeling. Code is available in the Appendix.
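The abstract does not spell out the MHF module, so the following is only a minimal sketch of the general idea behind contribution (2), under stated assumptions: a single real-valued channel, a hypothetical mixing filter `h` of the same length as the input, and the textbook zero-padded FFT convolution standing in for the authors' method. Zeros are appended on the right only (one plausible reading of "asymmetric padding"), the spectra are multiplied, and the result is truncated to the first L positions so that output t depends only on inputs at positions up to t. The function names `fft` and `causal_fft_mix` are illustrative and do not come from the paper.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two.

    The inverse transform is returned unnormalized (caller divides by n).
    """
    n = len(x)
    if n == 1:
        return list(x)
    sign = 1 if inverse else -1
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def causal_fft_mix(x, h):
    """Compute y[t] = sum over s <= t of h[t - s] * x[s] in O(L log L).

    Assumption: x and h are real sequences of the same length L.
    """
    L = len(x)
    if len(h) != L:
        raise ValueError("this sketch assumes len(h) == len(x)")
    n = 1
    while n < 2 * L:  # pad so the circular convolution cannot wrap around
        n *= 2
    # Asymmetric (right-only) zero padding of both sequences.
    X = fft(list(x) + [0.0] * (n - L))
    H = fft(list(h) + [0.0] * (n - L))
    # Pointwise product in the frequency domain is convolution in time.
    y = fft([a * b for a, b in zip(X, H)], inverse=True)
    # Truncate to the first L outputs and apply the 1/n normalization.
    return [v.real / n for v in y[:L]]
```

With this construction, changing `x[3]` leaves `causal_fft_mix(x, h)[:3]` unchanged, which is the autoregressive property a generative model needs. In practice one would call a library FFT rather than the hand-written transform shown here; it is included only to keep the sketch dependency-free.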

Bingzheng Gan, Tianyi Zhang, Yusu Li, Jing Huang, Wei Shi, Yangkai Ding, Tao Yu • 2026

Related benchmarks

Task	Dataset	Result
Commonsense Reasoning	HellaSwag	HellaSwag Accuracy 45.1	897
Question Answering	ARC Challenge	Accuracy (ARC) 29.69	631
Question Answering	PIQA	Accuracy 69.26	589
Commonsense Reasoning	WinoGrande	Accuracy 53.2	453
Language Modeling	LAMBADA	Accuracy 35.26	412
Question Answering	BoolQ	Accuracy 61.9	233
Social Interaction Question Answering	SIQA	Accuracy 39.51	157
Information Extraction and Retrieval	SWDE	Accuracy 0.0882	5
Information Extraction and Retrieval	FDA	Accuracy 1.91	5
