
Prototype Transformer: Towards Language Model Architectures Interpretable by Design

About

While state-of-the-art language models (LMs) surpass most humans in certain domains, their reasoning remains largely opaque, reducing trust and increasing the risk of deception and hallucination. We introduce the Prototype Transformer (ProtoT), an autoregressive LM architecture that replaces the quadratic-cost self-attention module of the Transformer with a linear-cost module based on prototypes, which are learned parameter vectors. In ProtoT, prototypes create communication channels that aggregate contextual information at different time scales. We show that this structure leads prototypes to automatically capture nameable concepts, such as "woman", during training, offering a path toward interpreting model reasoning and making targeted edits to model behavior. Compared with baselines, ProtoT scales well with model and data size, is robust to input perturbations, and performs well on text generation and downstream tasks, including GLUE. These results suggest that ProtoT is a promising step toward autoregressive language models that are more interpretable by design.
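The abstract describes the prototype module only at a high level, so the following is a purely illustrative sketch of how a linear-cost, prototype-based mixing step could work. It is not ProtoT's actual formulation. The function name `prototype_mix`, the softmax assignment, the exponential-moving-average update, and the per-prototype `decays` are all assumptions introduced here to make the ideas of "communication channels" and "different time scales" concrete.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def prototype_mix(tokens, prototypes, decays):
    """Causal mixing through K prototype channels (hypothetical, not the paper's method).

    tokens:     list of T vectors of length d
    prototypes: list of K vectors of length d (learned parameters in a real model)
    decays:     list of K values in [0, 1); a larger decay means a longer time scale
    """
    d = len(tokens[0])
    num_protos = len(prototypes)
    # One running summary per prototype channel.
    state = [[0.0] * d for _ in range(num_protos)]
    outputs = []
    for x in tokens:  # single left-to-right pass: O(T * K * d), linear in T
        # Soft assignment of the current token to each prototype.
        w = softmax([dot(x, p) for p in prototypes])
        for k in range(num_protos):
            a = decays[k]
            # Exponential moving average: channel k keeps a decaying summary
            # of the tokens that were assigned to it.
            state[k] = [a * s + (1 - a) * w[k] * xi for s, xi in zip(state[k], x)]
        # The token reads back from the channels, weighted by its assignment.
        outputs.append(
            [sum(w[k] * state[k][j] for k in range(num_protos)) for j in range(d)]
        )
    return outputs


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prototypes = [[1.0, 0.0], [0.0, 1.0]]
mixed = prototype_mix(tokens, prototypes, decays=[0.5, 0.9])
```

Because each token touches only the K channel states and not every earlier token, the cost grows linearly with sequence length, in contrast to the quadratic cost of pairwise self-attention. The assignment weights `w` are also where an interpretability analysis could begin in a design like this, since they show which prototype a token is routed through.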

Yordan Yordanov, Matteo Forasassi, Bayar Menzat, Ruizhi Wang, Chang Qi, Markus Kaltenberger, Amine M'Charrak, Tommaso Salvatori, Thomas Lukasiewicz • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Language Modeling | FineWeb-Edu (test) | Perplexity (Test) | 29.5 | 58 |
| Robustness Evaluation | Lexical Variation (abbr.) | Jensen-Shannon Divergence | 0.0498 | 8 |
| Open-ended Text Generation | Chatbot Arena inspired qualitative prompts (val) | ELO | 1.02e+3 | 4 |
| Robustness Evaluation | Lexical Variation (punctuation) | Jensen-Shannon Divergence | 0.3982 | 4 |
| Robustness Evaluation | Lexical Variation spelling | Jensen-Shannon Divergence | 0.026 | 4 |
| Robustness Evaluation | Lexical Variation synonym | Jensen-Shannon Divergence | 0.1132 | 4 |
| Robustness Evaluation | Lexical Variation typos | Jensen-Shannon Divergence | 0.2074 | 4 |
| Natural Language Understanding | GLUE downstream fine-tuning | CoLA Score | 27.7 | 4 |
| Robustness Evaluation | Lexical Variation contraction | Jensen-Shannon Divergence | 0.0823 | 4 |
