
The Right Answer, the Wrong Direction: Why Transformers Fail at Counting and How to Fix It

About

Large language models often fail at simple counting tasks, even when the items to count appear in the prompt. We investigate whether this failure occurs because transformers do not represent counts internally or because they cannot convert those representations into the correct output tokens. Across three model families (Pythia, Qwen3, and Mistral) ranging from 0.4B to 14B parameters, we find evidence for the second explanation. Linear probes recover the correct count from intermediate layers with $R^2>0.99$, showing that the information is present. However, the internal directions that encode counts are nearly orthogonal to the digit-token rows of the output head ($|\cos| \leq 0.032$). In other words, the model stores the count in a form that the digit logits do not naturally read out. We localize this failure with two interventions. Updating only the digit rows of the output head (36,864 parameters) substantially improves constrained digit prediction (60.7--100.0% on four tasks) but does not fix unconstrained generation (0%); we do not claim that digit-row repair fixes open-ended text. By contrast, a small LoRA on the attention Q/V projections (7.67M parameters) improves upstream routing and achieves 83.1%$\pm$7.2% in true greedy autoregressive generation, making it the deployable fix. Under the logit lens at layer 35 (entity counting, rank of the correct digit), (i) the median over 3 seeds drops from order $10^4$ to 1, and (ii) seed 42 shows $54{,}332 \to 838$; that is, the median reaches top-1 while one seed stays far below it. Norm, logit-lens, and cross-task analyses generalize the bottleneck to counting, addition, and list length; we report null results on MMLU and GSM8K and limited transfer to DROP. These results identify counting failure as a geometric readout bottleneck, not an internal-representation failure: the model knows the count, but the output pathway is misaligned with the tokens needed to express it.
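The three measurements named in the abstract (probe $R^2$, cosine alignment with digit rows, and logit-lens rank of the correct digit) can be illustrated with a minimal NumPy sketch. This is not the authors' code: the hidden states, the unembedding matrix `W_U`, the digit token ids, and all sizes below are synthetic assumptions, and using the probe weight vector as "the count direction" is also an assumption. In this toy setup the near-orthogonality simply follows from random vectors in high dimension, so it demonstrates the measurement, not the paper's finding.

```python
import numpy as np

def fit_linear_probe(H, y):
    """Least-squares probe y ~ H @ w + b; returns (w, b, in-sample R^2)."""
    X = np.hstack([H, np.ones((H.shape[0], 1))])  # append bias column
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    w, b = coef[:-1], coef[-1]
    pred = H @ w + b
    ss_res = np.sum((y - pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return w, b, 1.0 - ss_res / ss_tot

def max_abs_cosine(direction, rows):
    """Largest |cos| between one direction and each row of a matrix."""
    d = direction / np.linalg.norm(direction)
    R = rows / np.linalg.norm(rows, axis=1, keepdims=True)
    return float(np.max(np.abs(R @ d)))

def correct_token_rank(h, W_U, target_id):
    """1-based rank of target_id when h is read out directly through W_U."""
    logits = W_U @ h
    return int(np.sum(logits > logits[target_id])) + 1

# --- Synthetic stand-ins (assumptions, not real model activations) ---
rng = np.random.default_rng(0)
d_model, n_samples, vocab = 256, 2000, 1000

count_dir = rng.standard_normal(d_model)
count_dir /= np.linalg.norm(count_dir)

counts = rng.integers(1, 10, size=n_samples).astype(float)
# Hidden state = count written along one direction, plus small noise.
H = counts[:, None] * count_dir[None, :]
H += 0.05 * rng.standard_normal((n_samples, d_model))

# Hypothetical output head; rows 0..9 play the role of digit tokens.
W_U = rng.standard_normal((vocab, d_model)) / np.sqrt(d_model)
digit_ids = np.arange(10)

w, b, r2 = fit_linear_probe(H, counts)
cos = max_abs_cosine(w, W_U[digit_ids])
rank = correct_token_rank(H[0], W_U, target_id=int(counts[0]))

print(f"probe R^2 = {r2:.4f}")
print(f"max |cos(probe direction, digit rows)| = {cos:.3f}")
print(f"logit-lens rank of correct digit for sample 0 = {rank}")
```

With real models, `H` would be intermediate-layer activations at the answer position and `W_U` the model's output-head weight matrix; a high $R^2$ combined with a small cosine and a poor rank is the pattern the abstract describes as a readout bottleneck.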

Gabriel Garcia • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
Entity counting | NL Counting (held-out) | Accuracy | 98.7 | 14
Entity counting | Entity counting Qwen3-8B prompts N=200x3 seeds (test) | Greedy Generation Accuracy | 97 | 7
Entity counting | Entity counting 200 prompts Qwen3-8B (test) | Accuracy | 98.7 | 5
Majority Vote | Majority-vote 432 prompts (test) | Accuracy | 100 | 5
Character count | Character count 200 prompts Qwen3-8B (test) | Accuracy | 98 | 4
Max extraction | Max-extraction (test) | Accuracy | 40 | 4
List length | List length 200 prompts Qwen3-8B (test) | Accuracy | 99.2 | 4
Addition | Addition 200 prompts Qwen3-8B (test) | Accuracy | 100 | 3
Entity counting | NL Counting (train) | Accuracy | 99.2 | 2
