
Heterogeneity in Formal Linguistic Competence of Language Models: Is Data the Real Bottleneck?

About

Large Language Models (LLMs) exhibit a puzzling disparity in their formal linguistic competence: while they learn some linguistic phenomena with near-perfect mastery, they often perform below chance on others, even after training on trillions of tokens. In this work, we investigate whether these failures stem from inherent architectural limitations or simply from the scarcity of the relevant grammatical constructions in web-scale corpora. We pre-train GPT-2 Small (124M) models on a 100M-token random sample of the FineWeb corpus and intervene by injecting a minimal amount (1%) of synthetic data targeting specific linguistic phenomena. We find that this targeted intervention substantially improves model performance on 8 of the 9 worst-performing BLiMP paradigms; notably, accuracy on the only_npi_scope paradigm rises from 20.9% to 69.4%. Furthermore, these interventions generally preserve or slightly improve aggregate performance. We also identify one resistant paradigm, principle_A_c_command, whose accuracy remains below chance even after our data augmentation. Nevertheless, our findings serve as an optimistic existence proof that even small language models can substantially improve on the linguistic phenomena on which models typically perform poorly, provided the pre-training data contains sufficient exposure to them. This suggests that efforts towards human-scale language modeling may benefit greatly from focusing on data composition. The code to reproduce our results is open-sourced at https://github.com/kowndinya-renduchintala/heterogeneity-in-formal-linguistic-competence.
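
As a rough illustration of the data intervention described above, the sketch below mixes synthetic documents into a base corpus until they amount to about 1% of the base token count. This is not the authors' pipeline, which is available in their repository. The function name inject_synthetic, the whitespace token count, and the choice to add synthetic tokens on top of the base corpus rather than replace part of it are all assumptions made for illustration.

```python
import random


def count_tokens(text):
    # Whitespace tokens stand in for a real tokenizer (assumption).
    return len(text.split())


def inject_synthetic(base_docs, synthetic_docs, fraction=0.01, seed=0):
    """Add synthetic docs worth at most `fraction` of the base token count.

    Returns the shuffled mixed corpus and the number of synthetic tokens used.
    Hypothetical helper; the paper's exact mixing procedure may differ.
    """
    rng = random.Random(seed)
    budget = fraction * sum(count_tokens(d) for d in base_docs)

    # Sample synthetic documents in random order until the budget is reached.
    pool = list(synthetic_docs)
    rng.shuffle(pool)
    chosen, used = [], 0
    for doc in pool:
        n = count_tokens(doc)
        if used + n > budget:
            break
        chosen.append(doc)
        used += n

    # Interleave synthetic and base documents so training sees them mixed.
    mixed = list(base_docs) + chosen
    rng.shuffle(mixed)
    return mixed, used
```

Budgeting by tokens rather than by document count keeps the synthetic share comparable across phenomena whose example sentences differ in length.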

H S V N S Kowndinya Renduchintala, Sumit Bhatia • 2026

Related benchmarks

Task                                    Dataset                                                    Result         Rank
Linguistic Minimal Pair Classification  BLiMP only_npi_scope                                       Accuracy 69.4  3
Linguistic Minimal Pair Classification  BLiMP existential_there_quantifiers_2                      Accuracy 51.8  3
Linguistic Minimal Pair Classification  BLiMP wh_vs_that_with_gap_long_distance                    Accuracy 41.2  3
Linguistic Minimal Pair Classification  BLiMP matrix_question_npi_licensor_present                 Accuracy 44.2  3
Linguistic Minimal Pair Classification  BLiMP principle_A_reconstruction                           Accuracy 78.9  3
Linguistic Minimal Pair Classification  BLiMP sentential_subject_island                            Accuracy 52.2  3
Linguistic Minimal Pair Classification  BLiMP left_branch_island_echo_question                     Accuracy 63.2  3
Linguistic Minimal Pair Classification  BLiMP coordinate_structure_constraint_complex_left_branch  Accuracy 62.4  3
Linguistic Minimal Pair Classification  BLiMP principle_A_c_command                                Accuracy 46    3
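
Each accuracy listed above is a minimal-pair score: the percentage of sentence pairs for which a model assigns a higher probability to the acceptable sentence than to its unacceptable counterpart, so chance is 50%. A minimal sketch of that computation follows. The log_prob callable is a hypothetical stand-in for a language model's sentence scorer, and the sentence_good and sentence_bad keys follow the field names used in the BLiMP data files.

```python
def minimal_pair_accuracy(pairs, log_prob):
    """Percentage of pairs where the acceptable sentence scores higher.

    pairs    : list of dicts with "sentence_good" and "sentence_bad" keys
    log_prob : callable mapping a sentence to its total log-probability
               (hypothetical; in practice this would wrap a trained model)
    """
    if not pairs:
        raise ValueError("need at least one minimal pair")
    correct = sum(
        1
        for p in pairs
        if log_prob(p["sentence_good"]) > log_prob(p["sentence_bad"])
    )
    return 100.0 * correct / len(pairs)


# Toy usage with made-up scores instead of a real model (illustration only).
toy_scores = {
    "Only Bill would ever complain.": -12.0,
    "Even Bill would ever complain.": -14.5,
}
toy_pairs = [
    {
        "sentence_good": "Only Bill would ever complain.",
        "sentence_bad": "Even Bill would ever complain.",
    }
]
print(minimal_pair_accuracy(toy_pairs, toy_scores.__getitem__))
```

Ties count as errors in this sketch because the comparison is strict; whether the paper's evaluation handles ties the same way is not stated here.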