NameBERT: Scaling Name-Based Nationality Classification with LLM-Augmented Open Academic Data

About

Inferring nationality from personal names is a critical capability for equity and bias monitoring and for personalization, and a valuable tool in biomedical and sociological research. However, existing name-based nationality classifiers are typically trained on relatively small or source-specific labeled datasets, which can introduce coverage gaps and limit performance for underrepresented countries. While large language models (LLMs) demonstrate strong zero-shot performance for name-based nationality prediction, their computational cost and latency make them impractical for real-time, large-scale deployment. In this work, we create a large-scale name-nationality dataset from the Open Academic Graph (OAG) and introduce a framework that leverages LLMs as dataset enrichers rather than inference engines. We augment low-resource countries with LLM-generated names and evaluate on both real and synthetic-tail test sets. We find that augmentation produces large gains when evaluation includes synthetic tail names and still offers a modest lift on tail-country metrics when it does not. Overall, NameBERT models achieve significantly higher accuracy than state-of-the-art baselines across both in-domain and out-of-domain tasks, while remaining efficient for large-scale inference compared to LLMs.

Cong Ming, Ruixin Shi, Yifan Hu • 2026
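
The abstract describes using LLMs to enrich the training data for low-resource countries rather than to serve predictions. A minimal sketch of that idea follows, under stated assumptions: the function `augment_tail_countries`, the `min_per_country` threshold, and the `generate_names` callback (a stand-in for an LLM call) are all hypothetical names introduced here for illustration, and the paper's actual procedure may differ.

```python
from collections import Counter
from typing import Callable, Iterable


def augment_tail_countries(
    records: Iterable[tuple[str, str]],
    generate_names: Callable[[str, int], list[str]],
    min_per_country: int = 1000,
) -> list[tuple[str, str, bool]]:
    """Top up under-represented countries with synthetic names.

    Returns (name, country, is_synthetic) rows. `generate_names(country, k)`
    is a hypothetical hook that would wrap an LLM call; it is an assumption,
    not an API from the paper.
    """
    rows = [(name, country, False) for name, country in records]
    counts = Counter(country for _, country, _ in rows)

    for country, n in counts.items():
        deficit = min_per_country - n
        if deficit <= 0:
            continue  # country already has enough real examples
        existing = {name for name, c, _ in rows if c == country}
        for name in generate_names(country, deficit):
            if name not in existing:  # skip duplicates of names already present
                existing.add(name)
                rows.append((name, country, True))
    return rows


def stub_generator(country: str, k: int) -> list[str]:
    # Placeholder for an LLM: returns k obviously fake names.
    return [f"{country}-synthetic-{i}" for i in range(k)]


real = [("Ann Lee", "US"), ("Bob Ray", "US"), ("Cy Poe", "US"), ("Jon Sig", "IS")]
augmented = augment_tail_countries(real, stub_generator, min_per_country=3)
per_country = Counter(country for _, country, _ in augmented)
```

Keeping the `is_synthetic` flag on every row makes it possible to build separate real-only and synthetic-tail evaluation splits, which matches the two evaluation settings the abstract mentions.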

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Name Classification | OAG 1.4M (test) | Accuracy 80.8 | 9 |
| Nationality Prediction | OAG filter (test) | Accuracy 82.4 | 9 |
| Nationality Prediction | OAG filter_aug (test) | Accuracy 86.5 | 9 |
| Nationality Prediction | NaNa 12 Classes (test) | Accuracy 80.7 | 3 |
| Nationality Prediction | NaNa 39 Classes (test) | Accuracy 78.3 | 3 |
| Nationality Prediction | NaNa 99 Classes (test) | Accuracy 55.2 | 3 |
| Nationality Prediction | oag mini 99 classes 4K (test) | -- | 3 |
| Nationality classification | NaNa 12 Classes 1.0 (test_gold) | Accuracy 78.5 | 2 |
| Nationality classification | NaNa 39 Classes 1.0 (test gold) | Accuracy 72.2 | 2 |
| Nationality classification | NaNa 99 Classes 1.0 (test_gold) | Accuracy 40.2 | 2 |
Showing 10 of 12 rows
