Unlocking Noisy Real-World Corpora for Foundation Model Pre-Training via Quality-Aware Tokenization

About

Current tokenization methods process sequential data without accounting for signal quality, limiting their effectiveness on noisy real-world corpora. We present QA-Token (Quality-Aware Tokenization), which incorporates data reliability directly into vocabulary construction. We make three key contributions: (i) a bilevel optimization formulation that jointly optimizes vocabulary construction and downstream performance, (ii) a reinforcement learning approach that learns merge policies through quality-aware rewards with convergence guarantees, and (iii) an adaptive parameter learning mechanism via Gumbel-Softmax relaxation for end-to-end optimization. Our experimental evaluation demonstrates consistent improvements in genomics (a 6.7 percentage point F1 gain in variant calling over BPE) and in finance (a 30% Sharpe ratio improvement). At foundation scale, we tokenize a pretraining corpus comprising 1.7 trillion base-pairs and achieve state-of-the-art pathogen detection (94.53 MCC) while reducing token count by 15%. We unlock noisy real-world corpora, spanning petabases of genomic sequences and terabytes of financial time series, for foundation model training with zero inference overhead.
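To make the idea of folding data reliability into vocabulary construction concrete, here is a minimal sketch of a BPE-style merge loop in which each adjacent pair is scored by a quality-weighted count instead of a raw count. This is not the QA-Token algorithm: the abstract does not give its reward or merge rule, and the sketch replaces the learned policy with a greedy choice. The function names (`pair_scores`, `merge_pair`, `quality_aware_bpe`), the per-symbol quality scores in [0, 1], and the choice of weighting a pair by the minimum quality of its two members are all assumptions made for illustration.

```python
from collections import defaultdict


def pair_scores(sequences, qualities):
    """Sum a quality weight for every adjacent token pair (hypothetical scoring)."""
    scores = defaultdict(float)
    for tokens, quals in zip(sequences, qualities):
        for i in range(len(tokens) - 1):
            pair = (tokens[i], tokens[i + 1])
            # Assumption: a pair is only as reliable as its least reliable member.
            scores[pair] += min(quals[i], quals[i + 1])
    return scores


def merge_pair(tokens, quals, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    out_tokens, out_quals = [], []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out_tokens.append(tokens[i] + tokens[i + 1])
            out_quals.append(min(quals[i], quals[i + 1]))
            i += 2
        else:
            out_tokens.append(tokens[i])
            out_quals.append(quals[i])
            i += 1
    return out_tokens, out_quals


def quality_aware_bpe(sequences, qualities, num_merges):
    """Greedy BPE-style training that picks the pair with the highest quality-weighted score."""
    seqs = [list(s) for s in sequences]
    quals = [list(q) for q in qualities]
    merges = []
    for _ in range(num_merges):
        scores = pair_scores(seqs, quals)
        if not scores:
            break  # nothing left to merge
        best = max(scores, key=scores.get)
        merges.append(best)
        merged = [merge_pair(t, q, best) for t, q in zip(seqs, quals)]
        seqs = [m[0] for m in merged]
        quals = [m[1] for m in merged]
    return merges, seqs


if __name__ == "__main__":
    # "ACAC" is a low-quality read, "GT" a high-quality one.
    merges, seqs = quality_aware_bpe(["ACAC", "GT"], [[0.1] * 4, [0.9, 0.9]], 1)
    print(merges, seqs)
```

In this toy input the pair A,C occurs twice and would win under plain frequency counting, but both occurrences sit in a low-quality read, so the quality-weighted score prefers the single high-quality G,T pair. Other weightings (mean quality, a learned reward) would fit the same loop.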

Arvid E. Gollwitzer, Paridhi Latawa, David de Gruijl, Deepak A. Subramanian, Adrián Noriega de la Colina • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
--- | --- | --- | --- | ---
Tweet Classification | TweetEval 1.0 (test) | Emoji (M-F1) | 34.2 | 18
Taxonomic Classification | CAMI II metagenome 2017 | Taxa F1 Score | 91.7 | 9
Variant Calling | GIAB HG002 truth set (test) | F1 Score (Variant) | 89.1 | 9
Sequence Reconstruction | Genomic Reads ART simulator 150bp paired-end GRCh38 reference | Reconstruction Rate | 24.1 | 9
Genomics Variant Calling | GIAB HG002 ONT | Variant F1 | 86.4 | 8
Next-Generation Sequencing Analysis | UHGG NGS | Variant F1 | 91.5 | 8
Pathogen Detection | Pathogen Detection (T-1, T-2, T-3, T-4, T-5) | T-1 Accuracy | 93.8 | 8
Sequence Reconstruction | GIAB HG002 ONT | Recon Loss | 0.305 | 8
Taxonomic Classification | GIAB HG002 ONT | Taxa Accuracy F1 | 0.881 | 8
Pathogen Detection | GUE (Genomic Understanding Evaluation) | Pathogen Detection Average Accuracy | 94.53 | 6
Showing 10 of 17 rows
