Unlocking Noisy Real-World Corpora for Foundation Model Pre-Training via Quality-Aware Tokenization
About
Current tokenization methods process sequential data without accounting for signal quality, which limits their effectiveness on noisy real-world corpora. We present QA-Token (Quality-Aware Tokenization), which incorporates data reliability directly into vocabulary construction. We make three key contributions: (i) a bilevel optimization formulation that jointly optimizes vocabulary construction and downstream performance, (ii) a reinforcement learning approach that learns merge policies through quality-aware rewards with convergence guarantees, and (iii) an adaptive parameter learning mechanism via Gumbel-Softmax relaxation for end-to-end optimization. Our experimental evaluation shows consistent improvements in genomics (a 6.7 percentage point F1 gain in variant calling over BPE) and finance (a 30% Sharpe ratio improvement). At foundation scale, we tokenize a pretraining corpus comprising 1.7 trillion base pairs and achieve state-of-the-art pathogen detection (94.53 MCC) while reducing token count by 15%. We unlock noisy real-world corpora, spanning petabases of genomic sequences and terabytes of financial time series, for foundation model training with zero inference overhead.
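To make the idea of folding data reliability into vocabulary construction concrete, here is a minimal sketch of a single BPE-style merge step where each adjacent pair is scored by a quality-weighted count instead of a raw frequency. It is not the QA-Token algorithm: the function name `best_quality_weighted_merge`, the per-token quality scores in [0, 1], and the choice of the minimum of the two qualities as the weight are all assumptions made for illustration. The learned merge policy and reward described in the abstract are not reproduced here.

```python
from collections import defaultdict


def best_quality_weighted_merge(sequences, qualities):
    """Pick the adjacent token pair with the highest quality-weighted count.

    sequences: list of token lists.
    qualities: list of per-token quality lists in [0, 1], aligned with sequences.
    Returns the winning pair as a tuple, or None if there are no pairs.
    """
    scores = defaultdict(float)
    for tokens, quals in zip(sequences, qualities):
        for i in range(len(tokens) - 1):
            pair = (tokens[i], tokens[i + 1])
            # Assumption: an occurrence is only as reliable as its noisier token.
            scores[pair] += min(quals[i], quals[i + 1])
    if not scores:
        return None
    # Break score ties deterministically by the pair itself.
    return max(scores.items(), key=lambda kv: (kv[1], kv[0]))[0]


# Toy reads: ("G", "T") occurs three times but mostly at low quality,
# while ("A", "C") occurs twice at high quality.
sequences = [list("GTGTAC"), list("ACGT")]
qualities = [[0.1, 0.1, 0.1, 0.1, 0.9, 0.9], [0.9, 0.9, 0.2, 0.2]]
chosen = best_quality_weighted_merge(sequences, qualities)
```

In this toy input, a plain frequency count would favor ("G", "T"), whereas the quality-weighted score favors ("A", "C"), because low-confidence occurrences contribute little to a pair's score.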
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Tweet Classification | TweetEval 1.0 (test) | Emoji (M-F1) | 34.2 | 18 |
| Taxonomic Classification | CAMI II metagenome 2017 | Taxa F1 Score | 91.7 | 9 |
| Variant Calling | GIAB HG002 truth set (test) | F1 Score (Variant) | 89.1 | 9 |
| Sequence Reconstruction | Genomic Reads ART simulator 150bp paired-end GRCh38 reference | Reconstruction Rate | 24.1 | 9 |
| Genomics Variant Calling | GIAB HG002 ONT | Variant F1 | 86.4 | 8 |
| Next-Generation Sequencing Analysis | UHGG NGS | Variant F1 | 91.5 | 8 |
| Pathogen Detection | Pathogen Detection (T-1, T-2, T-3, T-4, T-5) | T-1 Accuracy | 93.8 | 8 |
| Sequence Reconstruction | GIAB HG002 ONT | Recon Loss | 0.305 | 8 |
| Taxonomic Classification | GIAB HG002 ONT | Taxa Accuracy F1 | 0.881 | 8 |
| Pathogen Detection | GUE (Genomic Understanding Evaluation) | Pathogen Detection Average Accuracy | 94.53 | 6 |