SwiftEmbed: Ultra-Fast Text Embeddings via Static Token Lookup for Real-Time Applications
About
We present SwiftEmbed, a production-oriented serving system for static token embeddings that achieves 1.12 ms p50 latency for single-text requests while maintaining a 60.6 MTEB average score across 8 representative tasks. Built around the open-source Potion-base-8M distilled model from MinishLab and implemented in Rust, the system delivers 50,000 requests per second through static embedding lookup, mean pooling, and zero-copy IEEE 754 binary serialization. Evaluation demonstrates exceptional duplicate detection performance (90.1% AP) and strong semantic similarity (76.1% Spearman correlation). Performance relative to Sentence-BERT is task-dependent: robust for deduplication and similarity workloads (89–100%), substantially lower for classification and complex retrieval tasks (75%). Domain-specific performance ranges from 75% to 131% of a GloVe-840B baseline. The system targets real-time embedding applications where sub-5 ms latency is operationally critical and where full transformer inference is not feasible.
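To make the pipeline named in the abstract concrete, here is a minimal Rust sketch of static token lookup, mean pooling, and IEEE 754 binary serialization. It is not SwiftEmbed's implementation: the `StaticEmbedder` type, the three-token vocabulary, the 4-dimensional vectors, whitespace tokenization, and the skip-unknown-tokens policy are all assumptions made for illustration. The serializer copies values into a byte buffer for simplicity, so it shows the wire format (little-endian `f32`, also an assumption) rather than a zero-copy path.

```rust
use std::collections::HashMap;

/// Toy static embedding table: token -> fixed vector.
/// Vocabulary and values are hypothetical, not taken from Potion-base-8M.
struct StaticEmbedder {
    dim: usize,
    table: HashMap<&'static str, Vec<f32>>,
}

impl StaticEmbedder {
    fn new() -> Self {
        let mut table = HashMap::new();
        table.insert("fast", vec![1.0, 0.0, 2.0, 0.0]);
        table.insert("text", vec![0.0, 1.0, 0.0, 2.0]);
        table.insert("embeddings", vec![2.0, 2.0, 1.0, 1.0]);
        StaticEmbedder { dim: 4, table }
    }

    /// Look up each whitespace-separated token and mean-pool the rows found.
    /// Unknown tokens are skipped; an all-unknown input yields a zero vector.
    fn embed(&self, text: &str) -> Vec<f32> {
        let mut sum = vec![0.0f32; self.dim];
        let mut count = 0usize;
        for token in text.split_whitespace() {
            if let Some(row) = self.table.get(token) {
                for (acc, v) in sum.iter_mut().zip(row) {
                    *acc += *v;
                }
                count += 1;
            }
        }
        if count > 0 {
            let inv = 1.0 / count as f32;
            for acc in sum.iter_mut() {
                *acc *= inv;
            }
        }
        sum
    }
}

/// Serialize an embedding as raw little-endian IEEE 754 f32 values,
/// 4 bytes per dimension. This version copies; it is not zero-copy.
fn to_le_bytes(embedding: &[f32]) -> Vec<u8> {
    let mut out = Vec::with_capacity(embedding.len() * 4);
    for v in embedding {
        out.extend_from_slice(&v.to_le_bytes());
    }
    out
}

fn main() {
    let embedder = StaticEmbedder::new();
    let pooled = embedder.embed("fast text");
    let payload = to_le_bytes(&pooled);
    println!("pooled = {:?}, payload = {} bytes", pooled, payload.len());
}
```

No matrix multiplication or attention is involved: per request the work is one hash lookup per token plus an element-wise average, which is why this style of embedding can be served far faster than full transformer inference.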
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Throughput Scalability | General Inference Workload | RPS | 5.00e+4 | 15 |
| Massive Text Embedding Evaluation | MTEB 8 representative tasks including Banking77, SprintDuplicateQuestions, TwitterSemEval, ArguAna (test) | Classification Acc | 0.589 | 8 |
| Text Embedding | MTEB | MTEB Quality Score | 60.6 | 8 |
| Duplicate Detection | Downstream Task Evaluation | Score | 1 | 2 |
| Classification | Downstream Task Evaluation | Score | 50 | 2 |
| Clustering | Downstream Task Evaluation | Score | 0.627 | 2 |
| Semantic Search | Downstream Task Evaluation | Score | 87.7 | 2 |