Improving BM25 Code Retrieval Under Fixed Generic Tokenization: Adaptive q-Log Odds as a Drop-In BM25 Fix

About

In retrieval-augmented coding, failures often begin when the relevant file is absent from the retrieved context. Under frozen generic tokenization, where a BM25 index has been built by a search system whose analyzer the practitioner does not control, this failure is routine: BM25's logarithmic RSJ-odds IDF under-separates the identifier tail that distinguishes one function from another. We replace the outer logarithm of the Robertson-Spärck-Jones odds with a q-logarithm. At q=1 the transform recovers BM25 exactly by L'Hôpital's rule, and for q<1 it is a Box-Cox transform of the RSJ odds with lambda = 1-q. On CoIR CodeSearchNet Go (182K documents), oracle-tuned NDCG@10 rises from 0.2575 to 0.4874 (absolute +0.2299; +89.3% relative; zero sign reversals in 10,000 paired-bootstrap resamples, reported as p <= 10^-4). The effect is graded across code languages and is near-zero on BEIR text. A one-parameter closed form estimates a corpus-level q from hapax density and stays near q=1 on corpora where BM25 is already optimal. The index-time cost is a single pass over the sparse score matrix and query latency is unchanged. A tokenizer ablation shows that identifier-aware tokenization largely removes the incremental gain from q-IDF.
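The transform described in the abstract can be illustrated with a minimal sketch. It assumes the classic RSJ odds with 0.5 smoothing, (N - n + 0.5) / (n + 0.5), and the standard q-logarithm ln_q(x) = (x^(1-q) - 1) / (1 - q). The function names, the document frequencies, and the value q = 0.7 are hypothetical choices for illustration, not the authors' code or tuned settings, and the hapax-density estimator for q is not reproduced here.

```python
import math


def q_log(x: float, q: float) -> float:
    """q-logarithm: (x**(1-q) - 1) / (1-q), with the natural log as the q = 1 limit."""
    if x <= 0:
        raise ValueError("q_log requires x > 0")
    lam = 1.0 - q  # Box-Cox exponent, lambda = 1 - q
    if lam == 0.0:
        return math.log(x)
    # expm1 keeps the result accurate when lam is very close to zero
    return math.expm1(lam * math.log(x)) / lam


def rsj_odds(n_docs: int, doc_freq: int) -> float:
    """Robertson-Spärck-Jones odds with the usual 0.5 smoothing (assumed form)."""
    return (n_docs - doc_freq + 0.5) / (doc_freq + 0.5)


def q_idf(n_docs: int, doc_freq: int, q: float) -> float:
    """IDF with the outer logarithm replaced by a q-logarithm; q = 1 is plain BM25 IDF."""
    return q_log(rsj_odds(n_docs, doc_freq), q)


# Illustrative comparison on a corpus of 182,000 documents (size taken from the abstract).
N_DOCS = 182_000
for df in (1, 10, 1_000):
    print(df, q_idf(N_DOCS, df, 1.0), q_idf(N_DOCS, df, 0.7))
```

With q < 1 the gap between the weight of a term seen once and a term seen a thousand times widens relative to the logarithmic case, which is the separation of the identifier tail that the abstract refers to. Because only the per-term IDF factor changes, the rest of the BM25 scoring formula is left as it is in this sketch.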

Santosh Kumar Radha, Oktay Goktas • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
Document Retrieval | SciFact BEIR | Delta nDCG@10 | 0.006 | 16
Code and text retrieval | CoIR Go (dev) | NDCG@10 | 48.7 | 2
Code and text retrieval | CoIR Java (dev) | NDCG@10 | 38.3 | 2
Code and text retrieval | CoIR Ruby (dev) | NDCG@10 | 37 | 2
Code and text retrieval | CoIR-PHP (HELD) | NDCG@10 | 0.373 | 2
Code and text retrieval | CoIR-JavaScript (HELD) | NDCG@10 | 36.2 | 2
Code and text retrieval | CoIR-Python (HELD) | NDCG@10 | 0.727 | 2
Code and text retrieval | BEIR-NFCorpus control | NDCG@10 | 0.305 | 2
Code and text retrieval | BEIR-ArguAna control | NDCG@10 | 36 | 2