
Lighting the Way for BRIGHT: Reproducible Baselines with Anserini, Pyserini, and RankLLM

About

Retrieval benchmarks for large language models (LLMs) should reflect the long, reasoning-intensive queries typical of retrieval-augmented generation (RAG). We present a systematic study of BRIGHT, a reasoning-focused retrieval benchmark, along with strong, reproducible reference methods integrated into Anserini, Pyserini, and RankLLM. We evaluate lexical, sparse, dense, and fusion-based retrievers, as well as LLM rerankers, under long-query settings. In reproducing BRIGHT's lexical baseline, we identify a key under-documented detail: query-side BM25 (BM25Q), which applies BM25 weighting to the query itself. On long, multi-sentence queries, BM25Q consistently outperforms standard BM25, making it the strongest lexical baseline for reasoning-oriented retrieval. We further audit the BRIGHT corpus, uncovering data quality issues that impact evaluation, and offer mitigation. Finally, we study the generalizability of BM25Q across five additional benchmarks, finding its gains largely specific to BRIGHT, while fusion with standard BM25 provides the most consistent improvements across datasets.
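The difference between standard BM25 and query-side weighting can be illustrated with a minimal sketch. The abstract does not give the exact BM25Q formulation, so the details here are assumptions made for illustration: the function names, the parameter values k1 = 0.9 and b = 0.4, the IDF variant, and the choice to saturate query term frequency with the same formula used on the document side (normalizing query length against the average document length). It is not the Anserini or Pyserini implementation.

```python
import math
from collections import Counter

K1, B = 0.9, 0.4  # illustrative parameter values (assumption)


def tf_weight(tf, length, avg_length, k1=K1, b=B):
    """Saturating, length-normalized BM25 term-frequency component."""
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * length / avg_length))


def idf(term, docs):
    """Non-negative BM25 inverse document frequency over tokenized docs."""
    n = sum(1 for d in docs if term in d)
    return math.log(1 + (len(docs) - n + 0.5) / (n + 0.5))


def score(query, doc, docs, query_side=False):
    """Score one tokenized doc against a tokenized query.

    query_side=False: each query occurrence counts fully (raw query tf).
    query_side=True:  query tf is saturated like document tf
                      (hypothetical reading of BM25Q, see lead-in).
    """
    avg_dl = sum(len(d) for d in docs) / len(docs)
    q_tf, d_tf = Counter(query), Counter(doc)
    total = 0.0
    for term, qf in q_tf.items():
        if term not in d_tf:
            continue
        doc_part = idf(term, docs) * tf_weight(d_tf[term], len(doc), avg_dl)
        if query_side:
            query_part = tf_weight(qf, len(query), avg_dl)
        else:
            query_part = qf
        total += query_part * doc_part
    return total


docs = [["cat", "sat"], ["dog", "ran"], ["cat", "dog", "play"]]
long_query = ["cat", "cat", "cat", "dog"]  # repeated term, as in long queries

bm25 = score(long_query, docs[0], docs)
bm25q = score(long_query, docs[0], docs, query_side=True)
print(f"BM25: {bm25:.3f}  BM25Q: {bm25q:.3f}")
```

In this toy setup, a term repeated in the query contributes linearly under standard scoring but saturates under the query-side variant, which is one plausible reason the two behave differently on long, multi-sentence queries.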

Sahel Sharifymoghaddam, Yijun Ge, Jimmy Lin • 2025

Related benchmarks

Task                    Dataset               Result                     Rank
Information Retrieval   BEIR                  --                         120
Information Retrieval   BRIGHT                Mean nDCG@10: 14.8         94
Passage Reranking       BRIGHT                NDCG@10 (Avg): 33.4        54
First stage retrieval   BRIGHT (test)         nDCG@10 (Biology): 54.5    13
Retrieval               MIRACL                nDCG@10: 38.2              12
Retrieval               CURE v1               nDCG@10: 34.6              4
Retrieval               TREC Disks 1 and 2    nDCG@10: 49.9              4
Retrieval               MS Marco              nDCG@10: 29.7              4
Retrieval               All tasks Pooled      nDCG@10: 35.5              4
