Lighting the Way for BRIGHT: Reproducible Baselines with Anserini, Pyserini, and RankLLM

About

Retrieval benchmarks for large language models (LLMs) should reflect the long, reasoning-intensive queries typical of retrieval-augmented generation (RAG). We present a systematic study of BRIGHT, a reasoning-focused retrieval benchmark, along with strong, reproducible reference methods integrated into Anserini, Pyserini, and RankLLM. We evaluate lexical, sparse, dense, and fusion-based retrievers, as well as LLM rerankers, under long-query settings. In reproducing BRIGHT's lexical baseline, we identify a key under-documented detail: query-side BM25 (BM25Q), which applies BM25 weighting to the query itself. On long, multi-sentence queries, BM25Q consistently outperforms standard BM25, making it the strongest lexical baseline for reasoning-oriented retrieval. We further audit the BRIGHT corpus, uncovering data quality issues that affect evaluation, and propose mitigations. Finally, we study the generalizability of BM25Q across five additional benchmarks, finding that its gains are largely specific to BRIGHT, while fusion with standard BM25 provides the most consistent improvements across datasets.
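
To make the BM25Q idea concrete, the following is a minimal sketch, not the paper's implementation. It assumes that "applying BM25 weighting to the query itself" means passing query term frequencies through the same saturation and length normalization used on the document side, with the query length normalized against the average document length. The function names, the whitespace tokenizer, and the parameter values (k1 = 0.9, b = 0.4) are illustrative assumptions; the abstract does not specify the exact formulation.

```python
import math
from collections import Counter

# Illustrative parameters (assumption): not taken from the paper.
K1, B = 0.9, 0.4


def tokenize(text):
    """Hypothetical whitespace tokenizer; real systems use a proper analyzer."""
    return text.lower().split()


def saturate(tf, length, avg_length, k1=K1, b=B):
    """BM25 term-frequency saturation with length normalization."""
    norm = 1.0 - b + b * length / avg_length
    return tf * (k1 + 1.0) / (tf + k1 * norm)


def build_stats(docs):
    """Tokenize the corpus and compute average length and IDF per term."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    df = Counter(t for d in tokenized for t in set(d))
    idf = {t: math.log(1.0 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}
    return tokenized, avgdl, idf


def score(query, doc_tokens, avgdl, idf, query_side=False):
    """Score one document.

    query_side=False: repeated query terms count linearly (raw query tf).
    query_side=True:  query tf is saturated like document tf (assumed BM25Q).
    """
    q_tf = Counter(tokenize(query))
    d_tf = Counter(doc_tokens)
    q_len = sum(q_tf.values())
    total = 0.0
    for term, qf in q_tf.items():
        if term not in d_tf:
            continue
        doc_w = idf.get(term, 0.0) * saturate(d_tf[term], len(doc_tokens), avgdl)
        q_w = saturate(qf, q_len, avgdl) if query_side else float(qf)
        total += q_w * doc_w
    return total


docs = ["solar panel efficiency", "wind turbine noise", "solar solar solar"]
tokenized, avgdl, idf = build_stats(docs)
long_query = " ".join(["solar"] * 50 + ["panel"])
for tokens in tokenized:
    standard = score(long_query, tokens, avgdl, idf, query_side=False)
    bm25q = score(long_query, tokens, avgdl, idf, query_side=True)
    print(tokens, round(standard, 3), round(bm25q, 3))
```

Under these assumptions, a term repeated many times in a long query contributes at most a factor of k1 + 1 on the query side, rather than growing linearly with its count. That dampening is one plausible reason query-side weighting could matter for long, multi-sentence queries.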

Sahel Sharifymoghaddam, Yijun Ge, Jimmy Lin • 2025

Related benchmarks

Task	Dataset	Result
Information Retrieval	BEIR	--	174
Information Retrieval	BRIGHT	Mean nDCG@10: 14.8	94
Passage Reranking	BRIGHT	nDCG@10 (Avg): 33.4	54
First stage retrieval	BRIGHT (test)	nDCG@10 (Biology): 54.5	13
Retrieval	MIRACL	nDCG@10: 38.2	12
Retrieval	CURE v1	nDCG@10: 34.6	4
Retrieval	TREC Disks 1 and 2	nDCG@10: 49.9	4
Retrieval	MS Marco	nDCG@10: 29.7	4
Retrieval	All tasks Pooled	nDCG@10: 35.5	4

