
Beyond Agreement: Scoring Panel-Surfaced Biomedical Entity Candidates for Curator Triage

About

Biomedical NER is deceptively simple for modern LLMs: plausible biomedical mentions are easy to surface, but corpus-convention correctness depends on annotation conventions, span boundaries, entity granularity, and type schemas. Multi-LLM agreement is a salience signal, not corpus-convention correctness. We introduce a candidate-level panel-output benchmark for panel-surfaced candidate verification, where the unit is an aligned candidate surfaced by an explicitly defined multi-model panel rather than a standalone extractor output. The benchmark aligns eight LLMs' predictions over five public biomedical NER datasets into a candidate master table. BioConCal is an in-domain supervised scorer that instantiates this layer with inference-time gold-free agreement, mention, surface-availability, and document features for a fixed candidate stream. In domain, BioConCal improves AUROC from 0.753 for raw agreement to 0.910. At a validation-selected 0.95 precision target it selects 1,340 candidates at empirical test precision 0.939, compared with 293 for raw agreement. This corresponds to candidate-level recall 0.592 and corpus-level recall 0.523 against a within-panel row-label ceiling of 0.883. The main benefit is not recovering entities missed by every panel member, but reshaping a noisy panel stream into a higher-yield review queue. Under entity-type shift, thresholds require target-domain validation, and exact character localization remains a separate deterministic post-processing step.
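The validation-selected operating point described above can be illustrated with a minimal sketch. It assumes each candidate already has a scorer output in [0, 1] and a binary correctness label on the validation split; the function names `select_threshold` and `precision_at_threshold` are hypothetical and are not taken from the paper. The sketch picks the lowest score threshold whose validation precision meets the target, then reports the selection count and empirical precision on a held-out split. Because the threshold is fixed on validation data, test precision can land below the target, as with the reported 0.939 against a 0.95 target.

```python
from typing import Optional, Sequence, Tuple


def select_threshold(
    scores: Sequence[float],
    labels: Sequence[int],
    target_precision: float,
) -> Optional[float]:
    """Return the lowest threshold whose validation precision meets the target.

    Hypothetical helper: candidates with score >= threshold are selected.
    Returns None if no threshold reaches the target precision.
    """
    # Walk candidates from highest to lowest score.
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    best: Optional[float] = None
    true_positives = 0
    i = 0
    n = len(pairs)
    while i < n:
        # Consume all candidates tied at this score so that the
        # threshold never splits a group of equal scores.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            true_positives += pairs[j][1]
            j += 1
        precision = true_positives / j
        if precision >= target_precision:
            # A lower threshold selects more candidates, so keep the
            # last (lowest) score that still meets the target.
            best = pairs[i][0]
        i = j
    return best


def precision_at_threshold(
    scores: Sequence[float],
    labels: Sequence[int],
    threshold: float,
) -> Tuple[int, float]:
    """Return (number selected, empirical precision) at a fixed threshold."""
    selected = [y for s, y in zip(scores, labels) if s >= threshold]
    if not selected:
        return 0, 0.0
    return len(selected), sum(selected) / len(selected)


if __name__ == "__main__":
    # Toy data for illustration only; not from the benchmark.
    val_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
    val_labels = [1, 1, 0, 1, 0]
    threshold = select_threshold(val_scores, val_labels, target_precision=0.75)

    test_scores = [0.95, 0.65, 0.55, 0.3]
    test_labels = [1, 0, 1, 1]
    print(threshold, precision_at_threshold(test_scores, test_labels, threshold))
```

In the toy run the validation target is met but the test precision falls below it, which is the same effect the abstract reports and one reason thresholds need re-validation on the target domain under entity-type shift.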

Shuheng Cao, Ruiqi Chen, Renjie Cao, Zhenhao Zhang, Siyu Zhang, Tingting Dan • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
High-precision candidate selection | document-level 60/20/20 (test) | Number of Selections | 1,340 | 9
Named Entity Recognition | document-level 60/20/20 fold (test) | Selections | 1,340 | 7
Curator Triage | BioConCal Document-level (test) | Test Precision (%) | 93.9 | 6
Candidate scoring | Biomedical document-level (test) | Test Precision (%) | 93.9 | 4
