SciCustom: A Framework for Custom Evaluation of Scientific Capabilities in Large Language Models
About
Large language models (LLMs) are increasingly applied to scientific research, yet existing evaluations often fail to reflect the fine-grained capabilities required in practice. Most benchmarks are manually curated or domain-generic, which limits their scalability and their alignment with real scientific use cases. In this paper, we propose SciCustom, a framework that addresses this gap by enabling the custom construction of benchmarks from large-scale scientific data to evaluate application-specific scientific capabilities in LLMs. SciCustom first organizes scientific knowledge into ontology-grounded knowledge units with controlled granularity and trains a tagger to map large-scale data instances into this knowledge space. Given a custom requirement, the relevant knowledge units are identified via voting-based multi-model consensus. These units enable relevance-aware benchmark retrieval via binary search, followed by proxy subset selection and data-grounded benchmark generation for efficient evaluation. Experiments in chemistry and healthcare demonstrate that SciCustom reveals fine-grained differences in the scientific capabilities of LLMs that standard benchmarks overlook, while requiring neither expert annotation nor synthetic question generation. This work provides a scalable, application-aware foundation for benchmarking scientific capabilities in LLMs. The source code is available at https://github.com/yjwtheonly/SciCustom.
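The voting-based consensus step can be illustrated with a minimal sketch. It assumes that each model returns a set of knowledge-unit identifiers for a given requirement and that a unit is kept when at least `min_votes` models propose it. The function name `consensus_units`, the unit names, and the threshold are hypothetical illustrations, not taken from the SciCustom code.

```python
from collections import Counter


def consensus_units(proposals: list[set[str]], min_votes: int) -> set[str]:
    """Keep knowledge units proposed by at least `min_votes` models.

    Hypothetical sketch: each element of `proposals` is the set of
    knowledge-unit IDs one model judged relevant to the requirement.
    """
    votes: Counter[str] = Counter()
    for units in proposals:
        votes.update(set(units))  # at most one vote per model per unit
    return {unit for unit, count in votes.items() if count >= min_votes}


# Hypothetical outputs of three LLMs asked which knowledge units a
# custom requirement (e.g. "evaluate reaction-yield reasoning") touches.
model_proposals = [
    {"reaction_kinetics", "stoichiometry", "catalysis"},
    {"reaction_kinetics", "stoichiometry", "thermodynamics"},
    {"reaction_kinetics", "catalysis"},
]

# Majority vote: a unit needs support from at least 2 of the 3 models.
selected = consensus_units(model_proposals, min_votes=2)
print(sorted(selected))  # ['catalysis', 'reaction_kinetics', 'stoichiometry']
```

Raising `min_votes` trades recall for precision: with `min_votes=3` only `reaction_kinetics`, the unit every model agrees on, survives.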
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Analytical chemistry | ChemBench Analytical Chemistry | Spearman correlation | 0.86 | 8 |
| Inorganic chemistry | ChemBench Inorganic Chemistry | Spearman correlation | 0.67 | 8 |
| Material science | ChemBench Material Science | Spearman correlation | 0.42 | 8 |
| Organic chemistry | ChemBench Organic Chemistry | Spearman correlation | 0.89 | 8 |
| Physical chemistry | ChemBench Physical Chemistry | Spearman correlation | 0.74 | 8 |
| Ranking Consistency Analysis | MMLU-Pro Health (Virology) | Spearman correlation | 0.55 | 8 |
| Ranking Consistency Analysis | MMLU-Pro Health (Medical Genetics) | Spearman correlation | 0.42 | 8 |
| Ranking Consistency Analysis | MMLU-Pro Health (Anatomy) | Spearman correlation | 0.62 | 8 |
| Ranking Consistency Analysis | MMLU-Pro Health (Nutrition) | Spearman correlation | 0.78 | 8 |
| Technical chemistry | ChemBench Technical Chemistry | Spearman correlation | 0.86 | 8 |
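All results above are Spearman rank correlations. Assuming each value compares how one set of models is ranked by a SciCustom-built benchmark and by the reference dataset (an assumption; the page does not spell out the comparison), the sketch below computes Spearman's rho as the Pearson correlation of average ranks, so tied scores share their mean rank. The model scores are made-up illustration data, not results from the paper. `scipy.stats.spearmanr` computes the same statistic.

```python
from statistics import mean


def average_ranks(values: list[float]) -> list[float]:
    """Rank values (1 = smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of values tied with values[order[i]].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # sorted positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman's rho: Pearson correlation of the average ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with >= 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


# Hypothetical accuracies of five models on a custom benchmark and on
# a reference benchmark (illustration only).
custom = [0.71, 0.64, 0.58, 0.80, 0.49]
reference = [0.68, 0.66, 0.52, 0.75, 0.55]

rho = spearman(custom, reference)
print(round(rho, 3))  # 0.9
```

Without ties this reduces to $\rho = 1 - \frac{6\sum_i d_i^2}{n(n^2-1)}$, where $d_i$ is the rank difference for model $i$: here the ranks differ only for two models (by one place each), so $\sum_i d_i^2 = 2$ and $\rho = 1 - 12/120 = 0.9$.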