
Model Performance-Guided Evaluation Data Selection for Effective Prompt Optimization

About

Optimizing Large Language Model (LLM) performance requires well-crafted prompts, but manual prompt engineering is labor-intensive and often ineffective. Automated prompt optimization techniques address this challenge, but most of them rely on randomly selected evaluation subsets, which fail to represent the full dataset and therefore lead to unreliable evaluations and suboptimal prompts. Existing coreset selection methods, designed for LLM benchmarking, are unsuitable for prompt optimization because of challenges in clustering similar samples, high data collection costs, and the unavailability of performance data for new or private datasets. To overcome these issues, we propose IPOMP, an Iterative evaluation data selection approach for effective Prompt Optimization using real-time Model Performance. IPOMP is a two-stage approach: it first selects representative and diverse samples using semantic clustering and boundary analysis, then iteratively refines the selection with real-time model performance data to replace redundant samples. Evaluations on the BIG-bench dataset show that IPOMP improves effectiveness by 1.6% to 5.3% and stability by at least 57% compared with SOTA baselines, with computational overhead below 1%. Furthermore, the results demonstrate that the real-time performance-guided refinement approach can be applied to enhance existing coreset selection methods.
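The two-stage idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the clustering algorithm, the boundary analysis, or the redundancy criterion. The sketch assumes plain k-means for semantic clustering, a greedy farthest-point pick as a stand-in for boundary analysis, and Pearson correlation between per-prompt scores as the redundancy test. All function names (`kmeans`, `select_initial`, `refine`, `pearson`) and the `threshold` value are hypothetical.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means (assumed clustering method); returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]

    def assign():
        return [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]

    for _ in range(iters):
        labels = assign()
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign(), centroids


def select_initial(embeddings, n_select, k, seed=0):
    """Stage 1 (assumed): one representative per cluster, then far-away samples."""
    n_select = min(n_select, len(embeddings))
    labels, centroids = kmeans(embeddings, k, seed=seed)
    chosen = []
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            chosen.append(min(members, key=lambda i: dist(embeddings[i], centroids[c])))
    # Stand-in for boundary analysis: add the sample farthest from the current picks.
    while len(chosen) < n_select:
        rest = [i for i in range(len(embeddings)) if i not in chosen]
        chosen.append(max(rest, key=lambda i: min(dist(embeddings[i], embeddings[j])
                                                  for j in chosen)))
    return chosen[:n_select]


def pearson(u, v):
    """Pearson correlation; constant vectors count as correlated only if equal."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    su = math.sqrt(sum((x - mu) ** 2 for x in u))
    sv = math.sqrt(sum((y - mv) ** 2 for y in v))
    if su == 0 or sv == 0:
        return 1.0 if list(u) == list(v) else 0.0
    return sum((x - mu) * (y - mv) for x, y in zip(u, v)) / (su * sv)


def refine(selected, pool, scores, threshold=0.9, seed=0):
    """Stage 2 (assumed): swap out samples whose observed scores duplicate another's.

    scores maps a selected sample index to its scores across the candidate prompts.
    """
    redundant = set()
    for a in range(len(selected)):
        for b in range(a + 1, len(selected)):
            if a in redundant or b in redundant:
                continue
            if pearson(scores[selected[a]], scores[selected[b]]) >= threshold:
                redundant.add(b)
    candidates = [i for i in pool if i not in selected]
    random.Random(seed).shuffle(candidates)
    result = list(selected)
    for pos in sorted(redundant):
        if candidates:
            result[pos] = candidates.pop()
    return result


# Toy usage: six 2-D "embeddings", pick four, then refine with made-up scores.
embeddings = [[0, 0], [0.1, 0], [5, 5], [5.1, 5], [10, 0], [10, 0.1]]
selected = select_initial(embeddings, n_select=4, k=2)
toy_scores = dict(zip(selected, [[1, 0, 1, 0], [1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0]]))
refined = refine(selected, range(len(embeddings)), toy_scores)
print(selected, refined)
```

In an actual prompt optimization loop, `refine` would be called once per iteration with scores gathered from the prompts evaluated so far, so the only extra cost is the correlation pass over the already selected samples.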

Ximing Dong, Shaowei Wang, Dayi Lin, Ahmed E. Hassan • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | GSM8K (test) | Accuracy | 91.8 | 900 |
| Reasoning | BBH | Accuracy | 83.3 | 672 |
| Math | GSM8K | Accuracy | 0.952 | 206 |
| Mathematics | MATH | MATH Accuracy | 91.2 | 85 |
| Math Reasoning | GSM-Hard | Accuracy | 82.6 | 67 |
| Math Reasoning | MultiArith | Accuracy | 95.8 | 65 |
| Knowledge Reasoning | MMLU | MMLU Knowledge Reasoning Accuracy | 70.1 | 65 |
| General Reasoning | BIG-bench | Accuracy (General) | 75.8 | 36 |
| Mathematical Reasoning | MATH (test) | Exact Match (EM) | 74.7 | 16 |
| Graduate-level Question Answering | GPQA Diamond (test) | Accuracy | 33.8 | 16 |
