For-Value: Efficient Forward-Only Data Valuation for finetuning LLMs and VLMs
About
Data valuation is essential for enhancing the transparency and accountability of large language models (LLMs) and vision-language models (VLMs). However, existing methods typically rely on gradient computations, which makes them computationally prohibitive for billion-parameter models and precludes batch parallelization. In this work, we introduce For-Value, a forward-only data valuation framework that enables efficient, batch-scalable value estimation while maintaining effectiveness. Leveraging the expressive power of pretrained LLMs/VLMs, we show theoretically that data value can be captured by the alignment between the final hidden representations and the prediction errors at the last layer. Building on this insight, For-Value computes data value using a simple closed-form expression with a single forward pass, eliminating the need for costly backpropagation and enabling efficient batched computation at scale. Extensive experiments show that For-Value matches or outperforms gradient-based baselines in detecting influential data and mislabeled data, while achieving significant efficiency improvements.
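To make the "alignment between final hidden representations and prediction errors" idea concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the exact closed form is not given in this summary, so the score below (hidden-state similarity weighted by prediction-error similarity, summed over token pairs) is an assumption, as are the function names `prediction_error` and `forward_only_score` and the random stand-in for a model's output projection `W`.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis (vocabulary).
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def prediction_error(logits, targets):
    # One-hot target minus predicted distribution, per token: shape (T, V).
    probs = softmax(logits)
    onehot = np.zeros_like(probs)
    onehot[np.arange(len(targets)), targets] = 1.0
    return onehot - probs

def forward_only_score(h_train, err_train, h_val, err_val):
    # Assumed scoring rule (hypothetical): for every (train token, val token)
    # pair, multiply hidden-state similarity by prediction-error similarity,
    # then sum. Only forward-pass quantities are needed.
    hidden_sim = h_train @ h_val.T     # (T_train, T_val)
    error_sim = err_train @ err_val.T  # (T_train, T_val)
    return float((hidden_sim * error_sim).sum())

# Toy data standing in for one training and one validation sample.
rng = np.random.default_rng(0)
T_train, T_val, d, V = 5, 4, 8, 11
h_train = rng.normal(size=(T_train, d))  # final hidden states, train sample
h_val = rng.normal(size=(T_val, d))      # final hidden states, val sample
W = rng.normal(size=(d, V))              # hypothetical output projection

err_train = prediction_error(h_train @ W, rng.integers(0, V, size=T_train))
err_val = prediction_error(h_val @ W, rng.integers(0, V, size=T_val))

score = forward_only_score(h_train, err_train, h_val, err_val)
print(score)
```

Under this assumed form, the score equals the Frobenius inner product of `h_train.T @ err_train` and `h_val.T @ err_val`, which are (up to sign) the cross-entropy gradients with respect to the output projection. That is one way to see how a forward pass alone could stand in for a gradient-based influence score, and why the computation batches naturally as matrix products.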
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image-to-text style generation | Image-to-text style generation (VLM) | AUC 97.4 | 10 |
| Image-to-text subject generation | Image-to-text subject generation (VLM) | AUC 99.5 | 10 |
| Influential data identification | Math Problem w/o reasoning | AUC 100 | 10 |
| Influential data identification | Math Problem w/ reasoning | AUC 100 | 10 |
| Medical Visual Question Answering | PMC-Reasoning | MMMU 54.12 | 10 |
| Influential data identification | Sentence transformations | AUC 100 | 10 |
| Mislabeled Data Detection | Mislabeled Data Detection VLM | AUC 99.5 | 10 |
| Mathematical Reasoning | GSM8K | Accuracy 48.3 | 9 |
| Medical Question Answering | MedQA, MedMCQA, PubMedQA, MMLU-Pro-med, GPQA-med held-out (test) | Accuracy (MedQA) 57.61 | 9 |
| High-quality Data Detection | Noise-Huatuo-Complex-CoT | Detection Accuracy 84.4 | 4 |