
LLM-Forest: Ensemble Learning of LLMs with Graph-Augmented Prompts for Data Imputation

About

Missing data imputation is a critical challenge in various domains, such as healthcare and finance, where data completeness is vital for accurate analysis. Large language models (LLMs), trained on vast corpora, have shown strong potential in data generation, making them a promising tool for data imputation. However, challenges persist in designing effective prompts for a finetuning-free process and in mitigating biases and uncertainty in LLM outputs. To address these issues, we propose a novel framework, LLM-Forest, which introduces a "forest" of few-shot-prompted LLM "trees" whose outputs are aggregated via confidence-weighted voting based on LLM self-assessment, inspired by ensemble learning (Random Forest). The framework builds on a new concept of bipartite information graphs, used to identify high-quality, relevant neighboring entries at both feature and value granularity. Extensive experiments on 9 real-world datasets demonstrate the effectiveness and efficiency of LLM-Forest.
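The confidence-weighted voting step described above can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the function name, input format, and the sample values are hypothetical, assuming each LLM "tree" returns a candidate imputation value together with a self-assessed confidence score.

```python
from collections import defaultdict

def confidence_weighted_vote(tree_outputs):
    """Aggregate imputation candidates from several LLM 'trees'.

    tree_outputs: list of (value, confidence) pairs, one per tree,
    where confidence is the tree's self-assessed score in [0, 1].
    Returns the candidate whose summed confidence is highest.
    """
    scores = defaultdict(float)
    for value, confidence in tree_outputs:
        scores[value] += confidence
    # The winning candidate is the one with the largest accumulated weight.
    return max(scores, key=scores.get)

# Hypothetical example: three trees impute a missing categorical field.
votes = [("high", 0.9), ("normal", 0.6), ("high", 0.4)]
print(confidence_weighted_vote(votes))  # "high" wins with total weight 1.3
```

Note that, unlike a plain majority vote, a single high-confidence tree can outweigh two low-confidence ones, which is the point of weighting by self-assessment.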

Xinrui He, Yikun Ban, Jiaru Zou, Tianxin Wei, Curtiss B. Cook, Jingrui He • 2024

Related benchmarks

Task            | Dataset                | Metric   | Result | Rank
----------------|------------------------|----------|--------|-----
Data Imputation | NPHA                   | Accuracy | 66.35  | 30
Data Imputation | Gliomas                | Accuracy | 84.41  | 30
Data Imputation | Cancer                 | Accuracy | 73.51  | 28
Data Imputation | Diabetes (1/3 omitted) | Accuracy | 63.21  | 16
Data Imputation | Diabetes               | Accuracy | 63.18  | 14
Data Imputation | Concrete               | MAE      | 0.1036 | 14
Data Imputation | Yacht                  | MAE      | 0.1478 | 14
Data Imputation | Wine                   | MAE      | 0.0768 | 14
Data Imputation | Housing                | MAE      | 0.1026 | 14
Data Imputation | Credit-g               | Accuracy | 54.46  | 13

Showing 10 of 14 rows.
