Mitigating Structural Overfitting: A Distribution-Aware Rectification Framework for Missing Feature Imputation

About

Incomplete node features are ubiquitous in real-world scenarios such as user profiling and cold-start recommendation, and they severely hinder the practical deployment of graph learning systems (e.g., GNNs). Existing solutions typically rely on diffusion-based structural smoothing (e.g., feature propagation) to impute missing values. However, we find that these approaches suffer from structural overfitting, leading to three progressive challenges: 1) performance degradation on disjoint graphs, 2) loss of semantic diversity due to over-smoothing, and 3) feature distribution shift when generalizing to unseen graph structures (inductive tasks). To address these challenges, we introduce the DART framework. It begins by employing Global Structural Augmentation (GSA), which establishes global correlations to bridge disjoint components and extend diffusion coverage. Building upon this, we design a semantic rectifier based on masked autoencoding. This module learns the latent feature manifold to recover natural semantic details. Crucially, we introduce a test-time distribution rectification mechanism that projects structurally biased features back onto the learned manifold during inference, bridging the inductive distribution gap. Furthermore, considering that synthetic masking fails to reflect real-world sparsity, we present a new dataset, Sailing, collected from voyage records with naturally missing attributes. Extensive experiments on six public datasets and Sailing demonstrate that DART significantly outperforms state-of-the-art methods in both transductive and inductive settings. Our code and dataset are available at https://github.com/yfsong00/DART.
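To make the first challenge concrete, the following is a minimal sketch of the diffusion-style baseline the abstract refers to (feature propagation), not of DART itself. It assumes a dense NumPy adjacency matrix and plain row-normalized neighbor averaging, which is a simplification. The function name `feature_propagation` and its parameters are hypothetical and are not taken from the DART repository.

```python
import numpy as np


def feature_propagation(adj, x, known_mask, num_iters=40):
    """Impute missing features by repeated neighbor averaging.

    Illustrative baseline only (an assumption, not the DART method).
    adj:        (n, n) symmetric 0/1 adjacency matrix
    x:          (n, d) features; unknown entries may hold any value, e.g. NaN
    known_mask: (n, d) boolean array, True where a feature is observed
    """
    deg = adj.sum(axis=1)
    # Row-normalize (D^-1 A); isolated nodes keep an all-zero row.
    inv_deg = np.zeros_like(deg, dtype=float)
    inv_deg[deg > 0] = 1.0 / deg[deg > 0]
    norm_adj = inv_deg[:, None] * adj

    # Start unknown entries at zero, then diffuse.
    out = np.where(known_mask, x, 0.0)
    for _ in range(num_iters):
        out = norm_adj @ out
        # Clamp observed entries back to their true values after each step.
        out = np.where(known_mask, x, out)
    return out


# Toy graph with two components: a path 0-1-2 and an edge 3-4.
adj = np.zeros((5, 5))
for i, j in [(0, 1), (1, 2), (3, 4)]:
    adj[i, j] = adj[j, i] = 1.0

# Nodes 0 and 2 are observed; nodes 1, 3 and 4 are missing.
x = np.array([[1.0], [np.nan], [3.0], [np.nan], [np.nan]])
known_mask = ~np.isnan(x)

imputed = feature_propagation(adj, x, known_mask)
print(imputed.ravel())
```

In this toy graph, node 1 receives the average of its two observed neighbors, while nodes 3 and 4 stay at zero because their component contains no observed feature to diffuse. This is the disjoint-graph failure mode that the abstract says GSA is meant to address by adding global correlations.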

Yifan Song, Fenglin Yu, Yihong Luo, Xingjian Tao, Siya Qiu, Kai Han, Jing Tang • 2025

Related benchmarks

Task	Dataset	Result
Node Classification	Reddit (test)	Accuracy 93.95	201
Node Classification	CiteSeer Uniform Missing (test)	Accuracy 62.9	16
Inductive Node Classification	Flickr (test)	Accuracy 51.97	14
Node Classification	PubMed Structural Missing (test)	Accuracy 77.8	14
Node Classification	PubMed Uniform Missing (test)	Accuracy 78.56	14
Node Classification	OGBN-Arxiv uniform missing (test)	Accuracy 69.54	13
Node Classification	OGBN-Arxiv structural missing (test)	Accuracy 68.15	13
Transductive Node Classification	Cora Uniform missing features (test)	Accuracy 79.62	8
Transductive Node Classification	Cora Structural missing features (test)	Accuracy 77.44	8
Transductive Node Classification	Citeseer Structural missing features (test)	Accuracy 60.04	8

Showing 10 of 21 rows
