
Mitigating Structural Overfitting: A Distribution-Aware Rectification Framework for Missing Feature Imputation

About

Incomplete node features are ubiquitous in real-world scenarios such as user profiling and cold-start recommendation, and they severely hinder the practical deployment of graph learning systems (e.g., GNNs). Existing solutions typically rely on diffusion-based structural smoothing (e.g., feature propagation) to impute missing values. However, we find that these approaches suffer from structural overfitting, leading to three progressive challenges: 1) performance degradation on disjoint graphs, 2) loss of semantic diversity due to over-smoothing, and 3) feature distribution shift when generalizing to unseen graph structures (inductive tasks). To address these challenges, we introduce the DART framework. It begins by employing Global Structural Augmentation (GSA), which establishes global correlations to bridge disjoint components and extend diffusion coverage. Building upon this, we design a semantic rectifier based on masked autoencoding. This module learns the latent feature manifold to recover natural semantic details. Crucially, we introduce a test-time distribution rectification mechanism that projects structurally biased features back onto the learned manifold during inference, effectively bridging the inductive distribution gap. Furthermore, considering that synthetic masking fails to reflect real-world sparsity, we present a new dataset, Sailing, collected from voyage records with naturally missing attributes. Extensive experiments on six public datasets and Sailing demonstrate that DART significantly outperforms state-of-the-art methods in both transductive and inductive settings. Our code and dataset are available at https://github.com/yfsong00/DART.
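
For readers unfamiliar with the diffusion-based imputation the abstract refers to, the following is a minimal sketch of generic feature propagation in plain Python. It is not the DART implementation; the function name `feature_propagation`, the adjacency-list input format, and the toy graph are all assumptions made for illustration. It shows the first challenge listed above: a component with no observed features receives nothing from diffusion.

```python
import math


def feature_propagation(adj, features, known, num_iters=40):
    """Generic feature propagation (illustrative sketch, not DART itself).

    adj:      list of neighbor lists for an undirected graph
    features: list of per-node feature lists (values at missing slots are ignored)
    known:    list of per-node boolean lists, True where the feature is observed
    """
    n = len(adj)
    dim = len(features[0])
    deg = [len(nbrs) for nbrs in adj]

    # Missing entries start at zero; observed entries keep their values.
    x = [[features[i][d] if known[i][d] else 0.0 for d in range(dim)]
         for i in range(n)]

    for _ in range(num_iters):
        new_x = [[0.0] * dim for _ in range(n)]
        for i in range(n):
            for j in adj[i]:
                # Symmetrically normalized adjacency weight 1 / sqrt(d_i * d_j).
                w = 1.0 / math.sqrt(deg[i] * deg[j])
                for d in range(dim):
                    new_x[i][d] += w * x[j][d]
        # Clamp observed entries back to their original values after each step.
        for i in range(n):
            for d in range(dim):
                if known[i][d]:
                    new_x[i][d] = features[i][d]
        x = new_x
    return x


# Hypothetical toy graph with two disjoint components:
# a path 0 - 1 - 2 (only node 0 observed) and an edge 3 - 4 (nothing observed).
adj = [[1], [0, 2], [1], [4], [3]]
features = [[1.0], [0.0], [0.0], [0.0], [0.0]]
known = [[True], [False], [False], [False], [False]]

imputed = feature_propagation(adj, features, known)
for node, row in enumerate(imputed):
    print(node, row)
```

In this toy example, nodes 1 and 2 receive nonzero values diffused from node 0, while nodes 3 and 4 stay at exactly zero because their component contains no observed feature. That gap is what a global augmentation step such as GSA is described as bridging.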

Yifan Song, Fenglin Yu, Yihong Luo, Xingjian Tao, Siya Qiu, Kai Han, Jing Tang • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Node Classification | Reddit (test) | Accuracy 93.95 | 137 |
| Node Classification | CiteSeer Uniform Missing (test) | Accuracy 62.9 | 16 |
| Inductive Node Classification | Flickr (test) | Accuracy 51.97 | 14 |
| Node Classification | PubMed Structural Missing (test) | Accuracy 77.8 | 14 |
| Node Classification | PubMed Uniform Missing (test) | Accuracy 78.56 | 14 |
| Node Classification | OGBN-Arxiv uniform missing (test) | Accuracy 69.54 | 13 |
| Node Classification | OGBN-Arxiv structural missing (test) | Accuracy 68.15 | 13 |
| Transductive Node Classification | Cora Uniform missing features (test) | Accuracy 79.62 | 8 |
| Transductive Node Classification | Cora Structural missing features (test) | Accuracy 77.44 | 8 |
| Transductive Node Classification | Citeseer Structural missing features (test) | Accuracy 60.04 | 8 |
Showing 10 of 21 rows
