SigGate-GT: Taming Over-Smoothing in Graph Transformers via Sigmoid-Gated Attention

About

Graph transformers achieve strong results on molecular and long-range reasoning tasks, yet remain hampered by over-smoothing (the progressive collapse of node representations with depth) and attention entropy degeneration. We observe that these pathologies share a root cause with attention sinks in large language models: softmax attention's sum-to-one constraint forces every node to attend somewhere, even when no informative signal exists. Motivated by recent findings that element-wise sigmoid gating eliminates attention sinks in large language models, we propose SigGate-GT, a graph transformer that applies learned, per-head sigmoid gates to the attention output within the GraphGPS framework. Each gate can suppress activations toward zero, enabling heads to selectively silence uninformative connections. On five standard benchmarks, SigGate-GT matches the prior best on ZINC (0.059 MAE) and sets a new state of the art on ogbg-molhiv (82.47% ROC-AUC), with statistically significant gains over GraphGPS across all five datasets ($p < 0.05$). Ablations show that gating reduces over-smoothing by 30% (mean relative MAD gain across 4-16 layers), increases attention entropy, and stabilizes training across a $10\times$ learning rate range, with about 1% parameter overhead on OGB.
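To make the mechanism concrete, the following is a minimal NumPy sketch of a single attention head whose output is multiplied element-wise by a sigmoid gate. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify how the gate is parameterized, so the input-dependent form `sigmoid(X @ Wg + bg)`, the function name `gated_attention_head`, and all shapes are hypothetical. Attention here is dense over all nodes, and the other GraphGPS components (message passing, positional encodings) are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: every row sums to one.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_attention_head(X, Wq, Wk, Wv, Wg, bg):
    """One attention head with an element-wise sigmoid gate on its output.

    X: (n_nodes, d_model); Wq, Wk, Wv, Wg: (d_model, d_head); bg: (d_head,).
    The gate parameterization is an assumption made for this sketch.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # sum-to-one attention weights
    H = A @ V                                    # ungated head output
    G = sigmoid(X @ Wg + bg)                     # gate values in (0, 1)
    return G * H, A, G

rng = np.random.default_rng(0)
n_nodes, d_model, d_head = 5, 8, 4
X = rng.normal(size=(n_nodes, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
Wg = np.zeros((d_model, d_head))  # zero weights so the bias alone sets the gate

# Gate almost fully open: the output is close to plain softmax attention.
out_open, A, _ = gated_attention_head(X, Wq, Wk, Wv, Wg, np.full(d_head, 20.0))
# Gate almost fully closed: the head is silenced even though A still sums to one.
out_closed, _, _ = gated_attention_head(X, Wq, Wk, Wv, Wg, np.full(d_head, -20.0))

print(A.sum(axis=1))             # each row of attention weights sums to 1
print(np.abs(out_closed).max())  # close to zero
```

The point of the sketch is the contrast between the two calls: the softmax weights are identical in both, and each row still sums to one, yet the closed gate drives the head's contribution toward zero. That gives the head a way to contribute nothing for a node, which softmax alone does not allow.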

Dongxin Guo, Jikun Wu, Siu Ming Yiu • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Graph Regression | Peptides struct LRGB (test) | MAE 0.2431 | 238 |
| Graph Classification | ogbg-molpcba (test) | AP 29.84 | 212 |
| Graph Regression | ZINC 12K (test) | MAE 0.059 | 173 |
| Graph-level classification | OGBG-MOLHIV (test) | AUROC 82.47 | 29 |
| Functional annotation | LRGB Peptides-func (test) | AP 69.47 | 6 |