SigGate-GT: Taming Over-Smoothing in Graph Transformers via Sigmoid-Gated Attention
About
Graph transformers achieve strong results on molecular and long-range reasoning tasks, yet remain hampered by over-smoothing (the progressive collapse of node representations with depth) and attention entropy degeneration. We observe that these pathologies share a root cause with attention sinks in large language models: softmax attention's sum-to-one constraint forces every node to attend somewhere, even when no informative signal exists. Motivated by recent findings that element-wise sigmoid gating eliminates attention sinks in large language models, we propose SigGate-GT, a graph transformer that applies learned, per-head sigmoid gates to the attention output within the GraphGPS framework. Each gate can suppress activations toward zero, enabling heads to selectively silence uninformative connections. On five standard benchmarks, SigGate-GT matches the prior best on ZINC (0.059 MAE) and sets a new state of the art on ogbg-molhiv (82.47% ROC-AUC), with statistically significant gains over GraphGPS on all five datasets ($p < 0.05$). Ablations show that gating reduces over-smoothing by 30% (mean relative MAD gain across depths of 4 to 16 layers), increases attention entropy, and stabilizes training across a $10\times$ learning rate range, with about 1% parameter overhead on OGB.
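To make the mechanism concrete, the following is a minimal sketch of one softmax attention head followed by an element-wise sigmoid gate on its output. It is not the authors' implementation: the abstract does not say how the gate is computed, so the sketch assumes the gate is a linear function of the query node's features, and the names `gated_attention_head`, `w_gate`, and `b_gate` are hypothetical. For brevity it also uses the raw node features as queries, keys, and values, where a real head would use learned projections.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gated_attention_head(x, w_gate, b_gate):
    """One global attention head with a per-channel sigmoid gate on its output.

    x:      list of node feature vectors (used as queries, keys and values)
    w_gate: hypothetical gate weights, one row per output channel (assumption)
    b_gate: hypothetical gate bias, one entry per output channel (assumption)
    """
    d = len(x[0])
    outputs = []
    for xi in x:
        # Softmax weights sum to one, so node i is forced to attend somewhere.
        weights = softmax([dot(xi, xj) / math.sqrt(d) for xj in x])
        attn = [sum(w * xj[c] for w, xj in zip(weights, x)) for c in range(d)]
        # Each gate value lies in (0, 1); values near 0 silence the channel.
        gate = [sigmoid(dot(row, xi) + b) for row, b in zip(w_gate, b_gate)]
        outputs.append([g * a for g, a in zip(gate, attn)])
    return outputs


# Three nodes with two features each; zero gate weights isolate the bias.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_zero = [[0.0, 0.0], [0.0, 0.0]]
open_out = gated_attention_head(x, w_zero, [20.0, 20.0])      # gates near 1
closed_out = gated_attention_head(x, w_zero, [-20.0, -20.0])  # gates near 0
```

With a strongly negative gate bias the head's output is driven toward zero even though the softmax weights inside it still sum to one, which is the escape route from the sum-to-one constraint that the abstract describes. With a strongly positive bias the head behaves like ordinary softmax attention.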
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Graph Regression | Peptides struct LRGB (test) | MAE | 0.2431 | 238 |
| Graph Classification | ogbg-molpcba (test) | AP | 29.84 | 212 |
| Graph Regression | ZINC 12K (test) | MAE | 0.059 | 173 |
| Graph-level classification | OGBG-MOLHIV (test) | AUROC | 82.47 | 29 |
| Functional annotation | LRGB Peptides-func (test) | AP | 69.47 | 6 |