SGFormer: Simplifying and Empowering Transformers for Large-Graph Representations
About
Learning representations on large graphs is a long-standing challenge because of the interdependence among massive numbers of data points. Transformers, as an emerging class of foundation encoders for graph-structured data, have shown promising performance on small graphs owing to their global attention, which can capture all-pair influence beyond neighboring nodes. Even so, existing approaches tend to inherit the spirit of Transformers in language and vision tasks, and embrace complicated models by stacking deep multi-head attention layers. In this paper, we demonstrate that even a single attention layer can deliver surprisingly competitive performance across node property prediction benchmarks where the number of nodes ranges from the thousand level to the billion level. This encourages us to rethink the design philosophy for Transformers on large graphs, where global attention is a computational overhead that hinders scalability. We frame the proposed scheme as Simplified Graph Transformers (SGFormer), which is empowered by a simple attention model that can efficiently propagate information among arbitrary nodes in one layer. SGFormer requires no positional encodings, feature/graph pre-processing, or augmented loss. Empirically, SGFormer successfully scales to the web-scale graph ogbn-papers100M and yields up to 141x inference acceleration over SOTA Transformers on medium-sized graphs. Beyond the current results, we believe the proposed methodology alone opens a new technical path of independent interest for building Transformers on large graphs.
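To make the scalability argument concrete, the following is a minimal sketch of how a single layer of all-pair attention can avoid the N x N score matrix by reordering the matrix products. It is a generic linear-attention illustration, not necessarily the exact SGFormer formulation, which the abstract does not spell out. The function names, the Frobenius normalization of the queries and keys, and the identity (self-loop) term are all assumptions made for this example.

```python
import math

# Illustrative sketch only: generic single-layer linear attention.
# All names and the normalization scheme are assumptions, not the paper's code.

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frobenius_normalize(m):
    """Scale a matrix to unit Frobenius norm (keeps the denominators below positive for N > 1)."""
    norm = math.sqrt(sum(x * x for row in m for x in row))
    return [[x / norm for x in row] for row in m]

def quadratic_attention(q, k, v):
    """Reference version: builds the full N x N score matrix, O(N^2 d)."""
    n = len(q)
    scores = matmul(q, transpose(k))
    out = []
    for i in range(n):
        # weight of node j for node i: self-loop plus averaged dot-product score
        w = [scores[i][j] / n + (1.0 if i == j else 0.0) for j in range(n)]
        denom = sum(w)
        out.append([sum(w[j] * v[j][c] for j in range(n)) / denom
                    for c in range(len(v[0]))])
    return out

def linear_attention(q, k, v):
    """Same result via Q (K^T V): O(N d^2), no N x N matrix is ever formed."""
    n = len(q)
    kv = matmul(transpose(k), v)           # d x d_v summary of all nodes
    k_sum = [sum(col) for col in zip(*k)]  # K^T 1, length d
    qkv = matmul(q, kv)                    # N x d_v
    out = []
    for i in range(n):
        denom = 1.0 + sum(a * b for a, b in zip(q[i], k_sum)) / n
        out.append([(v[i][c] + qkv[i][c] / n) / denom for c in range(len(v[0]))])
    return out
```

Both functions return the same node representations; only the second has a cost that grows linearly with the number of nodes, which is the property that matters when graphs reach millions of nodes.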
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Node Classification | Cora | Accuracy | 84.82 | 885 |
| Node Classification | Pubmed | Accuracy | 90.37 | 742 |
| Node Classification | Chameleon | Accuracy | 45.21 | 549 |
| Node Classification | Squirrel | Accuracy | 42.65 | 500 |
| Node Classification | ogbn-arxiv (test) | Accuracy | 72.63 | 382 |
| Node Classification | Pubmed | Accuracy | 89.31 | 307 |
| Node Classification | Citeseer | Accuracy | 77.24 | 275 |
| Node Classification | Actor | Accuracy | 37.9 | 237 |
| Graph Regression | ZINC (test) | MAE | 0.306 | 204 |
| Node Classification | wikiCS | Accuracy | 80.05 | 198 |