
Graph Tokenization for Bridging Graphs and Transformers

About

The success of large pretrained Transformers is closely tied to tokenizers, which convert raw input into discrete symbols. Extending these models to graph-structured data remains a significant challenge. In this work, we introduce a graph tokenization framework that generates sequential representations of graphs by combining reversible graph serialization, which preserves graph information, with Byte Pair Encoding (BPE), a widely adopted tokenizer in large language models (LLMs). To better capture structural information, the graph serialization process is guided by global statistics of graph substructures, ensuring that frequently occurring substructures appear more often in the sequence and can be merged by BPE into meaningful tokens. Empirical results demonstrate that the proposed tokenizer enables Transformers such as BERT to be directly applied to graph benchmarks without architectural modifications. The proposed approach achieves state-of-the-art results on 14 benchmark datasets and frequently outperforms both graph neural networks and specialized graph transformers. This work bridges the gap between graph-structured data and the ecosystem of sequence models. Our code is available at https://github.com/BUPT-GAMMA/Graph-Tokenization-for-Bridging-Graphs-and-Transformers.
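The core idea can be sketched in a few lines: serialize a graph into a token sequence, then let BPE merge frequently co-occurring adjacent tokens into larger units. The sketch below is illustrative only, not the authors' implementation: it serializes edges as degree-label tokens (the paper instead uses a reversible, substructure-statistics-guided serialization), and the `serialize` and `bpe` names are hypothetical.

```python
from collections import Counter

def serialize(edges):
    """Serialize a graph edge-by-edge into degree-label tokens.
    Illustrative stand-in for the paper's reversible serialization:
    it only shows how recurring structure yields repeated patterns."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    # Each edge contributes the degree labels of its two endpoints.
    return [f"d{deg[x]}" for uv in edges for x in uv]

def bpe(seq, num_merges):
    """Greedy BPE: repeatedly merge the most frequent adjacent pair."""
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), n = pairs.most_common(1)[0]
        if n < 2:  # no pair repeats; nothing worth merging
            break
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                out.append(a + "_" + b)  # merged token
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq

# A triangle: every node has degree 2, so the serialized sequence
# repeats the same token and BPE merges it into a compound token.
triangle = [(0, 1), (1, 2), (2, 0)]
tokens = serialize(triangle)   # ['d2', 'd2', 'd2', 'd2', 'd2', 'd2']
print(bpe(tokens, 1))          # ['d2_d2', 'd2_d2', 'd2_d2']
```

In the paper's framework, the serialization is chosen so that frequent substructures surface as exactly such repeated runs, which BPE then compresses into vocabulary entries a standard Transformer can consume.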

Zeyuan Guo, Enmao Diao, Cheng Yang, Chuan Shi • 2026

Related benchmarks

Task                  Dataset        Result          Rank
Graph Classification  DD             Accuracy 79.6   273
Graph Regression      ZINC           MAE 0.131       105
Graph Classification  MolHIV         ROC AUC 87.4    88
Graph Classification  Peptides func  AP 73.1         41
Classification        colors3        Accuracy 100    10
Classification        Twitter        Accuracy 65.7   10
Classification        PROTEINS       Accuracy 79.1   10
Graph Classification  COIL-DEL       Accuracy 89.6   7
