LLMs as Zero-shot Graph Learners: Alignment of GNN Representations with LLM Token Embeddings (2024)

Duo Wang  Yuan Zuo  Fengzhi Li  Junjie Wu
MIIT Key Laboratory of Data Intelligence and Management, Beihang University
{wangduo58, zuoyuan, lifengzhi, wujj}@buaa.edu.cn
Corresponding author.

Abstract

Zero-shot graph machine learning, especially with graph neural networks (GNNs), has garnered significant interest due to the challenge of scarce labeled data. While methods like self-supervised learning and graph prompt learning have been extensively explored, they often rely on fine-tuning with task-specific labels, limiting their effectiveness in zero-shot scenarios. Inspired by the zero-shot capabilities of instruction-fine-tuned large language models (LLMs), we introduce a novel framework named Token Embedding-Aligned Graph Language Model (TEA-GLM) that leverages LLMs as cross-dataset and cross-task zero-shot learners for graph machine learning. Concretely, we pretrain a GNN, aligning its representations with token embeddings of an LLM. We then train a linear projector that transforms the GNN's representations into a fixed number of graph token embeddings without tuning the LLM. A unified instruction is designed for various graph tasks at different levels, such as node classification (node-level) and link prediction (edge-level). These design choices collectively enhance our method's effectiveness in zero-shot learning, setting it apart from existing methods. Experiments show that our graph token embeddings help the LLM predictor achieve state-of-the-art performance on unseen datasets and tasks compared to other methods using LLMs as predictors. Our code is available at https://github.com/W-rudder/TEA-GLM.

1 Introduction

Graph Neural Networks (GNNs) have emerged as a pivotal framework in graph machine learning, harnessing the ability to capture intricate message-passing patterns for robust graph representation. These advancements have yielded various GNN architectures, including the Graph Convolution Network (GCN) [20], Graph Attention Network (GAT) [31], and GraphSAGE [11]. Despite their efficacy, GNNs often exhibit limited generalization capabilities, struggling to maintain consistent performance when transitioning across different datasets or downstream tasks [19]. This limitation underscores the necessity for more adaptable and universally applicable models in the graph learning domain.

To mitigate the dependency on labeled data and enhance the resilience of graph models, self-supervised learning has been widely adopted in GNN training. Techniques such as Deep Graph Infomax (DGI) [32] and GraphCL [44] have demonstrated effectiveness by leveraging mutual information maximization and contrastive learning, respectively. However, these methods typically require fine-tuning task-specific heads for downstream applications, which can be resource-intensive and limit their practicality in diverse scenarios. Moreover, graph prompt learning enhances GNN generalization by using unified task templates and meta-learning to adapt to various downstream applications [25, 29], but it often requires extensive fine-tuning and is constrained by the specificity of task types.

In recent years, the remarkable generalization capabilities of Large Language Models (LLMs) have spurred interest in their potential applications within graph machine learning. Some methods attempt to encode graph structures into text for LLM input [3, 10, 33, 23], but these approaches often lead to suboptimal outcomes [18]. Alternatively, using LLMs as enhancers to generate data or node text representations [43, 48, 39, 4, 24] has shown promise but remains constrained by the inherent reliance on GNNs for prediction. Recent efforts [30, 2] to use LLMs as predictors have demonstrated potential. However, their performance often remains unstable due to the challenge of producing transferable graph representations that work effectively for LLMs across diverse tasks and datasets.

In light of these challenges, we propose a novel framework named Token Embedding-Aligned Graph Language Model (TEA-GLM). Inspired by the zero-shot capabilities of instruction-fine-tuned LLMs [34], TEA-GLM leverages LLMs as cross-dataset and cross-task zero-shot predictors for graph machine learning. The core idea is to pretrain a GNN and align its representations with the token embeddings of an LLM. This alignment enables the GNN to effectively utilize the LLM's pretrained knowledge, allowing it to generalize across different datasets and tasks without task-specific fine-tuning. Additionally, we train a linear projector to convert graph representations into a fixed number of token embeddings, which are then incorporated into a unified instruction designed for various graph tasks at different levels. Experiments show TEA-GLM achieves superior performance in zero-shot scenarios and when encountering unseen tasks, offering a more generalized and efficient solution for graph zero-shot learning. Our contributions are summarized as follows:

  • We introduce TEA-GLM, a novel framework that aligns GNN representations with LLM token embeddings, enabling cross-dataset and cross-task zero-shot learning for graph machine learning.

  • We propose a linear projector that maps graph representations into a fixed number of graph token embeddings. These embeddings are incorporated into a unified instruction designed for various graph tasks at different levels, enhancing the model’s generalization capabilities.

  • Our extensive experiments demonstrate that TEA-GLM significantly outperforms state-of-the-art methods on unseen datasets and tasks.

2 Methodology

In this section, we introduce TEA-GLM, a novel framework designed for cross-dataset and cross-task zero-shot graph machine learning. TEA-GLM consists of two main components: a Graph Neural Network (GNN) that derives node representations from the graph, and a Large Language Model (LLM) that performs zero-shot tasks such as node classification and link prediction. Our methodology involves two key stages: (i) enhanced self-supervised learning of the GNN, in which we propose feature-wise contrastive learning with the LLM's token embeddings, and (ii) training a linear projector that maps graph representations into a fixed number of graph token embeddings, together with an instruction design suitable for various graph tasks at different levels. The framework of our proposed method is illustrated in Fig. 1.

2.1 Notations

Formally, a graph is denoted as $\mathcal{G}=\left(\mathcal{V},\mathcal{E},\mathbf{A},\mathbf{X}\right)$, where $\mathcal{V}=\left\{v_{1},v_{2},\dots,v_{|\mathcal{V}|}\right\}$ and $\mathcal{E}=\left\{e_{1},e_{2},\dots,e_{|\mathcal{E}|}\right\}$ represent the node set and edge set, respectively, with $|\mathcal{V}|=N$ indicating the total number of nodes. The adjacency matrix is denoted as $\mathbf{A}\in\mathbb{R}^{N\times N}$, with $\mathbf{A}_{ij}=1$ iff $(v_{i},v_{j})\in\mathcal{E}$. The feature matrix $\mathbf{X}\in\mathbb{R}^{N\times F_{N}}$ contains the attribute or feature information associated with each node, where $\boldsymbol{x_{i}}\in\mathbb{R}^{F_{N}}$ is the feature vector of $v_{i}$ and $F_{N}$ is the dimensionality of the features.
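For concreteness, a toy instantiation of this notation might look as follows. The use of PyTorch Geometric's `Data` container is our assumption for illustration only; the paper does not prescribe a data format, and all sizes below are illustrative.

```python
import torch
from torch_geometric.data import Data

# N = 3 nodes with F_N = 4 features each, and a single undirected edge (v_1, v_2).
x = torch.randn(3, 4)                         # feature matrix X in R^{N x F_N}
edge_index = torch.tensor([[0, 1],
                           [1, 0]])           # edge set E, stored as both directions
graph = Data(x=x, edge_index=edge_index)      # G = (V, E, A, X); A is implied by edge_index
```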

Figure 1: The overall framework of TEA-GLM.

2.2 Token embeddings-aligned graph self-supervised learning

Given the increasing model sizes and data volumes in recent years, self-supervised learning has become a prominent research focus due to the scarcity of labeled data. In this context, we propose a contrastive learning method to obtain more transferable node representations suitable for use with large language models (LLMs). Our approach leverages instance-wise contrastive learning and introduces a feature-wise contrastive learning method that maps node representations to the textual embedding space of the LLM.

2.2.1 Instance-wise contrastive learning with structural information

To alleviate the need for labeled data and enhance model generalization, we employ self-supervised learning for pre-training. To better extract structural information from the graph, we follow the work of [52] and generate two views of $\mathcal{G}$, denoted as $\mathcal{G}_{1}$ and $\mathcal{G}_{2}$, for contrastive learning. Specifically, we adopt the Removing Edges (RE) and Masking Node Features (MF) methods to generate different views. The RE strategy samples a random masking matrix $\widetilde{\mathbf{R}}\in\{0,1\}^{N\times N}$ to mask the raw adjacency matrix, computed as:

$$\widetilde{\mathbf{A}}=\mathbf{A}\circ\widetilde{\mathbf{R}}, \tag{1}$$

where $\circ$ denotes the Hadamard product. The MF strategy samples a random mask vector $\widetilde{\boldsymbol{m}}\in\{0,1\}^{F_{N}}$. The generated node features $\widetilde{\mathbf{X}}$ are computed by:

$$\widetilde{\mathbf{X}}=\left[\boldsymbol{x_{1}}\circ\widetilde{\boldsymbol{m}};\;\boldsymbol{x_{2}}\circ\widetilde{\boldsymbol{m}};\;\ldots;\;\boldsymbol{x_{N}}\circ\widetilde{\boldsymbol{m}}\right]. \tag{2}$$
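As a minimal sketch of the two augmentations, assuming dense tensors and a single Bernoulli drop probability as the only hyperparameter (the name `drop_prob` is ours, and symmetrizing the edge mask for undirected graphs is omitted for brevity):

```python
import torch

def remove_edges(adj: torch.Tensor, drop_prob: float) -> torch.Tensor:
    """Removing Edges (RE), Eq. (1): sample a binary mask R~ and take A o R~.
    `adj` is a dense float adjacency matrix of shape (N, N)."""
    r = (torch.rand_like(adj) > drop_prob).float()   # random masking matrix R~
    return adj * r                                   # Hadamard product A o R~

def mask_features(x: torch.Tensor, drop_prob: float) -> torch.Tensor:
    """Masking Node Features (MF), Eq. (2): one mask vector m~ shared by all nodes."""
    m = (torch.rand(x.size(1), device=x.device) > drop_prob).float()
    return x * m                                     # broadcasts m~ over the N rows
```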

Thus, we obtain two views of $\mathcal{G}$, denoted as $\mathcal{G}_{1}=\left(\widetilde{\mathbf{X}}_{1},\widetilde{\mathbf{A}}_{1}\right)$ and $\mathcal{G}_{2}=\left(\widetilde{\mathbf{X}}_{2},\widetilde{\mathbf{A}}_{2}\right)$. Then, we use a graph encoder to derive node representations:

$$\mathbf{U}_{\ast}=f_{\mathrm{GNN}}\left(\widetilde{\mathbf{X}}_{\ast},\widetilde{\mathbf{A}}_{\ast}\right)\in\mathbb{R}^{N\times F_{U}}, \tag{3}$$

where $F_{U}$ is the dimension of the node representations and $\ast\in\{1,2\}$ indexes the two views of the graph.

We employ a contrastive objective to distinguish the embeddings of the same node in the two views from the embeddings of other nodes. For node $v_{i}$, its embedding generated in one view, $\boldsymbol{u_{i}}$, is treated as the anchor, while its embedding generated in the other view, $\boldsymbol{u_{i}}'$, forms the positive sample. Embeddings of other nodes in the same view are regarded as intra-view negative samples, and embeddings of other nodes in the other view as inter-view negative samples. The contrastive loss is defined as:

$$\ell\left(\boldsymbol{u_{i}},\boldsymbol{u_{i}}'\right)=\log\frac{e^{\theta\left(\boldsymbol{u_{i}},\boldsymbol{u_{i}}'\right)/\tau}}{\underbrace{e^{\theta\left(\boldsymbol{u_{i}},\boldsymbol{u_{i}}'\right)/\tau}}_{\text{the positive pair}}+\underbrace{\sum_{j=1}^{N}\mathbb{1}_{[j\neq i]}e^{\theta\left(\boldsymbol{u_{i}},\boldsymbol{u_{j}}\right)/\tau}}_{\text{intra-view negative pairs}}+\underbrace{\sum_{j=1}^{N}\mathbb{1}_{[j\neq i]}e^{\theta\left(\boldsymbol{u_{i}},\boldsymbol{u_{j}}'\right)/\tau}}_{\text{inter-view negative pairs}}}, \tag{4}$$

where $\mathbb{1}_{[j\neq i]}\in\{0,1\}$ is an indicator function that equals 1 iff $j\neq i$, $\theta(\cdot,\cdot)$ is the cosine similarity function, and $\tau$ is a temperature parameter. The loss for the other view is defined analogously, and the overall objective $\mathcal{L}_{ins}$ is the average over all instances:

$$\mathcal{L}_{ins}=\frac{1}{2N}\sum_{i=1}^{N}\left[\ell\left(\boldsymbol{u_{i}},\boldsymbol{u_{i}}'\right)+\ell\left(\boldsymbol{u_{i}}',\boldsymbol{u_{i}}\right)\right]. \tag{5}$$
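The sketch below implements Eqs. (4)-(5) with cosine similarity computed as dot products of L2-normalized embeddings; it returns the negated objective so that a standard optimizer can minimize it (a sign convention we assume, since the paper states the objective but not the optimization direction):

```python
import torch
import torch.nn.functional as F

def instance_wise_loss(u1: torch.Tensor, u2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """u1, u2: (N, F_U) embeddings of the same N nodes under the two views."""
    z1, z2 = F.normalize(u1, dim=1), F.normalize(u2, dim=1)   # dot products become cosine similarities

    def log_ratio(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        intra = torch.exp(a @ a.t() / tau)        # theta(u_i, u_j), same view
        inter = torch.exp(a @ b.t() / tau)        # theta(u_i, u_j'), other view
        pos = inter.diag()                        # the positive pair theta(u_i, u_i')
        # Denominator of Eq. (4): positive pair + intra-view (j != i) + inter-view (j != i) negatives.
        denom = pos + (intra.sum(1) - intra.diag()) + (inter.sum(1) - pos)
        return torch.log(pos / denom)             # l(u_i, u_i') for every i at once

    objective = (log_ratio(z1, z2) + log_ratio(z2, z1)).mean() / 2.0   # Eq. (5)
    return -objective                             # negate so minimizing maximizes Eq. (5)
```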

To enhance the scalability of our method for large-scale graphs, we employ the subsampling approach proposed by [11]. Both the RE and MF methods, along with the loss function described in Equation 4, are seamlessly adaptable to the sampled subgraphs.

2.2.2 Feature-wise contrastive learning with token embeddings

Instance-wise contrastive learning relies heavily on individual instances, which can cause transfer issues when transitioning to other datasets. Moreover, there is a significant gap between the obtained node representations and the semantic space of LLMs. To address these issues, we propose feature-wise contrastive learning with token embeddings.

The feature-wise contrastive loss breaks the independence between instances. For the feature matrices $\mathbf{U}_{\ast}$, we denote the columns in the two views as $\boldsymbol{m_{i}}\in\mathbf{U}_{1}^{\top}$ and $\boldsymbol{n_{i}}\in\mathbf{U}_{2}^{\top}$, where $\boldsymbol{m_{i}},\boldsymbol{n_{i}}\in\mathbb{R}^{N}$. The loss, denoted as $\mathcal{L}_{fea}$, is calculated as:

$$\mathcal{L}_{fea}=\frac{1}{F_{U}}\sum_{i=1}^{F_{U}}\log\frac{e^{\theta\left(\boldsymbol{m_{i}},\boldsymbol{n_{i}}\right)/\tau}}{\sum_{j=1}^{F_{U}}\left[e^{\theta\left(\boldsymbol{m_{i}},\boldsymbol{m_{j}}\right)/\tau}+e^{\theta\left(\boldsymbol{m_{i}},\boldsymbol{n_{j}}\right)/\tau}\right]}. \tag{6}$$
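A corresponding sketch of Eq. (6), which operates on the columns of the two view matrices (one column per representation dimension) and is again negated for minimization:

```python
import torch
import torch.nn.functional as F

def feature_wise_loss(u1: torch.Tensor, u2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """u1, u2: (N, F) view matrices; rows of their transposes are the columns m_i, n_i of Eq. (6)."""
    m = F.normalize(u1.t(), dim=1)              # (F, N): row i is column m_i
    n = F.normalize(u2.t(), dim=1)              # (F, N): row i is column n_i
    mm = torch.exp(m @ m.t() / tau)             # theta(m_i, m_j)
    mn = torch.exp(m @ n.t() / tau)             # theta(m_i, n_j)
    pos = mn.diag()                             # theta(m_i, n_i)
    denom = mm.sum(1) + mn.sum(1)               # Eq. (6) sums the denominator over all j
    return -torch.log(pos / denom).mean()       # negated average over the F feature dimensions
```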

To map node representations to the semantic space of LLMs, we use the principal components of the token embeddings of LLMs as coordinate axes. This approach ensures that the representations of similar instances are closely aligned in the textual embedding space. This helps alleviate the inconsistency in optimization objectives during graph self-supervised learning due to the gap between node representations and the text embedding space.

Specifically, we first use principal component analysis (PCA) to obtain the top $P$ principal components of the LLM's token embeddings, denoted as $\mathbf{C}\in\mathbb{R}^{P\times F_{L}}$, where $F_{L}$ is the dimension of the LLM's token embeddings. Then, we map the node representations by:

$$\widetilde{\mathbf{U}}_{\ast}=\mathbf{U}_{\ast}\times\mathbf{C}^{\top}. \tag{7}$$

To map the node representations obtained from the GNN with the principal components, we set the output dimension of the GNN equal to the dimension of the token embeddings (i.e., $F_{U}=F_{L}$). The columns of the mapped feature matrices $\widetilde{\mathbf{U}}_{\ast}$, denoted as $\widetilde{\boldsymbol{m}}_{i}$ and $\widetilde{\boldsymbol{n}}_{i}$, are fed into $\mathcal{L}_{fea}$. Therefore, the final contrastive loss for graph self-supervised learning is the average of Equation 5 and Equation 6:

$$\mathcal{L}=\frac{1}{2}\left(\mathcal{L}_{ins}+\mathcal{L}_{fea}\right). \tag{8}$$
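Putting the pieces together, here is a sketch of the PCA projection (Eq. 7) and the combined pretraining loss (Eq. 8), reusing `instance_wise_loss` and `feature_wise_loss` from the sketches above; the number of components `p` is a hyperparameter we do not fix here:

```python
import torch
from sklearn.decomposition import PCA

def token_principal_components(token_embeddings: torch.Tensor, p: int) -> torch.Tensor:
    """Top-P principal components C (P, F_L) of the LLM token embedding matrix.
    Computed once before GNN pretraining and then kept fixed."""
    pca = PCA(n_components=p).fit(token_embeddings.cpu().numpy())
    return torch.as_tensor(pca.components_, dtype=torch.float32)     # (P, F_L)

def pretraining_loss(u1: torch.Tensor, u2: torch.Tensor, c: torch.Tensor,
                     tau: float = 0.5) -> torch.Tensor:
    """Eq. (8): average of the instance-wise loss on the raw representations and the
    feature-wise loss on the representations projected onto the token space (Eq. 7)."""
    c = c.to(u1.device)
    u1_proj, u2_proj = u1 @ c.t(), u2 @ c.t()     # Eq. (7); requires F_U == F_L
    return 0.5 * (instance_wise_loss(u1, u2, tau) + feature_wise_loss(u1_proj, u2_proj, tau))
```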

Remark: The introduction of feature-wise contrastive learning with token embeddings addresses the semantic space discrepancy between graph node representations and LLM token embeddings. Our method enables the direct and simple use of the graph structural and textual information captured by the GNN within LLMs, thereby avoiding the significant generalization loss associated with complex modality-alignment training during fine-tuning. Its role in fine-tuning is further described in Sec. 2.3.2 and validated by experiments. Additionally, the feature-wise contrastive method itself exhibits stronger generalization, allowing it to perform well on unseen instances (or tasks) rather than relying on trained instances (or tasks).

2.3 Alignment tuning

The development of LLMs has introduced a new paradigm for graph machine learning. However, existing research [18] indicates that LLMs alone cannot fully comprehend graph structures and their underlying information. To enable LLMs to capture graph information more effectively and improve their performance in cross-dataset and cross-task zero-shot learning, it is essential to design specific methods for incorporating graph information into LLMs. To this end, we propose an alignment tuning method that includes specially designed instructions for graph tasks at different levels, as well as a mechanism that converts graph representations into graph token embeddings.

2.3.1 Instructions design

The instruction we design consists of two parts: one provides graph information, and the other describes the task. Here, we take a citation graph as an example, where nodes are papers and edges are citations, to introduce the instruction.

Graph information provision

The graph information part of the instruction for node-, edge-, and graph-level tasks is presented as follows: "Given the representation of a paper/two papers/a paper set: ⟨graph⟩, with the following information:\nTitle: First Paper: {title_1} …\n", where ⟨graph⟩ is the placeholder for graph inputs (see Sect. 2.3.2) and {title_1} is the node's text information.

Note that, unlike most work that uses an LLM as a predictor, our instruction uses only the title of a paper node, excluding more extensive textual information such as its abstract or description. In fact, reducing the amount of input text not only does not degrade the model's performance but actually improves it: [18] confirmed experimentally that LLMs benefit from structural information only when the target node lacks sufficient phrases for reasonable predictions. Using only titles as text input therefore helps the LLM extract more critical information from the graph. The complete instructions for node classification and link prediction on citation networks are shown in Appendix D.

Task description

To achieve cross-dataset capability, where the model is trained on one graph dataset and then performs reasoning on any other dataset, the instruction includes not only the task description itself but also the set of alternative answers. Taking node classification on the Arxiv dataset (see Sect. 3.1) as an example, the instruction is structured as follows: "Which arXiv CS sub-category does this paper belong to? Please directly give the most likely answer from the following sub-categories: {ans}", where {ans} represents the set of alternative answers, which varies across datasets. Including the alternative answers enables the model to learn to "reason the answer from a given set according to the task" rather than memorize answers for a particular dataset, thus facilitating reasoning across datasets.
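To illustrate how the two parts are assembled, here is a hypothetical instruction builder for node classification on a citation graph. The ⟨graph⟩ placeholder is later replaced by the graph token embeddings (Sect. 2.3.2); the exact wording of the released prompts is given in Appendix D, so the strings below are indicative only:

```python
def build_node_classification_instruction(title: str, answer_set: list[str]) -> str:
    """Assemble the graph-information part and the task-description part of the instruction."""
    graph_part = (
        "Given the representation of a paper: <graph>, with the following information:\n"
        f"Title: {title}\n"
    )
    task_part = (
        "Which arXiv CS sub-category does this paper belong to? "
        "Please directly give the most likely answer from the following sub-categories: "
        + ", ".join(answer_set)
    )
    return graph_part + task_part
```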

2.3.2 Graph token embeddings

The graph token embeddings mentioned previously, i.e., ⟨graph⟩, are crucial for incorporating graph information and enabling the model's generalization. We use a projector to map the central node representation into $K$ graph token embeddings and replace ⟨graph⟩ with these tokens. Note that we map the representations to a fixed number of token embeddings regardless of the task type. For node-level tasks, we map the central node representation to $K$ token embeddings; for edge-level tasks, we pool the representations of the two endpoint nodes of the target edge and map the pooled representation to $K$ token embeddings; for graph-level tasks, a similar approach can be applied. In this way, we unify the instructions for graph tasks at different levels. Thanks to the text-aligned contrastive learning, a linear projector is sufficient to capture the mapping without tuning the LLM:

$$\mathbf{H}_{token}=f_{\mathrm{Linear}}\left(\boldsymbol{u_{i}}\right), \tag{9}$$

where $\boldsymbol{u_{i}}\in\mathbf{U}$, $\mathbf{H}_{token}\in\mathbb{R}^{K\times F_{L}}$, $F_{L}$ is the dimension of the LLM's token embeddings, and $f_{\mathrm{Linear}}(\cdot)$ is a linear layer.
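A minimal sketch of Eq. (9) follows; producing the $K$ token embeddings by reshaping the output of a single linear layer is our implementation assumption, as is the mean pooling mentioned in the comment:

```python
import torch
import torch.nn as nn

class GraphTokenProjector(nn.Module):
    """Map one graph representation u_i (F_U,) to K graph token embeddings (K, F_L), Eq. (9)."""
    def __init__(self, f_u: int, f_l: int, k: int):
        super().__init__()
        self.k, self.f_l = k, f_l
        self.linear = nn.Linear(f_u, k * f_l)

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        # Node-level: u is the central node representation.
        # Edge-level: u is the pooled (e.g., mean) representation of the two endpoints.
        return self.linear(u).view(-1, self.k, self.f_l)   # (batch, K, F_L), replaces <graph>
```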

Remark: This approach offers three primary advantages: (i) When handling tasks at different levels, the changes to the instruction are minimal. This consistency facilitates the transfer of knowledge learned during training to unseen tasks in large language models (LLMs); (ii) The fixed number of token embeddings can be seen as a conditional soft prompt. Unlike traditional soft prompts, learning at the instance level reduces the risk of overfitting to specific datasets or tasks, thereby enhancing generalization to unseen datasets and tasks; (iii) Unlike current work that includes the representations of all nodes in the subgraph, we map only the central node's representation to tokens, since the GNN's message passing already aggregates sufficient neighborhood information. This is more efficient and offers greater generalizability and practicality.

2.3.3 Training and evaluation strategy

To ensure compatibility and facilitate comparisons across various datasets, we map the node features into a consistent vector space. Specifically, we employ a pretrained BERT model [8] to encode the raw text associated with each node, thereby generating the node features. We then pretrain the graph model using contrastive learning with the loss function defined in Equation 8 on a single dataset. After pretraining, the model parameters are fixed. We utilize the pretrained model to obtain node representations and follow the instructions in Section 2.3.1 to train the linear projector on specific tasks within the same dataset. Finally, we evaluate the performance of our model on unseen datasets and tasks. Throughout all phases, the parameters of the language model remain fixed. We use GraphSAGE [11] as our graph encoder and Vicuna-7B-v1.5 [7] as the foundational language model.
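As an example of the feature-construction step, the sketch below encodes each node's raw text with a pretrained BERT. The checkpoint name and the use of the [CLS] vector are assumptions; the paper only states that a pretrained BERT model is used to generate node features.

```python
import torch
from transformers import AutoModel, AutoTokenizer

@torch.no_grad()
def encode_node_texts(texts: list[str], device: str = "cuda") -> torch.Tensor:
    """Return an (N, 768) node feature matrix from the raw texts of N nodes."""
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased").to(device).eval()
    features = []
    for text in texts:
        batch = tokenizer(text, truncation=True, max_length=512, return_tensors="pt").to(device)
        features.append(model(**batch).last_hidden_state[:, 0])   # [CLS] representation
    return torch.cat(features, dim=0)
```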

3 Experimental results

In this section, comprehensive experiments are conducted to validate the effectiveness of TEA-GLM. These experiments aim to investigate the following research questions:

  • RQ1:

    How effective is TEA-GLM in handling the cross-dataset zero-shot learning problem?

  • RQ2:

    How well does TEA-GLM transfer knowledge when adapted to an unseen task and dataset in a zero-shot setting?

  • RQ3:

    What is the contribution of the feature-wise contrastive learning and graph token embeddings to the zero-shot learning ability of TEA-GLM?

3.1 Experimental setup

Datasets

We test TEA-GLM across eight widely used datasets spanning two distinct domains. Within the citation domain, we employ Arxiv [17], Pubmed [13], and an expanded version of Cora [35] with a larger class set and scale. In these datasets, each node represents an individual paper, with edges indicating citation relationships. In the e-commerce domain, we utilize datasets from the TAG benchmark [41], including Children (Book-Children), History (Book-History), Computer (Ele-Computer), Photo (Ele-Photo), and Sports (Sports-Fitness). Here, nodes represent distinct products, while edges denote co-viewing or co-purchasing between two products. Appendix A presents the statistics for these datasets.

Baselines

We conduct a comprehensive comparison of TEA-GLM with various categories of baseline methods: (i) Non-graph neural network approaches, such as MLP, which employs a multilayer perceptron for node representation; (ii) Supervised methods, including GCN [20], GraphSAGE [11], and GAT [31]; (iii) Self-supervised methods like DGI [32], which maximizes mutual information to learn node representations without relying on ground-truth labels; (iv) Graph knowledge distillation frameworks: GKD [42], which distills knowledge from a teacher GNN trained on a complete graph to a student GNN operating on a smaller or sparser graph, and GLNN [51], which combines the advantages of graph neural networks and MLPs via knowledge distillation to reduce dependency on the inference graph; (v) Graph transformer networks, including NodeFormer [36] and DIFFormer [37]; (vi) Large language models, such as Vicuna-7B-v1.5; (vii) The latest models with transfer and zero-shot capabilities, such as OFA [24], GraphGPT [30], and LLaGA [2].

Implementation details

For datasets within the citation domain, we follow the data split methodology outlined in GraphGPT [30]. For those within the e-commerce domain, we use the scripts provided by the TAG benchmark [41] to generate data splits. To ensure comparability among methods, identical data splits are applied to all models. To assess the performance of TEA-GLM, we employ three commonly adopted evaluation metrics: Accuracy and Macro F1 for node classification, and AUC (Area Under the Curve) for link prediction. To ensure result robustness, we conduct five runs with random seeds from 0 to 4 and report the mean and standard deviation. Due to space constraints, several experimental results, such as the Macro F1 results for node classification (Appendix B.2), the legality rate of answers produced by the LLM (Appendix B.1), and the parameter sensitivity analysis (Appendix C), are reported in the Appendix.

In the pre-training phase of the GNN, we set the number of GNN layers to 2. We use a batch size of 512 for 60 epochs and a learning rate of $2\times10^{-2}$. During the training of the linear projector, we use a batch size of 2 per GPU for one epoch, with a learning rate of $1\times10^{-3}$. The Adam optimizer is employed for all approaches. For baseline models, we tune hyperparameters and use the optimal settings. All experiments are conducted on 2 NVIDIA A100 GPUs with 80GB memory each, using CUDA version 11.7.

Table 1: Accuracy of cross-dataset zero-shot node classification. Pubmed and Cora are citation datasets; Children, History, Photo, and Sports are e-commerce datasets.

| Model type | Model | Pubmed | Cora | Children | History | Photo | Sports |
|---|---|---|---|---|---|---|---|
| | MLP | 0.323±0.027 | 0.021±0.006 | 0.029±0.037 | 0.080±0.041 | 0.110±0.070 | 0.042±0.021 |
| GNN as predictor | GCN | 0.288±0.092 | 0.017±0.004 | 0.030±0.018 | 0.063±0.042 | 0.103±0.047 | 0.042±0.025 |
| GNN as predictor | GraphSAGE | 0.316±0.058 | 0.014±0.007 | 0.008±0.007 | 0.195±0.206 | 0.056±0.055 | 0.051±0.015 |
| GNN as predictor | GAT | 0.343±0.064 | 0.016±0.004 | 0.086±0.084 | 0.172±0.098 | 0.050±0.027 | 0.142±0.138 |
| GNN as predictor | DGI | 0.329±0.103 | 0.026±0.009 | 0.082±0.035 | 0.218±0.168 | 0.224±0.127 | 0.049±0.017 |
| GNN as predictor | GKD | 0.399±0.033 | 0.042±0.008 | 0.202±0.064 | 0.339±0.138 | 0.166±0.086 | 0.208±0.077 |
| GNN as predictor | GLNN | 0.390±0.011 | 0.031±0.006 | 0.187±0.012 | 0.283±0.021 | 0.403±0.019 | 0.317±0.048 |
| GNN as predictor | NodeFormer | 0.308±0.093 | 0.016±0.007 | 0.048±0.028 | 0.168±0.127 | 0.073±0.015 | 0.165±0.057 |
| GNN as predictor | DIFFormer | 0.361±0.071 | 0.029±0.014 | 0.129±0.030 | 0.275±0.171 | 0.321±0.055 | 0.306±0.131 |
| GNN as predictor | OFA | 0.314±0.059 | 0.130±0.019 | 0.064±0.086 | 0.052±0.049 | 0.340±0.026 | 0.101±0.071 |
| LLM as predictor | Vicuna-7B-v1.5 | 0.719±0.010 | 0.156±0.001 | 0.270±0.001 | 0.363±0.001 | 0.378±0.004 | 0.370±0.001 |
| LLM as predictor | Vicuna-7B-SPT | 0.768±0.036 | 0.168±0.018 | 0.227±0.015 | 0.281±0.088 | 0.350±0.061 | 0.230±0.018 |
| LLM as predictor | GraphGPT-std | 0.701 | 0.126 | - | - | - | - |
| LLM as predictor | GraphGPT-cot | 0.521 | 0.181 | - | - | - | - |
| LLM as predictor | LLaGA | 0.793±0.036 | 0.168±0.032 | 0.199±0.007 | 0.146±0.067 | 0.276±0.069 | 0.352±0.033 |
| LLM as predictor | TEA-GLM | 0.848±0.010 | 0.202±0.014 | 0.271±0.010 | 0.528±0.058 | 0.497±0.027 | 0.404±0.010 |

3.2 Cross-dataset zero-shot ability (RQ1)

We train all methods on Arxiv and Computer, respectively, followed by an evaluation of their zero-shot performance on datasets from the same domain. Zero-shot learning poses challenges for GNN-based models, particularly regarding variations in the number of classes across datasets. To address this, we adopt the setting outlined in GraphGPT [30]: for each target dataset, we utilize the GNN backbone trained on the source dataset along with a classifier (typically a linear layer) trained on the target data. Due to the considerable time cost of training and evaluating GraphGPT on the e-commerce datasets, we only report its performance on the citation datasets as provided in its paper; "-std" and "-cot" denote the standard dual-stage graph instruction tuning procedure and the CoT instruction dataset generated by an LLM, respectively. To highlight the difference between our work and soft prompt tuning, we also fine-tune Vicuna-7B-v1.5 with soft prompts (Vicuna-7B-SPT) and report its results. The Accuracy results are presented in Table 1. As mentioned earlier, we report the Macro F1 results in Appendix B.2 and the results on the two training datasets in Appendix B.3.

The results clearly demonstrate that TEA-GLM outperforms all state-of-the-art (SOTA) models, resulting in significant improvements. Comparative analysis with baseline models across all datasets highlights the robust generalization capability of TEA-GLM. Models utilizing GNN as a predictor face challenges in achieving cross-dataset transferability with traditional supervised and self-supervised learning methods. Even recently developed robust GNN-based models, such as NodeFormer, DIFFormer, and GKD, encounter similar issues. In the case of OFA, a recent framework for cross-domain learning, strong transferability is observed between topic-related datasets such as Arxiv and Cora (both related to computer science). Nevertheless, its generalization performance notably decreases on datasets with lower topic relevance, such as those in the e-commerce domain.

LLM-based solutions, such as Vicuna-7B, demonstrate consistent performance across various datasets; nevertheless, their predictions rely on text information alone. Vicuna-7B-SPT also fails to achieve transferability on the e-commerce datasets, indicating that soft prompt tuning alone is insufficient when relying solely on node texts. This suggests that graph tokens indeed carry transferable graph information, enabling the LLM to make more accurate predictions. In contrast, GNN-LLM-combined solutions that use the LLM as a predictor demonstrate generalization ability but often face limitations. For instance, GraphGPT tends to underperform Vicuna-7B due to the lack of a graph foundation model. Instead of relying on a graph foundation model, LLaGA directly maps node representations without a GNN and can generalize on citation datasets; however, it shows limited generalization across the e-commerce datasets, which are more challenging due to their highly dissimilar topics. TEA-GLM, on the other hand, uses the principal components of the LLM's token embeddings to constrain the representations learned by the GNN, helping the graph representations transfer well to other datasets. Experimental results validate the superior generalization capabilities of TEA-GLM, achieved with less textual data and fewer parameters.

3.3 Cross-task zero-shot ability (RQ2)

We employ models trained on node classification tasks directly for link prediction tasks without any fine-tuning. We omit the comparison with models utilizing a GNN as the predictor, as evaluating these models across tasks without fine-tuning poses a significant challenge, given that different tasks typically correspond to different task heads. Here, we contrast TEA-GLM with OFA, which similarly enables cross-task testing without fine-tuning. Additionally, we compare TEA-GLM with Vicuna-7B and methods that use an LLM as the predictor, such as GraphGPT and LLaGA. For GraphGPT, we utilize the checkpoint released by the authors trained on Arxiv and report results on the citation datasets. The results are reported in Table 2.

In the case of OFA, although this framework facilitates cross-domain and cross-task learning, it exhibits negative transfer when lacking task-relevant data, particularly on unseen tasks. Benefiting from the generalization capability of large language models, both the fine-tuned and non-fine-tuned versions of Vicuna avoid negative transfer; however, due to the absence of graph information, their predictions often appear random. Conversely, GraphGPT shows transferability on familiar datasets, yet its performance declines on unseen datasets (Pubmed and Cora). Due to the absence of a GNN for filtering and aggregating graph information, LLaGA demonstrates unstable performance: while it exhibits cross-task transferability on citation datasets, its performance is poor on most e-commerce datasets. In contrast, TEA-GLM consistently outperforms all baseline methods on both unseen datasets and tasks, except on Sports, indicating its stronger generalization ability.

Table 2: AUC of zero-shot link prediction, using models trained only on node classification. Arxiv, Pubmed, and Cora are citation datasets; the remaining datasets are e-commerce datasets.

| Model | Arxiv | Pubmed | Cora | Children | History | Computer | Photo | Sports |
|---|---|---|---|---|---|---|---|---|
| OFA | 0.469 | 0.481 | 0.492 | 0.484 | 0.431 | 0.461 | 0.459 | 0.517 |
| Vicuna-7B-v1.5 | 0.513 | 0.543 | 0.527 | 0.500 | 0.515 | 0.502 | 0.501 | 0.502 |
| Vicuna-7B-SPT | 0.537 | 0.535 | 0.565 | 0.544 | 0.543 | 0.509 | 0.501 | 0.508 |
| GraphGPT-std | 0.649 | 0.501 | 0.520 | - | - | - | - | - |
| LLaGA | 0.570 | 0.569 | 0.537 | 0.422 | 0.449 | 0.479 | 0.478 | 0.597 |
| TEA-GLM | 0.657 | 0.689 | 0.586 | 0.571 | 0.579 | 0.554 | 0.545 | 0.553 |

3.4 Ablation study (RQ3)

We conduct an ablation study to examine two key components of our model: feature-wise contrastive learning and graph token embeddings. We remove each component from our model and test the resulting performance on cross-dataset and cross-task evaluations. The results are shown in Figure 2. "w/o FC" means that we pretrain the GNN without feature-wise contrastive learning, while "w/o GT" means predicting without graph token embeddings.

Figure 2: Ablation results of TEA-GLM and its variants (w/o FC and w/o GT) on cross-dataset and cross-task evaluations.

Without graph token embeddings, large language models lack crucial information from the graph, leading to a significant decline in performance on both node-level and edge-level tasks. GNNs pre-trained with feature-wise contrastive learning can obtain node representations aligned with the text space, enabling cross-dataset and cross-task generalization through a simple linear layer. When the feature-wise constraint for pre-training is absent, the model’s performance on the seen datasets (Arxiv and Computer) for the training task improves slightly. However, its performance on unseen datasets declines. Although it remains relatively stable when handling tasks of the same category, its performance decreases notably when dealing with unseen tasks (link prediction). These results indicate that alignment between graph representation and LLM’s token embeddings via feature-wise contrastive learning is important for cross-task zero-shot transfer.

4 Related work

4.1 Graph neural networks

In the field of graph machine learning, Graph Neural Networks (GNNs) have garnered significant attention [5, 22, 28, 40, 9, 6, 46, 1]. The primary strategy of most GNNs is to capture underlying message-passing patterns for graph representation. Several effective neural network architectures have been proposed, such as the Graph Attention Network (GAT) [31], Graph Convolution Network (GCN) [20], and GraphSAGE [11]. Recently, there has been a surge of interest in exploring transformer-based encoders for graph machine learning [49, 45, 36, 37]. However, a notable limitation of GNNs is their generalization capability: GNNs are typically trained on specific tasks within particular datasets, and when faced with new datasets or tasks, they often struggle to perform consistently well across different datasets or downstream tasks [19].

4.2 Self-supervised learning and prompt-tuning for GNNs

To alleviate the demand for labeled data and enhance the robustness of graph models, self-supervised learning is commonly employed in GNN training [38, 52, 12]. Methods like Deep Graph Infomax (DGI) [32] utilize mutual information maximization for pre-training. Other approaches, such as GraphCL [44], GCA [53], GCC [26], and JOAO [47], learn node representations by contrasting positive and negative samples. GraphMAE [15, 16], on the other hand, learns representations by generating samples that resemble the original graph structure. However, these methods typically require fine-tuning task-specific heads for downstream applications.

Various methods have explored prompt techniques to enhance the generalization of GNNs. To address the inconsistency between pre-training and downstream task objectives, GraphPrompt [25] proposes a unified task template applicable to both stages. Additionally, ProG [29] reformulates various task types into a unified graph-level representation and employs meta-learning to enhance multi-task learning capabilities. However, whether through self-supervised learning or graph prompt methods, fine-tuning is often necessary when handling new datasets. Moreover, when confronted with datasets containing varying numbers of categories, task heads must be retrained to achieve optimal performance.

4.3 Large language models for graphs

With the rapid advancement of Large Language Models (LLMs) and their remarkable generalization capabilities, leveraging LLMs to address transferability issues in graph machine learning has garnered significant attention [10, 14]. Some methods represent graph structure information as text input to LLMs [3, 33, 23]; however, this approach often leads to suboptimal solutions [18]. Another paradigm uses LLMs as enhancers [43, 48, 39, 4, 24] that generate data or node text representations; since GNNs are still ultimately used for prediction, this significantly limits the model's transferability. Recently, considerable efforts have been made to utilize LLMs as predictors. For instance, GraphGPT [30] attempts to align LLMs with pre-trained graph transformer encoders through two-stage fine-tuning; however, the fine-tuning, conducted on specific datasets, might weaken the method's transferability. In light of this, LLaGA [2] introduced an encoding method that directly translates graph data into sequences compatible with LLMs, but this may compromise performance due to the lack of GNN filtering and aggregation of graph information. Inspired by these challenges, we propose a pre-training strategy that enhances GNN transferability by aligning its representations with the token embeddings of LLMs, resulting in improved performance on zero-shot tasks. Notably, similar to our method, TEST [27] aligns time series representations with several selected LLM token embeddings. Our approach differs in that we project graph representations into a feature space defined by the principal components of LLM token embeddings, enabling the LLM to function as a zero-shot learner for graph machine learning tasks rather than merely enhancing performance on specific, seen tasks.

5 Limitations

While our TEA-GLM framework demonstrates considerable promise in enhancing zero-shot learning for graph-based tasks, it does have some limitations. Although the framework we designed can be easily applied to graph-level tasks, we have not yet explored the model’s performance through specific experiments. This will be addressed in our future work.

6 Conclusion

This paper introduces TEA-GLM, a framework that enhances zero-shot learning in graph machine learning by aligning GNN representations with LLM token embeddings. TEA-GLM uses a linear projector to map graph representations into graph token embeddings and incorporates a unified instruction design to handle various graph tasks at different levels. This approach enables consistent performance across various datasets and tasks without task-specific fine-tuning. Extensive experiments show that TEA-GLM outperforms state-of-the-art methods in accuracy and generalization, demonstrating its effectiveness and efficiency in zero-shot learning for graph tasks.

7 Acknowledgement

This work was supported by the National Key R&D Program of China (2023YFC3304700). Dr. Junjie Wu’s work was partially supported by the National Natural Science Foundation of China (72242101, 72031001) and Outstanding Young Scientist Program of Beijing Universities (JWZQ20240201002).

References

  • Chen et al. [2018] Jie Chen, Tengfei Ma, and Cao Xiao. FastGCN: Fast learning with graph convolutional networks via importance sampling. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rytstxWAW.
  • Chen et al. [2024a] Runjin Chen, Tong Zhao, Ajay Jaiswal, Neil Shah, and Zhangyang Wang. LLaGA: Large language and graph assistant. In ICML, 2024a.
  • Chen et al. [2023] Zhikai Chen, Haitao Mao, Hang Li, Wei Jin, Hongzhi Wen, Xiaochi Wei, Shuaiqiang Wang, Dawei Yin, Wenqi Fan, Hui Liu, and Jiliang Tang. Exploring the potential of large language models (LLMs) in learning on graphs. In NeurIPS 2023 Workshop: New Frontiers in Graph Learning, 2023. URL https://openreview.net/forum?id=ScNNo7v4t0.
  • Chen et al. [2024b] Zhikai Chen, Haitao Mao, Hongzhi Wen, Haoyu Han, Wei Jin, Haiyang Zhang, Hui Liu, and Jiliang Tang. Label-free node classification on graphs with large language models (LLMs). In The Twelfth International Conference on Learning Representations, 2024b. URL https://openreview.net/forum?id=hESD2NJFg8.
  • Cheng et al. [2023] Jiashun Cheng, Man Li, Jia Li, and Fugee Tsung. Wiener graph deconvolutional network improves graph self-supervised learning. In AAAI, 2023. URL https://doi.org/10.1609/aaai.v37i6.25870.
  • Chiang et al. [2019] Wei-Lin Chiang, Xuanqing Liu, Si Si, Yang Li, Samy Bengio, and Cho-Jui Hsieh. Cluster-GCN: An efficient algorithm for training deep and large graph convolutional networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 257–266, 2019.
  • Chiang et al. [2023] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
  • Devlin et al. [2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019. URL https://aclanthology.org/N19-1423.
  • Gao et al. [2018] Hongyang Gao, Zhengyang Wang, and Shuiwang Ji. Large-scale learnable graph convolutional networks. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1416–1424, 2018. URL https://doi.org/10.1145/3219819.3219947.
  • Guo et al. [2023] Jiayan Guo, Lun Du, and Hengyu Liu. GPT4Graph: Can large language models understand graph structured data? An empirical evaluation and benchmarking. ArXiv, abs/2305.15066, 2023. URL https://api.semanticscholar.org/CorpusID:258865990.
  • Hamilton et al. [2017] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/5dd9db5e033da9c6fb5ba83c7a7ebea9-Paper.pdf.
  • Hassani and Khasahmadi [2020] Kaveh Hassani and Amir Hosein Khasahmadi. Contrastive multi-view representation learning on graphs. In Proceedings of the 37th International Conference on Machine Learning, 2020.
  • He et al. [2024] Xiaoxin He, Xavier Bresson, Thomas Laurent, Adam Perold, Yann LeCun, and Bryan Hooi. Harnessing explanations: LLM-to-LM interpreter for enhanced text-attributed graph representation learning. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=RXFVcynVe1.
  • He and Hooi [2024] Yufei He and Bryan Hooi. UniGraph: Learning a cross-domain graph foundation model from natural language. arXiv preprint arXiv:2402.13630, 2024.
  • Hou et al. [2022] Zhenyu Hou, Xiao Liu, Yukuo Cen, Yuxiao Dong, Hongxia Yang, Chunjie Wang, and Jie Tang. GraphMAE: Self-supervised masked graph autoencoders. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 594–604, 2022. URL https://doi.org/10.1145/3534678.3539321.
  • Hou et al. [2023] Zhenyu Hou, Yufei He, Yukuo Cen, Xiao Liu, Yuxiao Dong, Evgeny Kharlamov, and Jie Tang. GraphMAE2: A decoding-enhanced masked self-supervised graph learner. In Proceedings of the ACM Web Conference 2023, pages 737–746, 2023. URL https://doi.org/10.1145/3543507.3583379.
  • Hu et al. [2020] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open Graph Benchmark: Datasets for machine learning on graphs. In Advances in Neural Information Processing Systems, pages 22118–22133, 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/fb60d411a5c5b72b2e7d3527cfc84fd0-Paper.pdf.
  • Huang et al. [2023] Jin Huang, Xingjian Zhang, Qiaozhu Mei, and Jiaqi Ma. Can LLMs effectively leverage graph structural information: when and why. arXiv preprint arXiv:2309.16595, 2023.
  • Ju et al. [2023] Mingxuan Ju, Tong Zhao, Qianlong Wen, Wenhao Yu, Neil Shah, Yanfang Ye, and Chuxu Zhang. Multi-task self-supervised graph neural networks enable stronger task generalization. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=1tHAZRqftM.
  • Kipf and Welling [2017a] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2017a. URL https://openreview.net/forum?id=SJU4ayYgl.
  • Li et al. [2019] Jia Li, Zhichao Han, Hong Cheng, Jiao Su, Pengyun Wang, Jianfeng Zhang, and Lujia Pan. Predicting path failure in time-evolving graphs. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1279–1289, 2019. URL https://doi.org/10.1145/3292500.3330847.
  • Liu and Wu [2023] Chang Liu and Bo Wu. Evaluating large language models on graphs: Performance insights and comparative analysis. arXiv preprint arXiv:2308.11224, 2023.
  • Liu et al. [2024] Hao Liu, Jiarui Feng, Lecheng Kong, Ningyue Liang, Dacheng Tao, Yixin Chen, and Muhan Zhang. One for all: Towards training one graph model for all classification tasks. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=4IT2pgc9v6.
  • Liu et al. [2023] Zemin Liu, Xingtong Yu, Yuan Fang, and Xinming Zhang. GraphPrompt: Unifying pre-training and downstream tasks for graph neural networks. In Proceedings of the ACM Web Conference 2023, pages 417–428, 2023. URL https://doi.org/10.1145/3543507.3583386.
  • Qiu et al. [2020] Jiezhong Qiu, Qibin Chen, Yuxiao Dong, Jing Zhang, Hongxia Yang, Ming Ding, Kuansan Wang, and Jie Tang. GCC: Graph contrastive coding for graph neural network pre-training. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1150–1160, 2020. URL https://doi.org/10.1145/3394486.3403168.
  • Sun et al. [2024] Chenxi Sun, Hongyan Li, Yaliang Li, and Shenda Hong. TEST: Text prototype aligned embedding to activate LLM's ability for time series. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Tuh4nZVb0g.
  • Sun et al. [2021] Xiangguo Sun, Hongzhi Yin, Bo Liu, Hongxu Chen, Qing Meng, Wang Han, and Jiuxin Cao. Multi-level hyperedge distillation for social linking prediction on sparsely observed networks. In Proceedings of the Web Conference 2021, pages 2934–2945, 2021. URL https://doi.org/10.1145/3442381.3449912.
  • Sun et al. [2023] Xiangguo Sun, Hong Cheng, Jia Li, Bo Liu, and Jihong Guan. All in one: Multi-task prompting for graph neural networks. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 2120–2131, 2023. URL https://doi.org/10.1145/3580305.3599256.
  • Tang et al. [2023] Jiabin Tang, Yuhao Yang, Wei Wei, Lei Shi, Lixin Su, Suqi Cheng, Dawei Yin, and Chao Huang. GraphGPT: Graph instruction tuning for large language models. arXiv preprint arXiv:2310.13023, 2023.
  • Veličković et al. [2018] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rJXMpikCZ.
  • Veličković et al. [2019] Petar Veličković, William Fedus, William L. Hamilton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm. Deep graph infomax. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rklz9iAcKQ.
  • Wang et al. [2023] Heng Wang, Shangbin Feng, Tianxing He, Zhaoxuan Tan, Xiaochuang Han, and Yulia Tsvetkov. Can language models solve graph problems in natural language? In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=UDqHhbqYJV.
  • Wei et al. [2022] Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=gEZrGCozdqR.
  • Wen and Fang [2023] Zhihao Wen and Yuan Fang. Augmenting low-resource text classification with graph-grounded pre-training and prompting. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 506–516, 2023. doi: 10.1145/3539618.3591641. URL https://doi.org/10.1145/3539618.3591641.
  • Wu et al. [2022] Qitian Wu, Wentao Zhao, Zenan Li, David Wipf, and Junchi Yan. NodeFormer: A scalable graph structure learning transformer for node classification. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=sMezXGG5So.
  • Wu et al. [2023] Qitian Wu, Chenxiao Yang, Wentao Zhao, Yixuan He, David Wipf, and Junchi Yan. DIFFormer: Scalable (graph) transformers induced by energy constrained diffusion. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=j6zUzrapY3L.
  • Xia et al. [2022] Jun Xia, Lirong Wu, Jintao Chen, Bozhen Hu, and Stan Z. Li. SimGRACE: A simple framework for graph contrastive learning without data augmentation. In Proceedings of the ACM Web Conference 2022, pages 1070–1079, 2022. URL https://doi.org/10.1145/3485447.3512156.
  • Xia et al. [2024] Lianghao Xia, Ben Kao, and Chao Huang. OpenGraph: Towards open graph foundation models. arXiv preprint arXiv:2403.01121, 2024.
  • Xu et al. [2019] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=ryGs6iA5Km.
  • Yan et al. [2023] Hao Yan, Chaozhuo Li, Ruosong Long, Chao Yan, Jianan Zhao, Wenwen Zhuang, Jun Yin, Peiyan Zhang, Weihao Han, Hao Sun, Weiwei Deng, Qi Zhang, Lichao Sun, Xing Xie, and Senzhang Wang. A comprehensive study on text-attributed graphs: Benchmarking and rethinking. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=m2mbfoSuJ1.
  • Yang et al. [2022] Chenxiao Yang, Qitian Wu, and Junchi Yan. Geometric knowledge distillation: Topology compression for graph neural networks. In Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=7WGNT3MHyBm.
  • Ye et al. [2023] Ruosong Ye, Caiqi Zhang, Runhui Wang, Shuyuan Xu, and Yongfeng Zhang. Natural language is all a graph needs. arXiv preprint arXiv:2308.07134, 2023.
  • Ying et al. [2021] Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, and Tie-Yan Liu. Do transformers really perform badly for graph representation? In Advances in Neural Information Processing Systems, pages 28877–28888, 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/f1c1592588411002af340cbaedd6fc33-Paper.pdf.
  • You et al. [2020] Y. You, T. Chen, Z. Wang, and Y. Shen. L2-GCN: Layer-wise and learned efficient training of graph convolutional networks. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2124–2132, 2020. URL https://doi.ieeecomputersociety.org/10.1109/CVPR42600.2020.00220.
  • You et al. [2021] Yuning You, Tianlong Chen, Yang Shen, and Zhangyang Wang. Graph contrastive learning automated. In ICML, 2021. URL https://arxiv.org/abs/2106.07594.
  • Yu et al. [2023] Jianxiang Yu, Yuxiang Ren, Chenghua Gong, Jiaqi Tan, Xiang Li, and Xuecang Zhang. Empower text-attributed graphs learning with large language models (LLMs). arXiv preprint arXiv:2310.09872, 2023.
  • Yun et al. [2019] Seongjun Yun, Minbyul Jeong, Raehyun Kim, Jaewoo Kang, and Hyunwoo J. Kim. Graph transformer networks. In Advances in Neural Information Processing Systems, 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/9d63484abb477c97640154d40595a3bb-Paper.pdf.
  • Zhang et al. [2024] Mengmei Zhang, Mingwei Sun, Peng Wang, Shen Fan, Yanhu Mo, Xiaoxiao Xu, Hong Liu, Cheng Yang, and Chuan Shi. GraphTranslator: Aligning graph model to large language model for open-ended tasks. In Proceedings of the ACM Web Conference 2024, 2024.
  • Zhang et al. [2022] Shichang Zhang, Yozen Liu, Yizhou Sun, and Neil Shah. Graph-less neural networks: Teaching old MLPs new tricks via distillation. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=4p6_5HBWPCw.
  • Zhu et al. [2020] Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. Deep graph contrastive representation learning. arXiv preprint arXiv:2006.04131, 2020.
  • Zhu et al. [2021] Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. Graph contrastive learning with adaptive augmentation. In Proceedings of the Web Conference 2021, pages 2069–2080, 2021. URL https://doi.org/10.1145/3442381.3449802.

Appendix A Dataset description

| Domain | Dataset | #Nodes | #Edges | #Classes |
| --- | --- | --- | --- | --- |
| Citation | Arxiv | 169,343 | 1,166,243 | 40 |
| Citation | Pubmed | 19,717 | 44,338 | 3 |
| Citation | Cora | 25,120 | 91,140 | 70 |
| E-commerce | Ele-Computer | 87,229 | 721,081 | 10 |
| E-commerce | Ele-Photo | 48,362 | 500,928 | 12 |
| E-commerce | Book-Children | 76,875 | 1,554,578 | 24 |
| E-commerce | Book-History | 41,551 | 358,574 | 12 |
| E-commerce | Sports-Fitness | 173,055 | 1,773,500 | 13 |

Citation datasets

The Arxiv dataset[17] represents a directed citation network among Computer Science (CS) papers from the arXiv preprint server. Each node in this graph corresponds to a paper, while edges represent citation links. The PubMed dataset[13] comprises 19,717 scientific publications from the PubMed database related to diabetes, which are categorized into three distinct classes: Experimentally induced diabetes, Type 1 diabetes, and Type 2 diabetes. This classification reflects the focus of each publication within the broader context of diabetes research. Lastly, the Cora dataset[35], formally known as the “Cora Research Paper Classification Dataset”, provides a comprehensive network for analyzing research paper classifications in machine learning. It is an extended version of the dataset commonly referred to in other studies[21], featuring detailed categorizations.

E-commerce datasets

All e-commerce datasets are provided in the TAG benchmark[41]. The Books-Children and Books-History datasets are extracted from the Amazon-Books dataset. Books-Children includes items with the second-level label “Children”, while Books-History includes items with the second-level label “History”. Each dataset’s label corresponds to the third-level label of the book. The Ele-Computers dataset comprises items with the second-level label “Computers”, and Ele-Photo includes items with the second-level label “Photo”. Each of these datasets is labeled at the third level for electronic products. The Sports-Fitness dataset, sourced from the Amazon-Sports dataset, contains items with the second-level label “Fitness”. Nodes in this dataset represent fitness-related items, and an edge between two items indicates they are frequently co-purchased or co-viewed.

Appendix B More experimental results

B.1 Legality rate

Table 4: Legality rate (%) of each model on every dataset.

| Model | Arxiv | Computer | Pubmed | Cora | Children | History | Photo | Sports |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Vicuna-7B-v1.5 | 99.3 | 96.7 | 100.0 | 95.8 | 99.2 | 98.9 | 94.1 | 99.6 |
| LLaGA | 100.0 | 100.0 | 98.9 | 79.9 | 93.1 | 92.4 | 77.8 | 94.3 |
| TEA-GLM | 100.0 | 100.0 | 100.0 | 92.6 | 97.0 | 99.6 | 99.2 | 98.5 |

After training on specific datasets or tasks, large language models (LLMs) may produce invalid or incorrect answers to given questions. For instance, when handling unseen datasets or tasks, LLMs may generate responses that fall outside the set of acceptable answer candidates. To evaluate the impact of the training process on LLM performance, we follow the approach in [50] and use the legality rate to measure the proportion of valid answers produced by the model.
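As a minimal sketch of this metric, the function below computes the percentage of raw LLM answers that fall inside a dataset's candidate answer set; the function name and the simple string normalization are our own illustrative choices, not the exact evaluation script of [50].

```python
def legality_rate(predictions, candidates):
    """Percentage of model answers that fall inside the allowed candidate set."""
    allowed = {c.strip().lower() for c in candidates}
    answers = [str(p).strip().lower() for p in predictions]
    legal = sum(1 for a in answers if a in allowed)
    return 100.0 * legal / max(len(answers), 1)


# Example with the Pubmed label set: the third answer is illegal.
print(legality_rate(
    ["Type 1 diabetes", "Type 2 diabetes", "a kind of disease"],
    {"Experimentally induced diabetes", "Type 1 diabetes", "Type 2 diabetes"},
))  # 66.7 (approximately)
```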

Table 4 shows that LLaGA's illegality rate increases significantly when it is exposed to datasets it has not previously encountered, suggesting that the training methodology substantially affects both knowledge acquisition and the model's ability to generalize. Conversely, our model exhibits notably stable performance across diverse unseen datasets, achieving higher legality rates in several cases.

B.2 F1 score on node classification task

Table 5: Macro F1 score (mean ± std) on the node classification task. Citation datasets: Pubmed, Cora; E-commerce datasets: Children, History, Photo, Sports.

| Model type | Model | Pubmed | Cora | Children | History | Photo | Sports |
| --- | --- | --- | --- | --- | --- | --- | --- |
| - | MLP | 0.246±0.042 | 0.009±0.004 | 0.007±0.007 | 0.023±0.008 | 0.041±0.023 | 0.019±0.005 |
| GNN as predictor | GCN | 0.187±0.021 | 0.007±0.001 | 0.006±0.004 | 0.024±0.013 | 0.034±0.007 | 0.017±0.009 |
| GNN as predictor | GraphSAGE | 0.257±0.084 | 0.007±0.003 | 0.005±0.003 | 0.029±0.024 | 0.020±0.011 | 0.021±0.004 |
| GNN as predictor | GAT | 0.259±0.065 | 0.006±0.001 | 0.063±0.067 | 0.159±0.117 | 0.036±0.035 | 0.091±0.090 |
| GNN as predictor | DGI | 0.213±0.127 | 0.004±0.002 | 0.012±0.004 | 0.038±0.015 | 0.045±0.015 | 0.018±0.005 |
| GNN as predictor | GKD | 0.247±0.039 | 0.004±0.001 | 0.028±0.003 | 0.060±0.008 | 0.049±0.015 | 0.050±0.008 |
| GNN as predictor | GLNN | 0.221±0.033 | 0.006±0.001 | 0.021±0.003 | 0.064±0.007 | 0.057±0.002 | 0.052±0.003 |
| GNN as predictor | NodeFormer | 0.232±0.089 | 0.008±0.003 | 0.019±0.008 | 0.046±0.031 | 0.055±0.006 | 0.049±0.009 |
| GNN as predictor | DIFFormer | 0.187±0.007 | 0.007±0.002 | 0.002±0.002 | 0.050±0.019 | 0.069±0.010 | 0.045±0.007 |
| GNN as predictor | OFA | 0.287±0.059 | 0.091±0.013 | 0.017±0.010 | 0.026±0.007 | 0.103±0.007 | 0.043±0.021 |
| LLM as predictor | Vicuna-7B-v1.5 | 0.629±0.024 | 0.109±0.002 | 0.279±0.002 | 0.349±0.003 | 0.383±0.001 | 0.410±0.002 |
| LLM as predictor | GraphGPT-std | 0.649 | 0.082 | - | - | - | - |
| LLM as predictor | GraphGPT-cot | 0.482 | 0.127 | - | - | - | - |
| LLM as predictor | LLaGA | 0.778±0.056 | 0.108±0.014 | 0.163±0.029 | 0.144±0.025 | 0.362±0.039 | 0.446±0.035 |
| LLM as predictor | TEA-GLM | 0.839±0.012 | 0.148±0.015 | 0.252±0.005 | 0.365±0.011 | 0.421±0.032 | 0.430±0.009 |

Since there is no established metric that computes the F1 score while accounting for illegal responses, we adopt the methodology used in [50]: for the LLM-backbone models, we calculate the Macro F1 score only over the legal responses produced by the model. This calculation may not fully reflect a model's performance, so we also report the legality rate in Table 4. Note that the accuracy metric is unaffected by illegal responses, which are simply counted as errors.
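A minimal sketch of this protocol, assuming scikit-learn and string-valued labels, is shown below: predictions outside the candidate set are dropped before the Macro F1 score is computed, while accuracy (not shown) would simply count them as errors. The helper name is illustrative rather than taken from the released code.

```python
from sklearn.metrics import f1_score


def macro_f1_on_legal(y_true, y_pred, candidates):
    """Macro F1 computed only over responses that are legal answers."""
    allowed = set(candidates)
    kept = [(t, p) for t, p in zip(y_true, y_pred) if p in allowed]
    if not kept:
        return 0.0
    true_kept, pred_kept = zip(*kept)
    return f1_score(true_kept, pred_kept, average="macro")


# Example: the illegal answer "unknown" is excluded before scoring.
print(macro_f1_on_legal(
    ["cat A", "cat B", "cat A"],
    ["cat A", "cat B", "unknown"],
    {"cat A", "cat B"},
))
```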

B.3 Supervised results

Table 6: Supervised learning results (accuracy and Macro F1, mean ± std) on Arxiv and Computer.

| Model type | Model | Arxiv Acc | Arxiv F1 | Computer Acc | Computer F1 |
| --- | --- | --- | --- | --- | --- |
| - | MLP | 0.546±0.004 | 0.295±0.007 | 0.420±0.006 | 0.267±0.005 |
| GNN as predictor | GCN | 0.545±0.005 | 0.317±0.006 | 0.424±0.012 | 0.386±0.014 |
| GNN as predictor | GraphSAGE | 0.556±0.006 | 0.315±0.008 | 0.534±0.037 | 0.347±0.036 |
| GNN as predictor | GAT | 0.561±0.003 | 0.339±0.005 | 0.609±0.035 | 0.598±0.039 |
| GNN as predictor | DGI | 0.342±0.024 | 0.336±0.011 | 0.594±0.004 | 0.452±0.008 |
| GNN as predictor | GKD | 0.393±0.085 | 0.164±0.029 | 0.351±0.031 | 0.155±0.016 |
| GNN as predictor | GLNN | 0.602±0.004 | 0.362±0.008 | 0.393±0.005 | 0.243±0.007 |
| GNN as predictor | NodeFormer | 0.544±0.016 | 0.297±0.029 | 0.434±0.012 | 0.288±0.012 |
| GNN as predictor | DIFFormer | 0.616±0.025 | 0.356±0.024 | 0.629±0.012 | 0.467±0.022 |
| GNN as predictor | OFA | 0.682±0.006 | 0.495±0.006 | 0.753±0.004 | 0.687±0.006 |
| LLM as predictor | Vicuna-7B-v1.5 | 0.347±0.000 | 0.164±0.001 | 0.372±0.010 | 0.304±0.002 |
| LLM as predictor | GraphGPT-std | 0.626 | 0.262 | - | - |
| LLM as predictor | GraphGPT-cot | 0.576 | 0.228 | - | - |
| LLM as predictor | LLaGA | 0.749±0.001 | 0.575±0.003 | 0.642±0.004 | 0.562±0.001 |
| LLM as predictor | TEA-GLM | 0.655±0.001 | 0.445±0.002 | 0.578±0.002 | 0.496±0.010 |

We report the supervised learning results in Table 6. The GNN-backbone models continue to demonstrate robust performance in fitting the training data, and LLaGA is likewise effective in the supervised setting. However, despite their strong performance on the training datasets, these models exhibit limited generalization to unseen datasets, as shown in Table 1 and Table 5.

Appendix C Parameter sensitivity analysis

[Figure 3 and Figure 4: parameter sensitivity on the node classification task with respect to the number of graph token embeddings (K) and the number of principal components (P).]
Number of graph token embeddings

To examine the impact of the number of graph token embeddings, we set $K \in \{1, 3, 5, 10\}$ and report the results on the node classification task in Figure 3. We observe two distinct patterns on the training datasets and the unseen datasets. On the training dataset, increasing the number of graph token embeddings yields a slight improvement in performance, suggesting that in a supervised setting performance can be enhanced by increasing the number of graph token embeddings. Conversely, on unseen datasets our model requires only a minimal number of graph token embeddings to achieve satisfactory performance, indicating that the number of learnable parameters in our model is significantly smaller than in concurrent works.
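As a rough sketch of what this knob controls, the snippet below shows a hypothetical linear projector that maps one GNN representation to K graph token embeddings in the LLM's embedding space; the class name and the dimensions (768 for the GNN output, 4096 for the LLM token embeddings) are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn


class GraphTokenProjector(nn.Module):
    """Illustrative sketch: map one GNN representation to K graph token embeddings."""

    def __init__(self, gnn_dim: int = 768, llm_dim: int = 4096, k: int = 5):
        super().__init__()
        self.k = k
        self.llm_dim = llm_dim
        self.proj = nn.Linear(gnn_dim, k * llm_dim)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: [batch, gnn_dim] -> [batch, K, llm_dim], ready to be prepended
        # to the LLM's input token embeddings.
        return self.proj(h).view(-1, self.k, self.llm_dim)


tokens = GraphTokenProjector()(torch.randn(2, 768))
print(tokens.shape)  # torch.Size([2, 5, 4096])
```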

Number of principal components

We define $P \in \{0, 100, 1000, 2000, 3000\}$ and report the results on the node classification task in Figure 4. In supervised learning scenarios, omitting contrastive learning with principal components ($P = 0$) can lead to a slight increase in accuracy, but it often makes the model more prone to overfitting the training datasets. When the number of principal components is too small, it adversely affects the model's learning capability. Remarkably, when $P = 1000$ the model demonstrates satisfactory performance; at this level, the principal components capture 50% of the variance of the LLM's token embeddings.
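As a rough illustration of this knob, the sketch below extracts P principal components from a token-embedding matrix and reports their cumulative explained variance with scikit-learn's PCA. The matrix shape, variable names, and component count are illustrative stand-ins (the paper's setting corresponds to P = 1000 over the LLM's full token-embedding matrix), and how the components enter the contrastive objective is described in the main text rather than here.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the frozen LLM's token-embedding matrix; the real matrix would be
# roughly vocab_size x hidden_dim (e.g., 32000 x 4096 for Vicuna-7B). Smaller
# illustrative sizes keep this sketch fast to run.
rng = np.random.default_rng(0)
token_embeddings = rng.standard_normal((8000, 512)).astype(np.float32)

P = 100  # illustrative; the paper uses P = 1000 on the full embedding matrix
pca = PCA(n_components=P)
pca.fit(token_embeddings)                 # rows = tokens, columns = dimensions
principal_directions = pca.components_    # shape [P, 512]
explained = pca.explained_variance_ratio_.sum()
print(principal_directions.shape, f"explained variance: {explained:.1%}")
```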

Appendix D Complete instructions

[Figure: complete instructions for node classification and link prediction.]

In node classification tasks, we provide candidate labels to facilitate the model’s learning process, focusing on discovering the correct answers rather than merely memorizing them. For link prediction, we structure the instructions in a format similar to that of node classification. This approach is designed to enhance the model’s ability to transfer learned knowledge effectively across different tasks.
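Below is a hypothetical sketch of how such an instruction with candidate labels could be assembled; the placeholder token, wording, and function name are illustrative assumptions, and the exact templates used by TEA-GLM are those shown in the figure above and in the released code.

```python
def build_instruction(task, candidates=(), k=5):
    """Illustrative prompt builder (not the paper's exact template).

    `<graph>` marks where the K graph token embeddings are inserted into the
    LLM input; candidate labels are listed for node classification so the
    model selects an answer rather than free-generating one.
    """
    graph_tokens = " ".join(["<graph>"] * k)
    if task == "node_classification":
        labels = ", ".join(candidates)
        return (f"Given a node represented by the graph tokens {graph_tokens}, "
                f"which of the following categories does it belong to? "
                f"Candidates: {labels}. Answer with one candidate.")
    if task == "link_prediction":
        return (f"Given two nodes represented by {graph_tokens} and {graph_tokens}, "
                f"are they connected? Answer yes or no.")
    raise ValueError(f"unknown task: {task}")


print(build_instruction(
    "node_classification",
    ["Experimentally induced diabetes", "Type 1 diabetes", "Type 2 diabetes"],
))
```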

Appendix E Cross-task zero-shot results with different pooling methods

Table 7: Cross-task zero-shot results with different pooling methods on the citation datasets.

| Model | Arxiv | Pubmed | Cora |
| --- | --- | --- | --- |
| OFA | 0.469 | 0.481 | 0.492 |
| Vicuna-7B-v1.5 | 0.513 | 0.543 | 0.527 |
| Vicuna-7B-SPT | 0.537 | 0.535 | 0.565 |
| GraphGPT-std | 0.649 | 0.501 | 0.520 |
| LLaGA | 0.570 | 0.569 | 0.537 |
| TEA-GLM (max) | 0.639 | 0.650 | 0.566 |
| TEA-GLM (sum) | 0.657 | 0.689 | 0.586 |
| TEA-GLM (mean) | 0.659 | 0.690 | 0.588 |

Considering that different pooling methods may affect cross-task performance, we conducted experiments with three common pooling methods (max, sum, and mean) separately; the results are shown in Table 7.
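For reference, the three pooling operators compared in Table 7 can be written as the following minimal sketch over a set of embeddings; this is a generic illustration rather than TEA-GLM's exact code path.

```python
import torch


def pool(embeddings: torch.Tensor, method: str = "mean") -> torch.Tensor:
    """Pool a set of embeddings [num_items, dim] into a single vector [dim]."""
    if method == "mean":
        return embeddings.mean(dim=0)
    if method == "sum":
        return embeddings.sum(dim=0)
    if method == "max":
        return embeddings.max(dim=0).values
    raise ValueError(f"unknown pooling method: {method}")


x = torch.randn(10, 768)
print([pool(x, m).shape for m in ("mean", "sum", "max")])
```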
