Machine Learning with Graphs : 6 Graph Neural Networks

Notes on Stanford CS224W: Machine Learning with Graphs (Fall 2021), Lecture 6: Graph Neural Networks.

1. Overview

1.1 Node Embeddings

Limitations of shallow embedding methods:

  • \(\mathcal{O}(|V|)\) parameters are needed

  • No sharing of parameters between nodes

  • Every node has its own unique embedding

Inherently transductive:

  • Cannot generate embeddings for nodes that are not seen during training

Do not incorporate node features

  • Nodes in many graphs have features that we can and
    should leverage

1.2 Deep Graph Encoders

Graph neural networks (GNNs):

\[\text{ENC}(v) = \begin{array}{c} \text{multiple layers of non-linear transformations } \\ \text{based on graph structure} \end{array} \]

All these deep encoders can be combined with the node similarity functions defined in Lecture 3

1.2.1 Tasks on Networks

Tasks we will be able to solve:

  • Node classification

    • Predict the type of a given node
  • Link prediction

    • Predict whether two nodes are linked
  • Community detection

    • Identify densely linked clusters of nodes
  • Network similarity

    • How similar are two (sub)networks

2. Basics of Deep Learning

3. Deep Learning for Graphs

3.1 Setup

Assumption

  • \(G\) : a graph

  • \(V\) : vertex set

  • \(\boldsymbol{A}\) : adjacency matrix (assume binary)

  • \(\boldsymbol{X} \in \mathbb{R}^{m \times |V|}\) : a matrix of node features

  • \(v \in V\) : a node

  • \(N(v)\) : the set of neighbors of \(v\)

Node features :

  • Social networks: User profile, User image

  • Biological networks: Gene expression profiles, gene
    functional information

  • When there is no node feature in the graph dataset:

    • Indicator vectors (one-hot encoding of a node)

    • Vector of constant \(1: [1, 1, \cdots, 1]\)
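Both fallbacks are one line of NumPy; a minimal sketch (the 5-node size is arbitrary, for illustration):

```python
import numpy as np

num_nodes = 5

# One-hot (indicator) features: node i gets the i-th standard basis vector
X_onehot = np.eye(num_nodes)

# Constant features: every node gets the vector [1] (here 1-dimensional)
X_const = np.ones((num_nodes, 1))

print(X_onehot.shape)  # (5, 5)
print(X_const.shape)   # (5, 1)
```

One-hot features make the model memorize per-node information (and cannot transfer to new nodes), while constant features let the model rely purely on graph structure.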

3.2 Permutation Invariance and Equivariance

Graph representations should respect permutation invariance / equivariance: relabeling the nodes must not change the result.

For graph representation:

  • Consider learning a function \(f\) that maps a graph \(G = (\boldsymbol{A}, \boldsymbol{X})\) to a vector in \(\mathbb{R}^d\)

  • If \(f(\boldsymbol{A}, \boldsymbol{X}) = f(\boldsymbol{P}\boldsymbol{A}\boldsymbol{P}^{\top}, \boldsymbol{P}\boldsymbol{X})\) for any permutation matrix \(\boldsymbol{P}\), we say \(f\) is a permutation invariant function

For node representation:

  • Consider learning a function \(f\) that maps a graph \(G = (\boldsymbol{A}, \boldsymbol{X})\) to a matrix in \(\mathbb{R}^{m \times d}\)

    • The graph has \(m\) nodes; each row is the embedding of a node

  • If permuting the input nodes permutes the output rows in the same way, i.e., \(\boldsymbol{P} f(\boldsymbol{A}, \boldsymbol{X}) = f(\boldsymbol{P}\boldsymbol{A}\boldsymbol{P}^{\top}, \boldsymbol{P}\boldsymbol{X})\) for any permutation matrix \(\boldsymbol{P}\), we say \(f\) is a permutation equivariant function
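Both properties can be checked numerically on a toy graph; a minimal NumPy sketch (the graph, features, and function names are made-up illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 4                                   # number of nodes
A = rng.integers(0, 2, size=(m, m))     # toy adjacency matrix
A = np.triu(A, 1); A = A + A.T          # symmetric, zero diagonal
X = rng.normal(size=(m, 3))             # toy node features

P = np.eye(m)[[2, 0, 3, 1]]             # a permutation matrix

def f_invariant(A, X):
    # Sum-pool over all nodes: a graph-level output, order-independent
    return X.sum(axis=0)

def f_equivariant(A, X):
    # One step of neighbor aggregation: a node-level output
    return A @ X

# Invariance: f(PAP^T, PX) == f(A, X)
assert np.allclose(f_invariant(P @ A @ P.T, P @ X), f_invariant(A, X))
# Equivariance: f(PAP^T, PX) == P f(A, X)
assert np.allclose(f_equivariant(P @ A @ P.T, P @ X),
                   P @ f_equivariant(A, X))
```

The equivariance check works because \(\boldsymbol{P}^{\top}\boldsymbol{P} = \boldsymbol{I}\): permuting the adjacency matrix and features simply permutes the aggregated rows.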

3.3 Overview of Graph Neural Networks

Graph neural networks consist of multiple permutation equivariant / invariant functions

  • Other neural network architectures, e.g., MLPs, are not permutation invariant / equivariant

4. Graph Convolutional Networks

Idea : Node's neighborhood defines a computation graph

Learn how to propagate information across the graph to compute node features

  • Determine node computation graph

  • Propagate and transform information

4.1 Neighborhood aggregation

Key idea : Generate node embeddings based on local network neighborhoods

4.1.1 Aggregate Neighbors

Intuition :

  • Nodes aggregate information from their neighbors using neural networks

  • Network neighborhood defines a computation graph

Deep Model : Model can be of arbitrary depth :

  • Nodes have embeddings at each layer

  • Layer-0 embedding of node \(v\) is its input feature, \(x_v\)

  • Layer-\(k\) embedding gets information from nodes that are \(k\) hops away

4.1.2 Neighborhood Aggregation

Neighborhood aggregation : Key distinctions are in how different approaches aggregate information across the layers

Basic approach : Average information from neighbors and apply a neural network

(1) average messages from neighbors

(2) apply neural network

4.2 The Math: Deep Encoder

Basic approach : Average neighbor messages and apply a neural network

\[\begin{align*} & \boldsymbol{h}_v^{(0)} = \boldsymbol{x}_v \\ & \boldsymbol{h}_v^{(k+1)} = \sigma \left( \boldsymbol{W}_k \cdot \sum_{u \in N(v)} \frac{\boldsymbol{h}_u^{(k)}}{|N(v)|} + \boldsymbol{B}_k \cdot \boldsymbol{h}_v^{(k)} \right), \quad \forall k \in \{0, 1, \cdots, K-1 \} \\ & \boldsymbol{z}_v = \boldsymbol{h}_v^{(K)} \end{align*} \]

  • where \(\boldsymbol{h}_v^{(0)} = \boldsymbol{x}_v\) : the initial (layer-0) embedding is the node's input features

  • \(\boldsymbol{z}_v = \boldsymbol{h}_v^{(K)}\) is the embedding after \(K\) layers of neighborhood aggregation

  • \(\sigma(\cdot)\) is a non-linear activation function (e.g., ReLU)

  • \(\sum \limits_{u \in N(v)} \frac{\boldsymbol{h}_u^{(k)}}{|N(v)|}\) : the average of the neighbors' previous-layer embeddings

    Notice summation is a permutation invariant pooling / aggregation.

  • \(\boldsymbol{h}_v^{(k)}\) is embedding of \(v\) at layer \(k\)

  • \(K\) is the total number of layers
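The update above can be sketched per layer in NumPy (dense matrices, ReLU as \(\sigma\); the triangle graph and dimensions are illustrative):

```python
import numpy as np

def gcn_layer(A, H, W, B):
    """One layer of the update above: average neighbor messages, transform
    with W, add the B-transformed self-embedding, apply ReLU.
    Assumes every node has at least one neighbor, so |N(v)| > 0."""
    deg = A.sum(axis=1, keepdims=True)      # |N(v)| for each node v
    neigh_avg = (A @ H) / deg               # sum_{u in N(v)} h_u / |N(v)|
    return np.maximum(0.0, neigh_avg @ W.T + H @ B.T)   # sigma = ReLU

# Toy example: triangle graph, 2-dim input features, 3-dim hidden layer
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)
rng = np.random.default_rng(1)
H0 = rng.normal(size=(3, 2))                # h_v^{(0)} = x_v
W0 = rng.normal(size=(3, 2))                # W_k, shape (d_out, d_in)
B0 = rng.normal(size=(3, 2))                # B_k, shape (d_out, d_in)
H1 = gcn_layer(A, H0, W0, B0)
print(H1.shape)  # (3, 3)
```

Stacking \(K\) such calls (with layer-specific `W`, `B`) gives the final embeddings \(\boldsymbol{z}_v = \boldsymbol{h}_v^{(K)}\).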

4.2.1 Equivariant Property

Equivariant Property: Message passing and neighbor aggregation in graph convolution networks is permutation equivariant.

4.3 Training the Model

\(\boldsymbol{W}_k\) and \(\boldsymbol{B}_k\) are trainable weight matrices (i.e., what we learn)

We can feed these embeddings into any loss function and run SGD to train the weight parameters

  • \(\boldsymbol{h}_v^{k}\) : the hidden representation of node \(v\) at layer \(k\)

  • \(\boldsymbol{W}_k\) : weight matrix for neighborhood aggregation

  • \(\boldsymbol{B}_k\) : weight matrix for transforming hidden vector of self

4.3.1 Matrix Formulation

Many aggregations can be performed efficiently by (sparse) matrix operations

  • Let \(\mathbf{H}^{(k)} = \left[\boldsymbol{h}_1^{(k)}, \cdots, \boldsymbol{h}_{|V|}^{(k)} \right]^{\top}\), the matrix of hidden embeddings

  • \(\sum \limits_{u \in N(v)} \boldsymbol{h}_u^{(k)} = \boldsymbol{A}_{v \, \cdot} \, \mathbf{H}^{(k)}\)

  • \(\mathbf{D}\) is a diagonal matrix : \(\mathbf{D}_{vv} = \text{Deg}(v) = |N(v)|\)

    • \(\text{Deg}(v)\) is the degree of node \(v\) ; a.k.a. \(|N(v)|\), the number of neighbors of the node \(v\).

    • The inverse \(\mathbf{D}^{-1}\) is also diagonal : \((\mathbf{D}^{-1})_{vv} = 1 / |N(v)|\)

Therefore, we have:

\[\sum_{u \in N(v)} \frac{\boldsymbol{h}_u^{(k)}}{|N(v)|} \qquad \Rightarrow \qquad \mathbf{H}^{(k+1)} = \mathbf{D}^{-1} \mathbf{A} \mathbf{H}^{(k)} \]

Re-writing update function in matrix form:

\[\mathbf{H}^{(k+1)} = \sigma \big( \ \ \underbrace{ \mathbf{\tilde{A}} \mathbf{H}^{(k)} \mathbf{W}_k^{\top}}_{ \begin{array}{cc}\text{neighborhood} \\ \text{aggregation} \end{array}} \ \ + \underbrace{\mathbf{H}^{(k)} \mathbf{B}_k^{\top}}_{\begin{array}{cc}\text{self} \\ \text{transformation} \end{array}} \big) \]

  • where \(\mathbf{\tilde{A}} = \mathbf{D}^{-1} \mathbf{A}\)

In practice, this implies that efficient sparse matrix multiplication can be used (\(\mathbf{\tilde{A}}\) is sparse)

Note : not all GNNs can be expressed in this matrix form, e.g., when the aggregation function is complex.
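As a sanity check, the matrix form can be verified against the per-node definition; a NumPy sketch (dense matrices for readability, where a real implementation would use a sparse format such as CSR):

```python
import numpy as np

rng = np.random.default_rng(2)
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [1, 1, 1, 0]], dtype=float)   # toy adjacency matrix
H = rng.normal(size=(4, 2))                 # H^{(k)}
W = rng.normal(size=(3, 2))                 # W_k
B = rng.normal(size=(3, 2))                 # B_k

D_inv = np.diag(1.0 / A.sum(axis=1))        # D^{-1}: 1/|N(v)| on the diagonal
A_tilde = D_inv @ A                         # A~ = D^{-1} A
H_next = np.maximum(0.0, A_tilde @ H @ W.T + H @ B.T)   # matrix-form update

# Check against the per-node definition of the same update
for v in range(4):
    neigh = np.where(A[v] > 0)[0]
    h_v = np.maximum(0.0, W @ H[neigh].mean(axis=0) + B @ H[v])
    assert np.allclose(H_next[v], h_v)
```

Row \(v\) of \(\mathbf{\tilde{A}} \mathbf{H}^{(k)}\) is exactly the neighbor average for node \(v\), which is why the two computations agree.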

4.3.2 Train A GNN

Node embedding \(\boldsymbol{z}_v\) is a function of input graph

Supervised setting : minimize the loss

\[\min_{\Theta} \mathcal{L}(\boldsymbol{y}, f(\boldsymbol{z}_v)) \]

  • \(\boldsymbol{y}\) : the node label

  • \(\mathcal{L}\) could be the L2 loss if \(\boldsymbol{y}\) is a real number,

    • or cross entropy if \(\boldsymbol{y}\) is categorical

Unsupervised setting :

  • No node label available

  • Use the graph structure as the supervision

4.3.3 Unsupervised Training

"Similar" nodes have similar embeddings

\[\mathcal{L} = \sum_{\boldsymbol{z}_{u},\, \boldsymbol{z}_{v}} \operatorname{CE} \left(\boldsymbol{y}_{u, v}, \operatorname{DEC}\left(\boldsymbol{z}_{u}, \boldsymbol{z}_{v}\right)\right) \]

  • where \(\boldsymbol{y}_{u, v} = 1\) when node \(u\) and \(v\) are similar

  • \(\text{CE}\) is the cross entropy

  • \(\text{DEC}\) is the decoder such as inner product

Node similarity can be anything from Lecture 3, e.g., a loss based on:

  • Random walks (node2vec, DeepWalk, struc2vec)

  • Matrix factorization

  • Node proximity in the graph
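A minimal sketch of this loss with an inner-product (sigmoid) decoder; the node pairs and labels here are made-up illustrations, where in practice they would come from, e.g., random-walk co-occurrence:

```python
import numpy as np

def unsup_loss(Z, pairs, labels):
    """Cross-entropy over node pairs: DEC(z_u, z_v) = sigmoid(z_u . z_v),
    y_{u,v} = 1 when nodes u and v are 'similar'."""
    eps = 1e-9                                   # numerical safety for log
    loss = 0.0
    for (u, v), y in zip(pairs, labels):
        p = 1.0 / (1.0 + np.exp(-Z[u] @ Z[v]))   # inner-product decoder
        loss += -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return loss / len(pairs)

rng = np.random.default_rng(3)
Z = rng.normal(size=(5, 8))          # embeddings for 5 nodes
pairs = [(0, 1), (2, 3), (0, 4)]     # hypothetical training pairs
labels = [1, 1, 0]                   # similar, similar, dissimilar
print(unsup_loss(Z, pairs, labels))
```

Minimizing this loss pushes inner products of similar pairs up and of dissimilar pairs down, which is the "similar nodes have similar embeddings" objective.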

4.3.4 Supervised Training

Directly train the model for a supervised task (e.g., node classification)

  • Use cross entropy loss

\[\mathcal{L} = - \sum_{v \in V} \Big( y_{v} \log \left[ \sigma (\boldsymbol{z}_{v}^{\top} \boldsymbol{\theta}) \right] + (1 - y_{v} ) \log \left[ 1 - \sigma (\boldsymbol{z}_{v}^{\top} \boldsymbol{\theta} )\right] \Big) \]

  • where \(y_{v}\) : the node class label

  • \(\boldsymbol{z}_{v}\) : encoder output: node embedding

  • \(\boldsymbol{\theta}\) : classification weights
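A direct transcription of this loss in NumPy (random embeddings, weights, and labels stand in for a trained encoder's output):

```python
import numpy as np

def node_classification_loss(Z, theta, y):
    """Binary cross-entropy over nodes, as in the formula above:
    p_v = sigmoid(z_v . theta)."""
    p = 1.0 / (1.0 + np.exp(-Z @ theta))   # predicted probability per node
    eps = 1e-9                             # numerical safety for log
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

rng = np.random.default_rng(4)
Z = rng.normal(size=(6, 4))       # node embeddings z_v
theta = rng.normal(size=4)        # classification weights
y = np.array([0, 1, 1, 0, 1, 0])  # binary node labels
print(node_classification_loss(Z, theta, y))
```

In end-to-end training, gradients flow through \(\boldsymbol{z}_v\) back into the GNN weights \(\mathbf{W}_k, \mathbf{B}_k\) as well as into \(\boldsymbol{\theta}\).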

4.4 Model Design

(1) Define a neighborhood aggregation function

(2) Define a loss function on the embeddings

(3) Train on a set of nodes, i.e., a batch of computation graphs

(4) Generate embeddings for nodes as needed

  • Even for nodes we never trained on

4.5 Inductive Capability

The same aggregation parameters are shared for all nodes:

  • The number of model parameters is sublinear in \(|V|\) and we can generalize to unseen nodes!

The same trained model can be applied to an entirely new graph, or generate embeddings for new nodes added to an existing graph.
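Because the weight matrices act on feature dimensions rather than on \(|V|\), the same parameters apply to graphs of any size; a small sketch (random weights stand in for trained ones):

```python
import numpy as np

def embed(A, X, W, B):
    # Mean aggregation + self transform + ReLU; parameters shared by all nodes
    deg = A.sum(axis=1, keepdims=True)
    return np.maximum(0.0, (A @ X) / deg @ W.T + X @ B.T)

rng = np.random.default_rng(5)
W = rng.normal(size=(3, 2))   # stand-in for trained W_k
B = rng.normal(size=(3, 2))   # stand-in for trained B_k

# The same W, B work on graphs of different sizes
A1 = np.array([[0, 1], [1, 0]], dtype=float)   # 2-node graph
A2 = np.ones((5, 5)) - np.eye(5)               # 5-node complete graph
X1 = rng.normal(size=(2, 2))
X2 = rng.normal(size=(5, 2))
print(embed(A1, X1, W, B).shape)  # (2, 3)
print(embed(A2, X2, W, B).shape)  # (5, 3)
```

This is the inductive setting that shallow embedding methods cannot handle: no per-node parameters exist, so unseen nodes need no retraining.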

5. GNNs subsume CNNs and Transformers

6. Summary

Basics of neural networks

  • Loss, Optimization, Gradient, SGD, non-linearity, MLP

Idea for Deep Learning for Graphs

  • Multiple layers of embedding transformation

  • At every layer, use the embedding at previous layer as the input

  • Aggregation of neighbors and self-embeddings

Graph Convolutional Network

  • Mean aggregation; can be expressed in matrix form

GNN is a general architecture

  • CNNs and Transformers can be viewed as special cases of GNNs
posted @ 2022-07-13 20:53 veager