Machine Learning with Graphs : 6 Graph Neural Networks

Notes on Stanford CS224W: Machine Learning with Graphs (Fall 2021), Lecture 6: Graph Neural Networks.

1. Overview

1.1 Node Embeddings

Limitations of shallow embedding methods:

  • \(\mathcal{O}(|V|)\) parameters are needed

  • No sharing of parameters between nodes

  • Every node has its own unique embedding

Inherently transductive:

  • Cannot generate embeddings for nodes that are not seen during training

Do not incorporate node features

  • Nodes in many graphs have features that we can and
    should leverage

1.2 Deep Graph Encoders

Graph neural networks (GNNs):

\[\text{ENC}(v) = \begin{array}{c} \text{multiple layers of non-linear transformations } \\ \text{based on graph structure} \end{array} \]

All these deep encoders can be combined with the node similarity functions defined in Lecture 3

1.2.1 Tasks on Networks

Tasks we will be able to solve:

  • Node classification

    • Predict the type of a given node
  • Link prediction

    • Predict whether two nodes are linked
  • Community detection

    • Identify densely linked clusters of nodes
  • Network similarity

    • How similar are two (sub)networks

2. Basics of Deep Learning

3. Deep Learning for Graphs

3.1 Setup

Assumption

  • \(G\) : a graph

  • \(V\) : vertex set

  • \(\boldsymbol{A}\) : adjacency matrix (assume binary)

  • \(\boldsymbol{X} \in \mathbb{R}^{m \times |V|}\) : a matrix of node features

  • \(v \in V\) : a node

  • \(N(v)\) : the set of neighbors of \(v\)

Node features :

  • Social networks: User profile, User image

  • Biological networks: Gene expression profiles, gene
    functional information

  • When there is no node feature in the graph dataset:

    • Indicator vectors (one-hot encoding of a node)

    • Vector of constant \(1: [1, 1, \cdots, 1]\)
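Both fallbacks are one line of NumPy; a minimal sketch (the 5-node size is arbitrary, for illustration):

```python
import numpy as np

num_nodes = 5

# One-hot (indicator) features: node i gets the i-th standard basis vector
X_onehot = np.eye(num_nodes)

# Constant features: every node gets the vector [1] (here 1-dimensional)
X_const = np.ones((num_nodes, 1))

print(X_onehot.shape)  # (5, 5)
print(X_const.shape)   # (5, 1)
```

One-hot features make the model memorize per-node information (and cannot transfer to new nodes), while constant features let the model rely purely on graph structure.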

3.2 Permutation Invariance and Equivariance

Graph representations should respect permutation invariance / equivariance: relabeling the nodes must not change the result.

For graph representation:

  • Consider learning a function \(f\) that maps a graph \(G = (\boldsymbol{A}, \boldsymbol{X})\) to a vector in \(\mathbb{R}^d\)

  • If \(f(\boldsymbol{A}, \boldsymbol{X}) = f(\boldsymbol{P}\boldsymbol{A}\boldsymbol{P}^{\top}, \boldsymbol{P}\boldsymbol{X})\) for any permutation matrix \(\boldsymbol{P}\), we say \(f\) is a permutation invariant function

For node representation:

  • Consider learning a function \(f\) that maps a graph \(G = (\boldsymbol{A}, \boldsymbol{X})\) to a matrix in \(\mathbb{R}^{m \times d}\)

    • The graph has \(m\) nodes; each row is the embedding of a node

  • If permuting the input nodes permutes the output rows in the same way, i.e., \(\boldsymbol{P} f(\boldsymbol{A}, \boldsymbol{X}) = f(\boldsymbol{P}\boldsymbol{A}\boldsymbol{P}^{\top}, \boldsymbol{P}\boldsymbol{X})\) for any permutation matrix \(\boldsymbol{P}\), we say \(f\) is a permutation equivariant function
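Both properties can be checked numerically on a toy graph; a minimal NumPy sketch (the graph, features, and function names are made-up illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 4                                   # number of nodes
A = rng.integers(0, 2, size=(m, m))     # toy adjacency matrix
A = np.triu(A, 1); A = A + A.T          # symmetric, zero diagonal
X = rng.normal(size=(m, 3))             # toy node features

P = np.eye(m)[[2, 0, 3, 1]]             # a permutation matrix

def f_invariant(A, X):
    # Sum-pool over all nodes: a graph-level output, order-independent
    return X.sum(axis=0)

def f_equivariant(A, X):
    # One step of neighbor aggregation: a node-level output
    return A @ X

# Invariance: f(PAP^T, PX) == f(A, X)
assert np.allclose(f_invariant(P @ A @ P.T, P @ X), f_invariant(A, X))
# Equivariance: f(PAP^T, PX) == P f(A, X)
assert np.allclose(f_equivariant(P @ A @ P.T, P @ X),
                   P @ f_equivariant(A, X))
```

The equivariance check works because \(\boldsymbol{P}^{\top}\boldsymbol{P} = \boldsymbol{I}\): permuting the adjacency matrix and features simply permutes the aggregated rows.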

3.3 Overview of Graph Neural Networks

Graph neural networks consist of multiple permutation equivariant / invariant functions

  • Other neural network architectures, e.g., MLPs, are not permutation invariant / equivariant

4. Graph Convolutional Networks

Idea : Node's neighborhood defines a computation graph

Learn how to propagate information across the graph to compute node features

  • Determine node computation graph

  • Propagate and transform information

4.1 Neighborhood aggregation

Key idea : Generate node embeddings based on local network neighborhoods

4.1.1 Aggregate Neighbors

Intuition :

  • Nodes aggregate information from their neighbors using neural networks

  • Network neighborhood defines a computation graph

Deep Model : Model can be of arbitrary depth :

  • Nodes have embeddings at each layer

  • Layer-0 embedding of node \(v\) is its input feature, \(x_v\)

  • Layer-\(k\) embedding gets information from nodes that are \(k\) hops away

4.1.2 Neighborhood Aggregation

Neighborhood aggregation : Key distinctions are in how different approaches aggregate information across the layers

Basic approach : Average information from neighbors and apply a neural network

(1) average messages from neighbors

(2) apply neural network

4.2 The Math: Deep Encoder

Basic approach : Average neighbor messages and apply a neural network

\[\begin{align*} & \boldsymbol{h}_v^{(0)} = \boldsymbol{x}_v \\ & \boldsymbol{h}_v^{(k+1)} = \sigma \left( \boldsymbol{W}_k \cdot \sum_{u \in N(v)} \frac{\boldsymbol{h}_u^{(k)}}{|N(v)|} + \boldsymbol{B}_k \cdot \boldsymbol{h}_v^{(k)} \right), \quad \forall k \in \{0, 1, \cdots, K-1 \} \\ & \boldsymbol{z}_v = \boldsymbol{h}_v^{(K)} \end{align*} \]

  • where \(\boldsymbol{h}_v^{(0)} = \boldsymbol{x}_v\) : the initial (layer-0) embedding is the node's input features

  • \(\boldsymbol{z}_v = \boldsymbol{h}_v^{(K)}\) is the embedding after \(K\) layers of neighborhood aggregation

  • \(\sigma(\cdot)\) is a non-linear activation function (e.g., ReLU)

  • \(\sum \limits_{u \in N(v)} \frac{\boldsymbol{h}_u^{(k)}}{|N(v)|}\) : the average of the neighbors' previous-layer embeddings

    Notice summation is a permutation invariant pooling / aggregation.

  • \(\boldsymbol{h}_v^{(k)}\) is embedding of \(v\) at layer \(k\)

  • \(K\) is the total number of layers
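The update above can be sketched per layer in NumPy (dense matrices, ReLU as \(\sigma\); the triangle graph and dimensions are illustrative):

```python
import numpy as np

def gcn_layer(A, H, W, B):
    """One layer of the update above: average neighbor messages, transform
    with W, add the B-transformed self-embedding, apply ReLU.
    Assumes every node has at least one neighbor, so |N(v)| > 0."""
    deg = A.sum(axis=1, keepdims=True)      # |N(v)| for each node v
    neigh_avg = (A @ H) / deg               # sum_{u in N(v)} h_u / |N(v)|
    return np.maximum(0.0, neigh_avg @ W.T + H @ B.T)   # sigma = ReLU

# Toy example: triangle graph, 2-dim input features, 3-dim hidden layer
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)
rng = np.random.default_rng(1)
H0 = rng.normal(size=(3, 2))                # h_v^{(0)} = x_v
W0 = rng.normal(size=(3, 2))                # W_k, shape (d_out, d_in)
B0 = rng.normal(size=(3, 2))                # B_k, shape (d_out, d_in)
H1 = gcn_layer(A, H0, W0, B0)
print(H1.shape)  # (3, 3)
```

Stacking \(K\) such calls (with layer-specific `W`, `B`) gives the final embeddings \(\boldsymbol{z}_v = \boldsymbol{h}_v^{(K)}\).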

4.2.1 Equivariant Property

Equivariant Property: Message passing and neighbor aggregation in graph convolution networks is permutation equivariant.

4.3 Training the Model

\(\boldsymbol{W}_k\) and \(\boldsymbol{B}_k\) are trainable weight matrices (i.e., what we learn)

We can feed these embeddings into any loss function and run SGD to train the weight parameters

  • \(\boldsymbol{h}_v^{k}\) : the hidden representation of node \(v\) at layer \(k\)

  • \(\boldsymbol{W}_k\) : weight matrix for neighborhood aggregation

  • \(\boldsymbol{B}_k\) : weight matrix for transforming hidden vector of self

4.3.1 Matrix Formulation

Many aggregations can be performed efficiently by (sparse) matrix operations

  • Let \(\mathbf{H}^{(k)} = \left[\boldsymbol{h}_1^{(k)}, \cdots, \boldsymbol{h}_{|V|}^{(k)} \right]^{\top}\), the matrix of hidden embeddings

  • \(\sum \limits_{u \in N(v)} \boldsymbol{h}_u^{(k)} = \boldsymbol{A}_{v \, \cdot} \, \mathbf{H}^{(k)}\)

  • \(\mathbf{D}\) is a diagonal matrix : \(\mathbf{D}_{vv} = \text{Deg}(v) = |N(v)|\)

    • \(\text{Deg}(v)\) is the degree of node \(v\) ; a.k.a. \(|N(v)|\), the number of neighbors of the node \(v\).

    • The inverse \(\mathbf{D}^{-1}\) is also diagonal : \((\mathbf{D}^{-1})_{vv} = 1 / |N(v)|\)

Therefore, we have:

\[\sum_{u \in N(v)} \frac{\boldsymbol{h}_u^{(k)}}{|N(v)|} \qquad \Rightarrow \qquad \mathbf{H}^{(k+1)} = \mathbf{D}^{-1} \mathbf{A} \mathbf{H}^{(k)} \]

Re-writing update function in matrix form:

\[\mathbf{H}^{(k+1)} = \sigma \big( \ \ \underbrace{ \mathbf{\tilde{A}} \mathbf{H}^{(k)} \mathbf{W}_k^{\top}}_{ \begin{array}{cc}\text{neighborhood} \\ \text{aggregation} \end{array}} \ \ + \underbrace{\mathbf{H}^{(k)} \mathbf{B}_k^{\top}}_{\begin{array}{cc}\text{self} \\ \text{transformation} \end{array}} \big) \]

  • where \(\mathbf{\tilde{A}} = \mathbf{D}^{-1} \mathbf{A}\)

In practice, this implies that efficient sparse matrix multiplication can be used (\(\mathbf{\tilde{A}}\) is sparse)

Note : not all GNNs can be expressed in this matrix form, e.g., when the aggregation function is complex.
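As a sanity check, the matrix form can be verified against the per-node definition; a NumPy sketch (dense matrices for readability, where a real implementation would use a sparse format such as CSR):

```python
import numpy as np

rng = np.random.default_rng(2)
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [1, 1, 1, 0]], dtype=float)   # toy adjacency matrix
H = rng.normal(size=(4, 2))                 # H^{(k)}
W = rng.normal(size=(3, 2))                 # W_k
B = rng.normal(size=(3, 2))                 # B_k

D_inv = np.diag(1.0 / A.sum(axis=1))        # D^{-1}: 1/|N(v)| on the diagonal
A_tilde = D_inv @ A                         # A~ = D^{-1} A
H_next = np.maximum(0.0, A_tilde @ H @ W.T + H @ B.T)   # matrix-form update

# Check against the per-node definition of the same update
for v in range(4):
    neigh = np.where(A[v] > 0)[0]
    h_v = np.maximum(0.0, W @ H[neigh].mean(axis=0) + B @ H[v])
    assert np.allclose(H_next[v], h_v)
```

Row \(v\) of \(\mathbf{\tilde{A}} \mathbf{H}^{(k)}\) is exactly the neighbor average for node \(v\), which is why the two computations agree.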

4.3.2 Train A GNN

Node embedding \(\boldsymbol{z}_v\) is a function of input graph

Supervised setting : minimize the loss

\[\min_{\Theta} \mathcal{L}(\boldsymbol{y}, f(\boldsymbol{z}_v)) \]

  • \(\boldsymbol{y}\) : the node label

  • \(\mathcal{L}\) could be the L2 loss if \(\boldsymbol{y}\) is a real number,

    • or cross entropy if \(\boldsymbol{y}\) is categorical

Unsupervised setting :

  • No node label available

  • Use the graph structure as the supervision

4.3.3 Unsupervised Training

"Similar" nodes have similar embeddings

\[\mathcal{L} = \sum_{\boldsymbol{z}_{u},\, \boldsymbol{z}_{v}} \operatorname{CE} \left(\boldsymbol{y}_{u, v}, \operatorname{DEC}\left(\boldsymbol{z}_{u}, \boldsymbol{z}_{v}\right)\right) \]

  • where \(\boldsymbol{y}_{u, v} = 1\) when node \(u\) and \(v\) are similar

  • \(\text{CE}\) is the cross entropy

  • \(\text{DEC}\) is the decoder such as inner product

Node similarity can be anything from Lecture 3, e.g., a loss based on:

  • Random walks (node2vec, DeepWalk, struc2vec)

  • Matrix factorization

  • Node proximity in the graph
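A minimal sketch of this loss with an inner-product (sigmoid) decoder; the node pairs and labels here are made-up illustrations, where in practice they would come from, e.g., random-walk co-occurrence:

```python
import numpy as np

def unsup_loss(Z, pairs, labels):
    """Cross-entropy over node pairs: DEC(z_u, z_v) = sigmoid(z_u . z_v),
    y_{u,v} = 1 when nodes u and v are 'similar'."""
    eps = 1e-9                                   # numerical safety for log
    loss = 0.0
    for (u, v), y in zip(pairs, labels):
        p = 1.0 / (1.0 + np.exp(-Z[u] @ Z[v]))   # inner-product decoder
        loss += -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return loss / len(pairs)

rng = np.random.default_rng(3)
Z = rng.normal(size=(5, 8))          # embeddings for 5 nodes
pairs = [(0, 1), (2, 3), (0, 4)]     # hypothetical training pairs
labels = [1, 1, 0]                   # similar, similar, dissimilar
print(unsup_loss(Z, pairs, labels))
```

Minimizing this loss pushes inner products of similar pairs up and of dissimilar pairs down, which is the "similar nodes have similar embeddings" objective.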

4.3.4 Supervised Training

Directly train the model for a supervised task (e.g., node classification)

  • Use cross entropy loss

\[\mathcal{L} = - \sum_{v \in V} \Big( y_{v} \log \left[ \sigma (\boldsymbol{z}_{v}^{\top} \boldsymbol{\theta}) \right] + (1 - y_{v} ) \log \left[ 1 - \sigma (\boldsymbol{z}_{v}^{\top} \boldsymbol{\theta} )\right] \Big) \]

  • where \(y_{v}\) : the node class label

  • \(\boldsymbol{z}_{v}\) : encoder output: node embedding

  • \(\boldsymbol{\theta}\) : classification weights
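A direct transcription of this loss in NumPy (random embeddings, weights, and labels stand in for a trained encoder's output):

```python
import numpy as np

def node_classification_loss(Z, theta, y):
    """Binary cross-entropy over nodes, as in the formula above:
    p_v = sigmoid(z_v . theta)."""
    p = 1.0 / (1.0 + np.exp(-Z @ theta))   # predicted probability per node
    eps = 1e-9                             # numerical safety for log
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

rng = np.random.default_rng(4)
Z = rng.normal(size=(6, 4))       # node embeddings z_v
theta = rng.normal(size=4)        # classification weights
y = np.array([0, 1, 1, 0, 1, 0])  # binary node labels
print(node_classification_loss(Z, theta, y))
```

In end-to-end training, gradients flow through \(\boldsymbol{z}_v\) back into the GNN weights \(\mathbf{W}_k, \mathbf{B}_k\) as well as into \(\boldsymbol{\theta}\).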

4.4 Model Design

(1) Define a neighborhood aggregation function

(2) Define a loss function on the embeddings

(3) Train on a set of nodes, i.e., a batch of computation graphs

(4) Generate embeddings for nodes as needed

  • Even for nodes we never trained on

4.5 Inductive Capability

The same aggregation parameters are shared for all nodes:

  • The number of model parameters is sublinear in \(|V|\) and we can generalize to unseen nodes!

The same trained model can be applied to an entirely new graph, or generate embeddings for new nodes added to an existing graph.
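Because the weight matrices act on feature dimensions rather than on \(|V|\), the same parameters apply to graphs of any size; a small sketch (random weights stand in for trained ones):

```python
import numpy as np

def embed(A, X, W, B):
    # Mean aggregation + self transform + ReLU; parameters shared by all nodes
    deg = A.sum(axis=1, keepdims=True)
    return np.maximum(0.0, (A @ X) / deg @ W.T + X @ B.T)

rng = np.random.default_rng(5)
W = rng.normal(size=(3, 2))   # stand-in for trained W_k
B = rng.normal(size=(3, 2))   # stand-in for trained B_k

# The same W, B work on graphs of different sizes
A1 = np.array([[0, 1], [1, 0]], dtype=float)   # 2-node graph
A2 = np.ones((5, 5)) - np.eye(5)               # 5-node complete graph
X1 = rng.normal(size=(2, 2))
X2 = rng.normal(size=(5, 2))
print(embed(A1, X1, W, B).shape)  # (2, 3)
print(embed(A2, X2, W, B).shape)  # (5, 3)
```

This is the inductive setting that shallow embedding methods cannot handle: no per-node parameters exist, so unseen nodes need no retraining.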

5. GNNs subsume CNNs and Transformers

6. Summary

Basics of neural networks

  • Loss, Optimization, Gradient, SGD, non-linearity, MLP

Idea for Deep Learning for Graphs

  • Multiple layers of embedding transformation

  • At every layer, use the embedding at previous layer as the input

  • Aggregation of neighbors and self-embeddings

Graph Convolutional Network

  • Mean aggregation; can be expressed in matrix form

GNN is a general architecture

  • CNNs and Transformers can be viewed as special cases of GNNs
posted @ 2022-07-13 20:53 veager