Machine Learning with Graphs: 6 Graph Neural Networks
Stanford CS224W: Machine Learning with Graphs, Fall 2021

6 Graph Neural Networks
1. Overview
1.1 Node Embeddings
Limitations of shallow embedding methods:
- \(\mathcal{O}(|V|)\) parameters are needed
  - No sharing of parameters between nodes
  - Every node has its own unique embedding
- Inherently "transductive"
  - Cannot generate embeddings for nodes that are not seen during training
- Do not incorporate node features
  - Nodes in many graphs have features that we can and should leverage
1.2 Deep Graph Encoders
Graph neural networks (GNNs): the encoder \(\text{ENC}(v)\) consists of multiple layers of non-linear transformations based on the graph structure.
All these deep encoders can be combined with the node similarity functions defined in Lecture 3.
1.2.1 Tasks on Networks
Tasks we will be able to solve:
- Node classification
  - Predict the type of a given node
- Link prediction
  - Predict whether two nodes are linked
- Community detection
  - Identify densely linked clusters of nodes
- Network similarity
  - How similar are two (sub)networks
2. Basics of Deep Learning
3. Deep Learning for Graphs
3.1 Setup
Assumption:
- \(G\): a graph
- \(V\): vertex set
- \(\boldsymbol{A}\): adjacency matrix (assume binary)
- \(\boldsymbol{X} \in \mathbb{R}^{m \times |V|}\): a matrix of node features
- \(v \in V\): a node
- \(N(v)\): the set of neighbors of \(v\)
Node features:
- Social networks: user profile, user image
- Biological networks: gene expression profiles, gene functional information
- When there is no node feature in the graph dataset:
  - Indicator vectors (one-hot encoding of a node)
  - Vector of constant 1: \([1, 1, \cdots, 1]\)
3.2 Permutation Invariance
Graph representations should be permutation invariant (for graph-level outputs) or permutation equivariant (for node-level outputs), since a graph has no canonical node ordering.
For graph representation:
- Consider learning a function \(f\) that maps a graph \(G = (\boldsymbol{A}, \boldsymbol{X})\) to a vector in \(\mathbb{R}^d\). If \(f\) produces the same output for any pair of order plans \(i\) and \(j\), i.e., \(f(\boldsymbol{A}_i, \boldsymbol{X}_i) = f(\boldsymbol{A}_j, \boldsymbol{X}_j)\), we say \(f\) is a permutation invariant function.
For node representation:
- Consider learning a function \(f\) that maps a graph \(G = (\boldsymbol{A}, \boldsymbol{X})\) to a matrix in \(\mathbb{R}^{m \times d}\)
- The graph has \(m\) nodes, and each row is the embedding of a node
- Similarly, if permuting the node order permutes the output rows in the same way for any pair of order plans \(i\) and \(j\), we say \(f\) is a permutation equivariant function.
3.3 Overview of Graph Neural Networks
Graph neural networks consist of multiple permutation equivariant / invariant functions
- Other neural network architectures, e.g., MLPs, are not permutation invariant / equivariant
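To make these two properties concrete, here is a small NumPy sketch; the toy graph, features, and the two functions are invented for illustration. It checks that sum-pooling over node features is permutation invariant, while mean neighbor aggregation is permutation equivariant:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy graph: 4 nodes, adjacency matrix A and feature matrix X.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = rng.normal(size=(4, 3))

# A random permutation matrix P (a change of node ordering).
P = np.eye(4)[rng.permutation(4)]

def f_invariant(A, X):
    # Graph-level function: sum-pool node features into one vector.
    return X.sum(axis=0)

def f_equivariant(A, X):
    # Node-level function: average of each node's neighbor features.
    deg = A.sum(axis=1, keepdims=True)
    return (A @ X) / deg

# Permuting the node order leaves the graph-level output unchanged ...
assert np.allclose(f_invariant(A, X), f_invariant(P @ A @ P.T, P @ X))
# ... and permutes the node-level output rows in exactly the same way.
assert np.allclose(P @ f_equivariant(A, X), f_equivariant(P @ A @ P.T, P @ X))
print("invariance and equivariance hold")
```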
4. Graph Convolutional Networks
Idea: a node's neighborhood defines a computation graph.
Learn how to propagate information across the graph to compute node features:
- Determine the node's computation graph
- Propagate and transform information
4.1 Neighborhood aggregation
Key idea : Generate node embeddings based on local network neighborhoods
4.1.1 Aggregate Neighbors
Intuition:
- Nodes aggregate information from their neighbors using neural networks
- The network neighborhood defines a computation graph
Deep model: the model can be of arbitrary depth:
- Nodes have embeddings at each layer
- The layer-0 embedding of node \(v\) is its input feature \(\boldsymbol{x}_v\)
- The layer-\(k\) embedding gets information from nodes that are \(k\) hops away
4.1.2 Neighborhood Aggregation
Neighborhood aggregation: the key distinctions between approaches lie in how they aggregate information across the layers.
Basic approach: average information from neighbors and apply a neural network:
(1) Average messages from neighbors
(2) Apply a neural network
4.2 The Math: Deep Encoder
Basic approach: average neighbor messages and apply a neural network:
\[
\boldsymbol{h}_v^{(0)} = \boldsymbol{x}_v, \qquad
\boldsymbol{h}_v^{(k+1)} = \sigma\left(\boldsymbol{W}_k \sum_{u \in N(v)} \frac{\boldsymbol{h}_u^{(k)}}{|N(v)|} + \boldsymbol{B}_k \boldsymbol{h}_v^{(k)}\right), \quad \forall k \in \{0, \cdots, K-1\}, \qquad \boldsymbol{z}_v = \boldsymbol{h}_v^{(K)}
\]
- \(\boldsymbol{h}_v^{(0)} = \boldsymbol{x}_v\): the initial 0-th layer embeddings are equal to the node features
- \(\boldsymbol{z}_v = \boldsymbol{h}_v^{(K)}\): the embedding after \(K\) layers of neighborhood aggregation
- \(\sigma(\cdot)\): a non-linear activation function (e.g., ReLU)
- \(\sum_{u \in N(v)} \frac{\boldsymbol{h}_u^{(k)}}{|N(v)|}\): the average of the neighbors' previous-layer embeddings
  - Notice that summation is a permutation invariant pooling / aggregation
- \(\boldsymbol{h}_v^{(k)}\): the embedding of \(v\) at layer \(k\)
- \(K\): the total number of layers
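The update rule above can be sketched per node in NumPy. This is a minimal illustration, not a trained model: the toy adjacency list, the random weight matrices standing in for \(\boldsymbol{W}_k\) and \(\boldsymbol{B}_k\), and the dimensions are all invented placeholders.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy graph: adjacency list for 4 nodes and 3-dimensional input features.
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
X = rng.normal(size=(4, 3))

d = 3   # embedding dimension (kept equal to the input dim for simplicity)
K = 2   # number of layers
# Trainable parameters W_k and B_k (here: random placeholders).
Ws = [rng.normal(size=(d, d)) for _ in range(K)]
Bs = [rng.normal(size=(d, d)) for _ in range(K)]

relu = lambda z: np.maximum(z, 0)

H = X.copy()  # h_v^{(0)} = x_v
for k in range(K):
    H_next = np.zeros_like(H)
    for v in range(4):
        # Average of the neighbors' previous-layer embeddings.
        msg = np.mean([H[u] for u in neighbors[v]], axis=0)
        # h_v^{(k+1)} = sigma(W_k * avg + B_k * h_v^{(k)})
        H_next[v] = relu(Ws[k] @ msg + Bs[k] @ H[v])
    H = H_next

Z = H  # z_v = h_v^{(K)}
print(Z.shape)  # (4, 3)
```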
4.2.1 Equivariant Property
Equivariant property: message passing and neighbor aggregation in graph convolutional networks is permutation equivariant.
4.3 Training the Model
\(\boldsymbol{W}_k\) and \(\boldsymbol{B}_k\) are the trainable weight matrices (i.e., what we learn).
We can feed these embeddings into any loss function and run SGD to train the weight parameters:
- \(\boldsymbol{h}_v^{(k)}\): the hidden representation of node \(v\) at layer \(k\)
- \(\boldsymbol{W}_k\): weight matrix for neighborhood aggregation
- \(\boldsymbol{B}_k\): weight matrix for transforming the hidden vector of the node itself
4.3.1 Matrix Formulation
Many aggregations can be performed efficiently by (sparse) matrix operations:
- Let \(\mathbf{H}^{(k)} = \left[\boldsymbol{h}_1^{(k)}, \cdots, \boldsymbol{h}_{|V|}^{(k)} \right]^{\top}\) be the matrix of hidden embeddings
- Then \(\sum_{u \in N(v)} \boldsymbol{h}_u^{(k)} = \boldsymbol{A}_{v,:} \, \mathbf{H}^{(k)}\)
- \(\mathbf{D}\) is a diagonal matrix: \(\mathbf{D}_{vv} = \text{Deg}(v) = |N(v)|\)
  - \(\text{Deg}(v)\) is the degree of node \(v\), i.e., \(|N(v)|\), the number of neighbors of node \(v\)
- The inverse \(\mathbf{D}^{-1}\) is also diagonal: \(\mathbf{D}^{-1}_{vv} = 1 / |N(v)|\)
- Therefore, the neighborhood average becomes \(\mathbf{D}^{-1} \boldsymbol{A} \mathbf{H}^{(k)}\), i.e., \(\left[\mathbf{D}^{-1} \boldsymbol{A} \mathbf{H}^{(k)}\right]_{v} = \sum_{u \in N(v)} \frac{\boldsymbol{h}_u^{(k)}}{|N(v)|}\)
Re-writing the update function in matrix form:
\[
\mathbf{H}^{(k+1)} = \sigma\left(\mathbf{\tilde{A}} \mathbf{H}^{(k)} \boldsymbol{W}_k^{\top} + \mathbf{H}^{(k)} \boldsymbol{B}_k^{\top}\right)
\]
- where \(\mathbf{\tilde{A}} = \mathbf{D}^{-1} \boldsymbol{A}\)
In practice, this implies that efficient sparse matrix multiplication can be used (\(\mathbf{\tilde{A}}\) is sparse).
Note: not all GNNs can be expressed in matrix form, e.g., when the aggregation function is complex.
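A small NumPy check that the matrix form \(\sigma(\mathbf{\tilde{A}} \mathbf{H} \boldsymbol{W}^{\top} + \mathbf{H} \boldsymbol{B}^{\top})\) reproduces the per-node averaging update; the toy graph and the random weight matrices are illustrative placeholders only.

```python
import numpy as np

rng = np.random.default_rng(1)

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 3))
B = rng.normal(size=(3, 3))

D_inv = np.diag(1.0 / A.sum(axis=1))  # D^{-1}: diagonal with 1/|N(v)|
A_tilde = D_inv @ A                   # Ã = D^{-1} A

relu = lambda z: np.maximum(z, 0)

# Matrix form: H^{(k+1)} = sigma(Ã H^{(k)} W^T + H^{(k)} B^T)
H_matrix = relu(A_tilde @ H @ W.T + H @ B.T)

# Per-node form, for comparison.
H_loop = np.zeros_like(H)
for v in range(4):
    nbrs = np.nonzero(A[v])[0]
    avg = H[nbrs].mean(axis=0)
    H_loop[v] = relu(W @ avg + B @ H[v])

assert np.allclose(H_matrix, H_loop)
print("matrix form matches per-node form")
```

In real implementations `A_tilde` would be stored as a sparse matrix (e.g., CSR), which is what makes the matrix form efficient on large graphs.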
4.3.2 Train a GNN
Node embedding \(\boldsymbol{z}_v\) is a function of the input graph.
Supervised setting: minimize the loss \(\min_{\Theta} \mathcal{L}\left(\boldsymbol{y}, f(\boldsymbol{z}_v)\right)\)
- \(\boldsymbol{y}\): node label
- \(\mathcal{L}\) could be the L2 loss if \(\boldsymbol{y}\) is a real number, or cross entropy if \(\boldsymbol{y}\) is categorical
Unsupervised setting:
- No node labels available
- Use the graph structure as the supervision
4.3.3 Unsupervised Training
"Similar" nodes have similar embeddings:
\[
\mathcal{L} = \sum_{\boldsymbol{z}_u, \boldsymbol{z}_v} \text{CE}\left(y_{u,v}, \text{DEC}(\boldsymbol{z}_u, \boldsymbol{z}_v)\right)
\]
- where \(y_{u,v} = 1\) when nodes \(u\) and \(v\) are similar
- \(\text{CE}\) is the cross entropy
- \(\text{DEC}\) is the decoder, such as the inner product
Node similarity can be anything from Lecture 3, e.g., a loss based on:
- Random walks (node2vec, DeepWalk, struc2vec)
- Matrix factorization
- Node proximity in the graph
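A minimal NumPy sketch of this unsupervised objective, assuming an inner-product decoder squashed through a sigmoid and binary similarity labels; the embeddings and labels below are random placeholders rather than the output of a trained encoder.

```python
import numpy as np

rng = np.random.default_rng(2)

# Embeddings z_v for 5 nodes (in practice, the GNN output).
Z = rng.normal(size=(5, 4))
# Similarity labels y_{u,v}: 1 if u and v are "similar"
# (e.g., co-occur on random walks); random here.
Y = (rng.random((5, 5)) > 0.5).astype(float)

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# DEC(z_u, z_v): inner-product decoder mapped to a probability.
probs = sigmoid(Z @ Z.T)

# CE(y, p): binary cross-entropy, summed over node pairs.
eps = 1e-12  # numerical guard against log(0)
loss = -np.sum(Y * np.log(probs + eps) + (1 - Y) * np.log(1 - probs + eps))
print(float(loss) > 0.0)  # True
```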
4.3.4 Supervised Training
Directly train the model for a supervised task (e.g., node classification).
Use the cross entropy loss (binary case shown):
\[
\mathcal{L} = -\sum_{v \in V} y_v \log\left(\sigma(\boldsymbol{z}_v^{\top} \boldsymbol{\theta})\right) + (1 - y_v) \log\left(1 - \sigma(\boldsymbol{z}_v^{\top} \boldsymbol{\theta})\right)
\]
- \(y_v\): the node class label
- \(\boldsymbol{z}_v\): the encoder output (node embedding)
- \(\boldsymbol{\theta}\): the classification weights
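The binary cross-entropy objective above can be sketched in a few lines of NumPy; the embeddings, labels, and \(\boldsymbol{\theta}\) are random placeholders standing in for trained values.

```python
import numpy as np

rng = np.random.default_rng(3)

# Node embeddings z_v from the encoder and binary labels y_v.
Z = rng.normal(size=(6, 4))
y = np.array([0, 1, 1, 0, 0, 1])

# Classification weights theta (a linear head on the embeddings).
theta = rng.normal(size=4)

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
p = sigmoid(Z @ theta)  # predicted probability of class 1 per node

# Cross-entropy loss summed over all labeled nodes.
loss = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
print(float(loss) > 0.0)  # True
```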
4.4 Model Design
(1) Define a neighborhood aggregation function
(2) Define a loss function on the embeddings
(3) Train on a set of nodes, i.e., a batch of computation graphs
(4) Generate embeddings for nodes as needed
- Even for nodes we never trained on
4.5 Inductive Capability
The same aggregation parameters are shared for all nodes:
- The number of model parameters is sublinear in \(|V|\) and we can generalize to unseen nodes!
This inductive capability applies both to entirely new graphs and to new nodes added to an existing graph.
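A short NumPy sketch illustrating the point: the parameter shapes depend only on the embedding dimension, never on \(|V|\), so the same weight matrices apply unchanged to graphs of different sizes. The graphs and weights below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

d = 3
W = rng.normal(size=(d, d))  # shared aggregation weights
B = rng.normal(size=(d, d))  # shared self-transform weights
relu = lambda z: np.maximum(z, 0)

def gcn_layer(A, H):
    # One mean-aggregation layer; W and B do not depend on |V|.
    A_tilde = np.diag(1.0 / A.sum(axis=1)) @ A
    return relu(A_tilde @ H @ W.T + H @ B.T)

# Train-time graph with 4 nodes ...
A_train = np.array([[0, 1, 1, 0],
                    [1, 0, 1, 0],
                    [1, 1, 0, 1],
                    [0, 0, 1, 0]], dtype=float)
# ... and a new, unseen 6-node graph (a cycle).
A_new = np.roll(np.eye(6), 1, axis=1) + np.roll(np.eye(6), -1, axis=1)

H4 = gcn_layer(A_train, rng.normal(size=(4, d)))
H6 = gcn_layer(A_new, rng.normal(size=(6, d)))
print(H4.shape, H6.shape)  # (4, 3) (6, 3)
```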
5. GNNs subsume CNNs and Transformers
6. Summary
Basics of neural networks
- Loss, optimization, gradient, SGD, non-linearity, MLP
Idea for deep learning on graphs
- Multiple layers of embedding transformation
- At every layer, use the embedding from the previous layer as the input
- Aggregation of neighbor and self embeddings
Graph Convolutional Network
- Mean aggregation; can be expressed in matrix form
GNN is a general architecture
- CNNs and Transformers can be viewed as special GNNs
