Machine Learning with Graphs: 9 Theory of Graph Neural Networks

Stanford CS224W: Machine Learning with Graphs, Fall 2021, Lecture 9: Theory of Graph Neural Networks.

1. Setting up GNN Prediction Tasks

1.1 GNN Training Pipelines

1.2 Dataset Split: Fixed / Random Split

Fixed split: We will split our dataset once

  • Training set: used for optimizing GNN parameters

  • Validation set: develop model/hyperparameters

  • Test set: held out until we report final performance

A concern: sometimes we cannot guarantee that the test set will really be held out

Random split: we will randomly split our dataset into training / validation / test

  • We report average performance over different random seeds

1.3 Splitting Graphs is Special

Suppose we want to split an image dataset

  • Image classification: Each data point is an image

  • Here data points are independent

    • Image 5 will not affect our prediction on image 1

Suppose a graph dataset with 6 nodes:

  • Training dataset: node 1 and node 2

  • Validation dataset: node 3 and node 4

  • Test dataset: node 5 and node 6

Splitting a graph dataset is different

  • Node classification: Each data point is a node

  • Here data points are NOT independent

    • Node 5 will affect our prediction on node 1, because it participates in message passing and thus affects node 1's embedding

1.3.1 Solution 1: Transductive setting

Solution 1 (Transductive setting): The input graph can be observed in all the dataset splits (training, validation and test set).

  • We will only split the (node) labels

    • At training time, we compute embeddings using the entire graph, and train using node 1&2's labels

    • At validation time, we compute embeddings using the entire graph, and evaluate on node 3&4's labels

1.3.2 Solution 2: Inductive setting

Solution 2 (Inductive setting): We break the edges between splits to get multiple graphs

  • Now we have 3 graphs that are independent. Node 5 will
    not affect our prediction on node 1 any more

  • At training time, we compute embeddings using the graph over node 1&2, and train using node 1&2's labels

  • At validation time, we compute embeddings using the graph over node 3&4, and evaluate on node 3&4's labels

1.3.3 Transductive vs. Inductive Settings

Transductive setting: training / validation / test sets are on the same graph

  • The dataset consists of one graph

  • The entire graph can be observed in all dataset splits, we only split the labels

  • Only applicable to node / edge prediction tasks

Inductive setting: training / validation / test sets are on different graphs

  • The dataset consists of multiple graphs

  • Each split can only observe the graph(s) within the split. A successful model should generalize to unseen graphs

  • Applicable to node / edge / graph tasks
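The contrast above can be sketched in plain Python. This is a minimal illustration using the hypothetical 6-node example from earlier (the edge list is an assumption made up for the sketch, not from the lecture):

```python
# Toy 6-node graph (hypothetical edges; nodes 1&2 / 3&4 / 5&6 are the
# train / validation / test nodes from the example above).
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1)]

# Transductive: the whole graph is visible in every split; only labels split.
transductive = {
    "graph": edges,
    "train_labels": [1, 2],
    "valid_labels": [3, 4],
    "test_labels": [5, 6],
}

# Inductive: break the edges between splits to get independent subgraphs.
def induced_subgraph(keep):
    return [(u, v) for (u, v) in edges if u in keep and v in keep]

inductive = {
    "train": induced_subgraph({1, 2}),
    "valid": induced_subgraph({3, 4}),
    "test": induced_subgraph({5, 6}),
}
```

Note how the inductive split discards the edges that cross split boundaries, so node 5 can no longer influence node 1's embedding.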

1.4 Example

1.4.1 Node Classification

Transductive node classification:

  • All the splits can observe the entire graph structure, but
    can only observe the labels of their respective nodes

Inductive node classification:

  • Suppose we have a dataset of 3 graphs

  • Each split contains an independent graph

1.4.2 Graph Classification

Only the inductive setting is well defined for graph classification

  • Because we have to test on unseen graphs

  • Suppose we have a dataset of 5 graphs. Each split will contain independent graph(s)

1.4.3 Link Prediction

Goal of link prediction: predict missing edges

Setting up link prediction is tricky:

  • Link prediction is an unsupervised / self-supervised task. We need to create the labels and dataset splits on our own

  • Concretely, we need to hide some edges from the GNN and let the GNN predict whether those edges exist

For link prediction, we will split edges twice

Step 1: Assign 2 types of edges in the original graph

  • Message edges: Used for GNN message passing

  • Supervision edges: Used for computing objectives

  • After step 1:

    • Only message edges will remain in the graph

    • Supervision edges are used as supervision for the edge predictions made by the model, and will not be fed into the GNN!

Step 2: Split edges into train / validation / test

Option 1: Inductive link prediction split

  • Suppose we have a dataset of 3 graphs. Each inductive split will contain an independent graph

  • In train or valid or test set, each graph will have 2 types of edges: message edges + supervision edges

    • Supervision edges are not the input to GNN

Option 2: Transductive link prediction split:

  • This is the default setting when people talk about link prediction

  • Suppose we have a dataset of 1 graph

  • By definition of "transductive", the entire graph can be observed in all dataset splits

    • But since edges are both part of the graph structure and the supervision, we need to hold out validation / test edges

    • To train on the training set, we further need to hold out supervision edges within the training set

  • Details (refer to the original lecture notes):

    • At training time: Use training message edges to predict training supervision edges

    • At validation time: Use training message edges & training supervision edges to predict validation edges

    • At test time: Use training message edges & training
      supervision edges & validation edges to predict test edges
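The growing set of visible edges across splits can be sketched as follows. This is a plain-Python illustration under assumed split fractions and a made-up edge list (real pipelines such as PyTorch Geometric handle this differently in the details):

```python
import random

# Hypothetical edge list of a single graph.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (1, 3), (2, 4), (0, 2)]

random.seed(0)
shuffled = edges[:]
random.shuffle(shuffled)

# Split edges into four disjoint groups (fractions are illustrative).
n = len(shuffled)
train_message     = shuffled[: n // 2]             # GNN input at training time
train_supervision = shuffled[n // 2 : 3 * n // 4]  # training objective only
valid_edges       = shuffled[3 * n // 4 : 7 * n // 8]
test_edges        = shuffled[7 * n // 8 :]

# At validation time, the GNN may also see the training supervision edges;
# at test time, it may additionally see the validation edges.
valid_input = train_message + train_supervision
test_input  = valid_input + valid_edges
```

The key property is that edges used as supervision at one stage become part of the GNN input at the next stage, so no split ever leaks its own supervision edges into its own input.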

Summary: Transductive link prediction split

  • Note: Link prediction settings are tricky and complex. You may find that papers do link prediction differently, but if you follow the reasoning steps above, this is a sound way to implement link prediction

Implementation resources :

  • DeepSNAP provides core modules for this pipeline

  • GraphGym further implements the full pipeline to facilitate GNN design

2. When Things Don't Go As Planned

2.1 General Tips

Data preprocessing is important :

  • Node attributes can vary a lot!

    • E.g. a probability ranges over (0, 1), but some inputs could have a much larger range, say (−1000, 1000)

  • Use normalization

Optimizer :

  • Adam is relatively robust to the learning rate

Activation function :

  • ReLU activation function often works well

  • Other alternatives: LeakyReLU, Swish, rational activation functions

  • No activation function at your output layer

Bias term :

  • Include a bias term in every layer

Embedding dimensions :

  • 32, 64 and 128 are often good starting points

2.2 Debugging Deep Networks

Debug issues : Loss / accuracy not converging during training

  • Check pipeline (e.g. in PyTorch we need zero_grad)

  • Adjust hyperparameters such as learning rate

  • Pay attention to weight parameter initialization

Important for model development:

  • Overfit on (part of) training data:

    • With a small training dataset, loss should be essentially close
      to 0, with an expressive neural network

    • If neural network cannot overfit a single data point, something
      is wrong

  • Scrutinize the loss function

  • Scrutinize visualizations
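The "overfit a single data point" check above can be demonstrated with the simplest possible model. This is a hypothetical plain-Python sketch (a one-parameter linear model instead of a GNN) of the sanity check: if even this loop cannot drive the loss to ~0, the training pipeline is broken:

```python
# Sanity check: an expressive model should drive the loss to ~0 on one example.
# Gradient descent on y = w * x with a single training point (toy setup).
x, y = 2.0, 6.0          # one training example; the true w is 3
w = 0.0
lr = 0.1
for _ in range(100):
    pred = w * x
    grad = 2 * (pred - y) * x   # d/dw of the squared error (w*x - y)^2
    w -= lr * grad              # in PyTorch, forgetting zero_grad breaks this step
loss = (w * x - y) ** 2
# If loss is not near 0 here, something in the pipeline is wrong.
```

The same principle applies to a GNN: train on a handful of nodes or graphs and confirm the loss collapses before scaling up.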

2.3 Resources on Graph Neural Networks

PyTorch

3. How Expressive are Graph Neural Networks

3.1 Idea Aggregate Neighbors

Key idea: Generate node embeddings based on local network neighborhoods

Intuition: Nodes aggregate information from their neighbors using neural networks

3.2 Theory of GNNs

How powerful are GNNs?

  • Many GNN models have been proposed (e.g., GCN, GAT, GraphSAGE, Design Space).

  • What is the expressive power (ability to distinguish different graph structures) of these GNN models?

  • How to design a maximally expressive GNN model?

Different GNN models use different neural networks in the aggregation step

  • GCN: mean-pool (Kipf and Welling, ICLR 2017)

    • Element-wise mean pooling + linear transformation + ReLU non-linearity

  • GraphSAGE: max-pool (Hamilton et al., NeurIPS 2017)

    • MLP + element-wise max pooling

3.3 Local Neighborhood Structures

See pages 35-39 of the lecture 9 notes

3.4 Computational Graph

See pages 40-52 of the lecture 9 notes

4. Designing the Most Powerful Graph Neural Network

4.1 Expressive Power of GNNs

Key observation: Expressive power of GNNs can be characterized by that of neighbor aggregation functions they use.

  • A more expressive aggregation function leads to a more expressive GNN.

  • An injective aggregation function (a one-to-one function) leads to the most expressive GNN.

4.2 Neighbor Aggregation

Observation: Neighbor aggregation can be abstracted as a function over a multi-set (a set with repeating elements).

Aggregation functions of two popular GNN models:

4.2.1 Case Study: GCN

GCN: mean-pool (Kipf and Welling ICLR 2017)

Uses element-wise mean pooling over neighboring node features

\[\text{Mean} \left( \{ x_u \}_{u \in N(v)} \right) \]

Followed by a linear function and ReLU activation, i.e., \(\max(0, x)\)

Theorem (Xu et al. ICLR 2019): GCN's aggregation function cannot distinguish different multi-sets with the same color proportion.
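The failure case in the theorem can be verified numerically. In this sketch, colors are represented as one-hot vectors (a hypothetical 2-color example):

```python
# Mean pooling cannot distinguish multi-sets with the same color proportions.
yellow, blue = (1.0, 0.0), (0.0, 1.0)

def mean_pool(multiset):
    n = len(multiset)
    return tuple(sum(x[i] for x in multiset) / n for i in range(2))

a = [yellow, blue]                    # proportions 1/2, 1/2
b = [yellow, yellow, blue, blue]      # same proportions, different multi-set
# Both pool to (0.5, 0.5): mean aggregation is not injective.
```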

4.2.2 Case Study: GraphSAGE

GraphSAGE: max-pool (Hamilton et al. NeurIPS 2017)

Uses element-wise max pooling over neighboring node features

\[\text{Max} \left( \{ x_u \}_{u \in N(v)} \right) \]

Theorem (Xu et al. ICLR 2019): GraphSAGE's aggregation function cannot distinguish different multi-sets with the same set of distinct colors.
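Again, the failure case can be checked directly with one-hot colors (same hypothetical 2-color setup as before):

```python
# Max pooling cannot distinguish multi-sets with the same set of distinct colors.
yellow, blue = (1.0, 0.0), (0.0, 1.0)

def max_pool(multiset):
    return tuple(max(x[i] for x in multiset) for i in range(2))

a = [yellow, blue]
b = [yellow, yellow, blue]   # same distinct colors, different counts
# Both pool to (1.0, 1.0): max aggregation is not injective.
```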

4.2.3 Summary

Expressive power of GNNs can be characterized by that of the neighbor aggregation function.

Neighbor aggregation is a function over multi-sets (sets with repeating elements)

GCN and GraphSAGE's aggregation functions fail to distinguish some basic multi-sets; hence not injective.

Therefore, GCN and GraphSAGE are not maximally powerful GNNs.

4.3 Designing Most Expressive GNNs

Our goal : Design maximally powerful GNNs in the class of message-passing GNNs.

This can be achieved by designing an injective neighbor aggregation function over multi-sets

4.3.1 Injective Multi-Set Function

Theorem (Xu et al. ICLR 2019): Any injective multi-set function can be expressed as:

\[\Phi \left( \sum_{x \in S} f(x) \right) \]

  • where \(\Phi(\cdot)\) is some non-linear function

  • \(\sum \limits_{x \in S}\) represents summation over multi-set

  • \(f(x)\) represents some non-linear function

Proof Intuition (Xu et al. ICLR 2019) : \(f\) produces one-hot encodings of colors. Summation of the one-hot encodings retains all the information about the input multi-set.
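The proof intuition can be checked numerically: summing one-hot vectors yields the per-color counts, which uniquely identify the multi-set (hypothetical 2-color example):

```python
# Sum over one-hot encodings retains the count of each color,
# so different multi-sets map to different sums.
yellow, blue = (1.0, 0.0), (0.0, 1.0)

def sum_pool(multiset):
    return tuple(sum(x[i] for x in multiset) for i in range(2))

a = [yellow, blue]
b = [yellow, yellow, blue, blue]
c = [yellow, yellow, blue]
# The three multi-sets map to three distinct sums: (1, 1), (2, 2), (2, 1).
```

Contrast this with the mean- and max-pooling examples above, where distinct multi-sets collided.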

Use a Multi-Layer Perceptron (MLP) to model \(\Phi\) and \(f\) in \(\Phi \Big(\sum \limits_{x \in S} f(x) \Big)\)

Then, we have arrived at a neural network that can model any injective multiset function.

\[\operatorname{MLP}_{\Phi} \left( \sum_{x \in S} \operatorname{MLP}_{f}(x) \right) \]

In practice, MLP hidden dimensionality of 100 to 500 is sufficient.

4.3.2 Universal Approximation Theorem

Theorem: Universal Approximation Theorem (Hornik et al., 1989)

1-hidden-layer MLP with sufficiently-large hidden dimensionality and appropriate non-linearity \(\sigma(\cdot)\) (including ReLU and sigmoid) can approximate any continuous function to an arbitrary accuracy.

4.4 Most Expressive GNN: Graph Isomorphism Network (GIN)

Graph Isomorphism Network (GIN) (Xu et al. ICLR 2019)

  • Apply an MLP, element-wise sum, followed by another MLP.

\[\operatorname{MLP}_{\Phi} \left( \sum_{x \in S} \operatorname{MLP}_{f}(x) \right) \]

Theorem (Xu et al. ICLR 2019): GIN's neighbor aggregation function is injective.

  • No failure cases

  • GIN is THE most expressive GNN in the class of message-passing GNNs!

The full model of GIN is obtained by relating it to the Weisfeiler-Lehman (WL) graph kernel (a traditional way of obtaining graph-level features).

GIN is a "neural network" version of the WL graph kernel.

4.4.1 Relation to WL Graph Kernel

Color refinement algorithm in WL kernel

Given : A graph \(G\) with a set of nodes \(V\)

  • Assign an initial color \(c^{(0)}(v)\) to each node \(v\)

  • Iteratively refine node colors by

    \[c^{(k+1)}(v) = \text{HASH} \left( c^{(k)}(v), \left\{ c^{(k)}(u)\right\}_{u \in N(v)} \right) \]

    where HASH maps different inputs to different colors.

  • After \(K\) steps of color refinement, \(c^{(K)}(v)\) summarizes the structure of the \(K\)-hop neighborhood

4.4.2 Color Refinement

The WL test considers two graphs isomorphic if they end up with the same multiset of colors; if the color multisets differ, the graphs are certainly not isomorphic.
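The color refinement algorithm is short enough to implement directly. This is a minimal sketch (the two test graphs are hypothetical; Python's tuple comparison stands in for the HASH function):

```python
# Minimal WL color refinement on adjacency-list graphs.
def wl_colors(adj, k):
    """adj: {node: [neighbors]}; returns the multiset of colors after k rounds."""
    color = {v: 0 for v in adj}                     # uniform initial color
    for _ in range(k):
        # "HASH" the (own color, sorted multiset of neighbor colors) tuple
        signature = {v: (color[v], tuple(sorted(color[u] for u in adj[v])))
                     for v in adj}
        palette = {sig: i for i, sig in enumerate(sorted(set(signature.values())))}
        color = {v: palette[signature[v]] for v in adj}
    return sorted(color.values())                   # multiset of final colors

triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}        # 3-cycle
path = {0: [1], 1: [0, 2], 2: [1]}                  # 3-node path
# The color multisets differ, so WL distinguishes these two graphs.
```

In the triangle every node keeps the same color (all signatures are identical), while in the path the middle node acquires a distinct color, so the final color multisets differ.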

4.4.3 The Complete GIN Model

GIN uses a neural network to model the injective HASH function.

\[c^{(k+1)}(v) = \text{HASH} \left( c^{(k)}(v), \ \left\{ c^{(k)}(u)\right\}_{u \in N(v)} \right) \]

Specifically, we will model the injective function over the tuple:

\[\Big( \underbrace{c^{(k)}(v)}_{ \begin{array}{c} \text{Root} \\ \text{node feature} \end{array} }, \ \underbrace{\left\{ c^{(k)}(u) \right\}_{u \in N(v)}}_{ \begin{array}{c} \text{Neighboring} \\ \text{node feature} \end{array} } \Big) \]

Theorem (Xu et al. ICLR 2019): Any injective function over the tuple \(\Big(c^{(k)}(v), \{ c^{(k)}(u) \}_{u \in N(v)} \Big)\) can be modeled as:

\[\operatorname{MLP}_{\Phi} \left( (1+\epsilon) \cdot \operatorname{MLP}_{f} \left(c^{(k)}(v) \right) + \sum_{u \in N(v)} \operatorname{MLP}_{f}\left(c^{(k)}(u) \right) \right) \]

  • where \(\epsilon\) is a learnable scalar.

If input feature \(c^{(0)}(v)\) is represented as one-hot, direct summation is injective. We only need \(\Phi\) to ensure the injectivity.

\[\text{GINConv} \left( c^{(k)}(v), \ \left\{ c^{(k)}(u) \right\}_{u \in N(v)} \right) = \text{MLP}_{\Phi} \left( (1+\epsilon) \cdot c^{(k)}(v) + \sum_{u \in N(v)} c^{(k)}(u) \right) \]

  • where the \(\text{MLP}_{\Phi}\) operator can provide "one-hot"-like input features for the next layer
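The sum inside GINConv can be sketched numerically. This toy version assumes one-hot input features and omits \(\text{MLP}_{\Phi}\), since on one-hots the weighted sum alone is already injective:

```python
# Numeric sketch of the GINConv aggregation (MLP omitted; hypothetical example).
def gin_aggregate(c_v, neighbors, eps=0.0):
    dim = len(c_v)
    return tuple((1 + eps) * c_v[i] + sum(c_u[i] for c_u in neighbors)
                 for i in range(dim))

yellow, blue = (1.0, 0.0), (0.0, 1.0)
gin_aggregate(yellow, [blue, blue])   # -> (1.0, 2.0)
# Unlike mean or max pooling, the sum distinguishes [blue] from [blue, blue].
```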

4.4.4 GIN's node embedding updates

Given: A graph \(G\) with a set of nodes \(V\)

  • Assign an initial vector \(c^{(0)}(v)\) to each node \(v\)

  • Iteratively update node vectors by

\[c^{(k+1)}(v) = \operatorname{GINConv}\left( c^{(k)}(v), \ \left\{c^{(k)}(u)\right\}_{u \in N(v)} \right) \]

  • where \(\text{GINConv}\) maps different inputs to different embeddings.

  • After \(K\) steps of GIN iterations, \(c^{(K)}(v)\) summarizes the structure of \(K\)-hop neighborhood.

4.4.5 GIN and WL Graph Kernel

GIN can be understood as differentiable neural version of the WL graph Kernel:

|  | Update target | Update function |
| --- | --- | --- |
| WL Graph Kernel | Node colors (one-hot) | HASH |
| GIN | Node embeddings (low-dim vectors) | GINConv |

Advantages of GIN over the WL graph kernel are:

  • Node embeddings are low-dimensional; hence, they can capture the fine-grained similarity of different nodes.

  • Parameters of the update function can be learned for the downstream tasks.

4.4.6 Expressive Power of GIN

Because of the relation between GIN and the WL graph kernel, their expressive power is exactly the same.

  • If two graphs can be distinguished by GIN, they can also be distinguished by the WL kernel, and vice versa.

How powerful is this?

  • WL kernel has been both theoretically and empirically shown to distinguish most of the real-world graphs (Cai et al. 1992).

  • Hence, GIN is also powerful enough to distinguish
    most of the real graphs!

4.5 Improving GNNs' Power

There are basic graph structures that the existing GNN framework cannot distinguish, such as cycles of different lengths.

GNNs' expressive power can be improved to resolve the above problem. (You et al. AAAI 2021, Li et al. NeurIPS 2020)

5. Summary

GNNs and their connection to injective functions on multi-sets.

The most powerful message-passing GNN is equivalent in power to the WL graph isomorphism test.

GIN is the most powerful GNN.

  • The sum aggregator is more powerful than the mean aggregator, which is more powerful than the max aggregator.

References

Xu, Keyulu, et al. "How Powerful Are Graph Neural Networks?" ICLR 2019. arXiv:1810.00826.

posted @ 2022-07-17 19:41  veager