An Introduction to Graph Neural Networks

Node Embeddings

Map nodes to low-dimensional embeddings.

Goal: encode nodes so that similarity in the embedding space approximates similarity in the original network, i.e., \(similarity(u, v) \approx z_u^T z_v\).

Two Key Components

  • Encoder maps each node to a low-dimensional vector.
  • Similarity function specifies how relationships in embedding space map to relationships in the original network.

From "Shallow" to "Deep"

Shallow encoders are simply embedding lookups: each node's one-hot indicator vector selects its own embedding column (a minimal sketch follows the limitations below).

Limitations of shallow encoding:

  • \(O(\vert V \vert )\) parameters are needed: there is no parameter sharing and every node has its own unique embedding vector.
  • Inherently "transductive": It is impossible to generate embeddings for nodes that were not seen during training.
  • Do not incorporate node features: Many graphs have features that we can and should leverage.
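
To make these components concrete, here is a minimal NumPy sketch of a shallow encoder (the names `num_nodes` and `embed_dim` are illustrative): the encoder is a pure embedding-matrix lookup, and the similarity function is the dot product \(z_u^T z_v\). It exhibits exactly the limitations above: one trainable column per node, hence \(O(\vert V \vert)\) parameters and no way to embed an unseen node.

```python
import numpy as np

num_nodes, embed_dim = 5, 3
rng = np.random.default_rng(0)

# Shallow encoder: one trainable embedding column per node,
# i.e., O(|V|) parameters with no sharing across nodes.
Z = rng.normal(size=(embed_dim, num_nodes))

def encode(v):
    # ENC(v) is just a lookup: Z times the one-hot indicator of node v.
    return Z[:, v]

def similarity(u, v):
    # Similarity in embedding space: the dot product z_u^T z_v.
    return encode(u) @ encode(v)

print(similarity(0, 1))  # works only for the 5 nodes present at training time
```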

We will now discuss "deeper" methods based on graph neural networks, where \(ENC(v)\) is a complex function that depends on the graph structure. In general, all of these more complex encoders can be combined with the similarity functions discussed above.

The Basics: Graph Neural Networks

Setup

Assume we have a graph \(G\) (a toy construction is sketched after this list):

  • \(V\) is the vertex set.
  • \(A\) is the adjacency matrix (assume binary).
  • \(X \in \mathbb{R}^{m \times \vert V \vert}\) is a matrix of node features (one column per node):
    • Categorical attributes, text, image data, e.g., profile information in a social network.
    • Node degrees, clustering coefficients, etc.
    • Indicator vectors (i.e., one-hot encoding of each node).
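
A toy (hypothetical) construction of this setup in NumPy, stacking node degrees on top of one-hot indicator vectors to form the feature matrix \(X\):

```python
import numpy as np

# Toy graph: 4 nodes, undirected edges (0-1, 0-2, 1-2, 2-3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
A = np.zeros((4, 4))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0              # binary, symmetric adjacency

degrees = A.sum(axis=0, keepdims=True)   # 1 x |V| structural feature
one_hot = np.eye(4)                      # |V| x |V| indicator vectors
X = np.vstack([degrees, one_hot])        # m x |V|, here m = 1 + |V| = 5
```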

Neighborhood Aggregation

Key idea: Generate node embeddings based on local neighborhoods.

Intuition

  • Nodes aggregate information from their neighbors using neural networks.

  • Network neighborhood defines a computation graph.
  • Every node defines a unique computation graph.
  • Nodes have embeddings at each layer.
  • The model can be of arbitrary depth.
  • The "layer-0" embedding of node \(v\) is its input feature vector, i.e., \(x_v\).

Neighborhood aggregation can be viewed as a center-surround filter.

Neighborhood aggregation is mathematically related to spectral graph convolutions.
The key distinction between approaches is in how they aggregate information across the layers (the aggregation functions sitting between layers).

Methods

Basic approach:
Average neighbor information and apply a neural network.

Math: Expression (1)
\(h_v^0 = x_v\)
\(h_v^k = \sigma \left( W_k \sum_{u \in N(v)} \frac{h_u^{k-1}}{\vert N(v) \vert} + B_k h_v^{k-1} \right), \forall k > 0\)
Here \(\sigma\) is a nonlinearity (e.g., ReLU), and the final-layer output \(z_v = h_v^K\) is the node's embedding.
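
A minimal NumPy sketch of Expression (1), assuming \(\sigma\) is ReLU, the column-per-node feature layout from the Setup, and illustrative choices for the hidden width and weight initialization:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy graph and features as in the Setup sketch above.
A = np.zeros((4, 4))
for u, v in [(0, 1), (0, 2), (1, 2), (2, 3)]:
    A[u, v] = A[v, u] = 1.0
X = np.vstack([A.sum(axis=0, keepdims=True), np.eye(4)])  # m x |V|

def gnn_forward(A, X, Ws, Bs):
    """K layers of Expression (1) with sigma = ReLU."""
    deg = A.sum(axis=0)                    # |N(v)| for each node
    H = X                                  # layer-0 embeddings: h_v^0 = x_v
    for W, B in zip(Ws, Bs):
        neighbor_mean = (H @ A) / deg      # column v: mean of h_u over u in N(v)
        H = np.maximum(0.0, W @ neighbor_mean + B @ H)
    return H                               # columns are the embeddings z_v = h_v^K

K, hidden = 2, 8
dims = [X.shape[0]] + [hidden] * K
Ws = [rng.normal(scale=0.1, size=(dims[k + 1], dims[k])) for k in range(K)]
Bs = [rng.normal(scale=0.1, size=(dims[k + 1], dims[k])) for k in range(K)]
Z = gnn_forward(A, X, Ws, Bs)              # one embedding column per node
```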

Training the Model

What should we learn?

  • In Expression (1), \(W_k\) and \(B_k\) are the trainable weight matrices.
  • After \(K\) layers of neighborhood aggregation, we get output embeddings for each node.

Loss function

  • We can feed these embeddings into any loss function and run stochastic gradient descent to train the aggregation parameters.
  • Train in an unsupervised manner using only the graph structure.
  • The unsupervised loss function can be based on (see the sketch after this list):
    • Random walks (node2vec, DeepWalk)
    • Graph factorization
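
As one hedged example of such an objective, a graph-factorization-style loss pushes the dot-product similarity \(z_u^T z_v\) toward the adjacency entry \(A_{uv}\); this sketches only the idea, not the actual random-walk objectives of node2vec or DeepWalk:

```python
import numpy as np

def factorization_loss(Z, A):
    # Z: d x |V| embeddings (columns), A: |V| x |V| adjacency.
    # Penalize (z_u^T z_v - A_uv)^2 summed over all node pairs.
    S = Z.T @ Z
    return np.sum((S - A) ** 2)
```

In practice, the gradients of such a loss with respect to \(W_k, B_k\) would be computed by an autodiff framework and minimized with stochastic gradient descent, as described above.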

Alternative

  • Directly train the model for a supervised task (e.g., node classification); a minimal sketch follows.
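
A minimal sketch of such a supervised objective, assuming softmax cross-entropy over node classes and a hypothetical trainable classifier head `W_out` (neither is specified above):

```python
import numpy as np

def cross_entropy_loss(Z, labels, W_out):
    # Z: d x |V| node embeddings, labels: int class per node,
    # W_out: num_classes x d classifier head (trained jointly).
    logits = W_out @ Z                                   # num_classes x |V|
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
    # Mean negative log-likelihood of each node's true class.
    return -np.mean(np.log(probs[labels, np.arange(Z.shape[1])]))
```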

Overview of Model Design

  1. Define a neighborhood aggregation function, i.e., a neural network.
  2. Define a loss function on the embeddings, \(L(z_u)\).
  3. Train on a set of nodes, i.e., a batch of computation graphs. Note that not all nodes need to be used for training!
  4. Generate embeddings for all nodes, even for nodes we never trained on.

Inductive Capability

  • The same aggregation parameters are shared for all nodes.

  • The number of model parameters is sublinear in \(\vert V \vert\), and we can generalize to unseen nodes.

  • Inductive node embedding \(\Rightarrow\) we can generalize to entirely unseen graphs, e.g., train on a protein interaction graph from model organism A and generate embeddings for newly collected data about organism B (see the sketch below).
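
A short sketch of this inductive reuse, calling the `gnn_forward` function from the Expression (1) sketch above; the degree-only features and the "pretend already trained" weights are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Pretend Ws, Bs were already trained on organism A. For inductive transfer
# the input features must not depend on |V|, so we assume degree-only
# features (m = 1) rather than one-hot indicators.
K, hidden = 2, 8
dims = [1] + [hidden] * K
Ws = [rng.normal(scale=0.1, size=(dims[k + 1], dims[k])) for k in range(K)]
Bs = [rng.normal(scale=0.1, size=(dims[k + 1], dims[k])) for k in range(K)]

# Entirely unseen graph (organism B): 3 nodes, 2 edges.
A_new = np.array([[0., 1., 1.],
                  [1., 0., 0.],
                  [1., 0., 0.]])
X_new = A_new.sum(axis=0, keepdims=True)   # degree feature, 1 x |V|
Z_new = gnn_forward(A_new, X_new, Ws, Bs)  # same parameters, no retraining
```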