Machine Learning with Graphs : 3 Node Embeddings

1. Overview

Graph Representation Learning alleviates the need to do feature engineering every single time.

  • Representation Learning: Automatically learn the features

Goal: Efficient task-independent feature learning for machine learning with graphs!

Task: Map nodes into an embedding space

  • Similarity of embeddings between nodes indicates their similarity in the network. For example, two nodes that are close to each other (connected by an edge) should have similar embeddings.
  • Encode network information

  • Potentially used for many downstream predictions

2. Encoder and Decoder

Assumptions:

  • \(G\) : a graph

  • \(V\) : vertex set

  • \(\mathbf{A}\) : adjacency matrix (assume binary)

  • For simplicity: No node features or extra information is used

2.1 Embedding Nodes

Goal is to encode nodes so that similarity in the embedding space (e.g., dot product) approximates similarity in the graph, i.e.:

\[\text{similarity}(u,v) \approx \boldsymbol{z}_u^{\top} \boldsymbol{z}_v \]

  • where \(\text{similarity}(u,v)\) is a function measuring the similarity between node \(u\) and \(v\) in the original network.

  • \(\text{ENC}(u) = \boldsymbol{z}_u, \forall u \in V\), where \(\text{ENC}\) is the encoding function

2.1.1 Learning Node Embeddings

  • Encoder maps from nodes to embeddings

  • Define a node similarity function (i.e., a measure of similarity in the original network)

  • Decoder \(\text{DEC}\) maps from embeddings to the similarity score

  • Optimize the parameters of the encoder so that:

    \[\text{similarity}(u,v) \approx \boldsymbol{z}_u^{\top} \boldsymbol{z}_v \]

2.1.2 Two Key Components

Encoder : maps each node to a low-dimensional vector

\[\text{ENC}(v) = \boldsymbol{z}_v \]

  • where: \(v\) is the node in the input graph

  • \(\boldsymbol{z}_v\) is a \(d\)-dimensional embedding

Similarity function : specifies how the relationships in vector space map to the relationships in the original network

\[\text{similarity}(u,v) \approx \boldsymbol{z}_u^{\top} \boldsymbol{z}_v \]

  • where: \(\text{similarity}(u,v)\) represents similarity of \(u\) and \(v\) in the original network.

  • \(\boldsymbol{z}_u^{\top} \boldsymbol{z}_v\) is the dot product between node embeddings

2.2 "Shallow" Encoding

Simplest encoding approach: Encoder is just an embedding-lookup

\[\text{ENC}(v) = \boldsymbol{z}_v = \mathbf{Z} \cdot \boldsymbol{v} \]

  • where \(\mathbf{Z} \in \mathbb{R}^{d \times |V|}\) is a matrix whose columns are node embeddings (what we learn / optimize)

  • \(\boldsymbol{v} \in \mathbb{I}^{|V|}\) is an indicator vector: all zeroes except a one in the position corresponding to node \(v\)

Each node is assigned a unique embedding vector (i.e., we directly optimize the embedding of each node)

  • Many methods: DeepWalk, node2vec
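The embedding lookup above can be made concrete in a few lines of NumPy. This is a minimal sketch; the graph size, dimension, and random matrix `Z` are toy assumptions, not part of any particular method:

```python
import numpy as np

rng = np.random.default_rng(0)
num_nodes, d = 5, 3          # |V| = 5 nodes, d = 3 dimensions (toy values)

# Z in R^{d x |V|}: each column is one node's (learnable) embedding.
Z = rng.normal(size=(d, num_nodes))

def encode(v):
    """Shallow encoder: ENC(v) = Z @ one_hot(v), i.e. a column lookup."""
    one_hot = np.zeros(num_nodes)
    one_hot[v] = 1.0
    return Z @ one_hot

# Multiplying by an indicator vector is just column selection:
assert np.allclose(encode(2), Z[:, 2])
```

In practice the one-hot multiplication is never materialized; libraries store \(\mathbf{Z}\) and index a column (or row) directly, which is exactly what an embedding-lookup layer does.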

2.3 Framework

Encoder + Decoder Framework

  • Shallow encoder: embedding lookup

  • Parameters to optimize: \(\mathbf{Z}\), which contains the node embeddings \(\boldsymbol{z}_u\) for all nodes \(u \in V\)

  • More expressive alternative: deep encoders (GNNs)

  • Decoder : based on node similarity.

  • Objective : maximize \(\boldsymbol{z}_v^{\top} \boldsymbol{z}_u\) for node pairs \((u,v)\) that are similar

2.4 Define Node Similarity

The key design choice across methods is how they define node similarity.

Should two nodes have a similar embedding if they ...

  • are linked?

  • share neighbors?

  • have similar "structural roles"?

2.5 Note on Node Embeddings

This is an unsupervised/self-supervised way of learning node embeddings

  • We are not utilizing node labels

  • We are not utilizing node features

  • The goal is to directly estimate a set of coordinates (i.e., the embedding) of a node so that some aspect of the network structure (captured by DEC) is preserved.

These embeddings are task independent

  • They are not trained for a specific task but can be used for any task.

3. Random Walk Approaches for Node Embeddings

Notation:

  • Vector \(\boldsymbol{z}_u\) : the embedding of node \(u\) (what we aim to find)

  • Probability \(P(v|\boldsymbol{z}_u)\) : The (predicted) probability of visiting node \(v\) on random walks starting from node \(u\).

    • The generated model prediction based on \(\boldsymbol{z}_u\)

Non-linear functions used to produce predicted probabilities

  • Softmax function:

    • Turns vector of \(K\) real values (model predictions) into \(K\) probabilities that sum to 1.

      \[\sigma(\boldsymbol{z})[i] = \frac{\exp(\boldsymbol{z}[i])}{\sum \limits_{j=1}^{K} \exp(\boldsymbol{z}[j])} \]

  • Sigmoid function:

    • \(S\)-shaped function that turns real values into the range of (0, 1). Written as

      \[S(x) = \frac{1}{1+\exp(-x)} \]
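Both functions are easy to sketch in NumPy (a toy implementation for illustration; the max-subtraction in the softmax is a standard numerical-stability trick, not part of the formula above):

```python
import numpy as np

def softmax(z):
    """Turn K real values into K probabilities that sum to 1."""
    e = np.exp(z - np.max(z))   # subtract max for numerical stability
    return e / e.sum()

def sigmoid(x):
    """S-shaped function mapping real values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

p = softmax(np.array([2.0, 1.0, 0.1]))
assert np.isclose(p.sum(), 1.0)
assert np.isclose(sigmoid(0.0), 0.5)
```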

3.1 Random-Walk Embeddings

Random Walk : Given a graph and a starting point, we select a neighbor of it at random, and move to this neighbor; then we select a neighbor of this point at random, and move to it, etc.

The (random) sequence of points visited this way is a random walk on the graph.

\(\boldsymbol{z}_v^{\top} \boldsymbol{z}_u \approx\) probability that \(u\) and \(v\) co-occur on a random walk over the graph
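A uniform random walk of the kind described above can be sketched as follows (the adjacency-list graph here is a made-up toy example):

```python
import random

def random_walk(adj, start, length, seed=None):
    """Uniform random walk: repeatedly hop to a randomly chosen neighbor."""
    rng = random.Random(seed)
    walk = [start]
    for _ in range(length):
        neighbors = adj[walk[-1]]
        if not neighbors:          # dead end: stop early
            break
        walk.append(rng.choice(neighbors))
    return walk

# Toy graph as an adjacency list (assumed for illustration).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
walk = random_walk(adj, start=0, length=5, seed=42)
print(walk)  # a node sequence of length 6, starting at node 0
```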

3.1.1 Procedures

Step 1 : Estimate probability of visiting node \(v\) on a random walk starting from node \(u\) using some random walk strategy \(R\) : \(P_R(v|u)\)

Step 2 : Optimize embeddings to encode these random walk statistics, i.e., so that \(\boldsymbol{z}_u^{\top} \boldsymbol{z}_v \propto P_R(v|u)\)

  • Similarity in embedding space (here: dot product, i.e., \(\cos(\theta)\) for unit-norm embeddings) encodes random walk "similarity".

3.1.2 Advantages of Random Walks

Expressivity : Flexible stochastic definition of node similarity that incorporates both local and higher-order neighborhood information

  • Idea: if random walk starting from node \(u\) visits \(v\) with high probability, \(u\) and \(v\) are similar (high-order multi-hop information)

Efficiency : Do not need to consider all node pairs when training;

  • only need to consider pairs that co-occur on random walks.

3.2 Unsupervised Feature Learning

Intuition : Find embedding of nodes in \(d\)-dimensional space that preserves similarity.

Idea : Learn node embedding such that nearby nodes are close together in the network

Given a node \(u\), define the nearby nodes:

  • \(N_R(u)\) : neighbourhood of \(u\) obtained by some random walk strategy \(R\)

3.2.1 Feature Learning as Optimization

Given \(G= (V, E)\)

Our goal is to learn a mapping \(f : V \to \mathbb{R}^d\) with \(f(u) = \boldsymbol{z}_u\)

Log-likelihood objective:

\[\max _{f} \sum_{u \in V} \log \mathrm{P} \left(N_{\mathrm{R}}(u) \mid \mathbf{z}_{u} \right) \]

  • where \(N_R(u)\) is the neighborhood of node \(u\) obtained by strategy \(R\)

Given node \(u\), we want to learn feature representations that are predictive of the nodes in its random walk neighborhood \(N_R(u)\).

3.3 Random Walk Optimization

  • Step 1 : Run short fixed-length random walks starting from each node \(u\) in the graph using some random walk strategy \(R\)

  • Step 2 : For each node \(u\) collect \(N_R(u)\), the multi-set (\(N_R(u)\) can have repeat elements since nodes can be visited multiple times on random walks) of nodes visited on random walks starting from \(u\)

  • Step 3 : Optimize embeddings according to: Given node \(u\), predict its neighbors \(N_R(u)\).

Maximum likelihood objective:

\[\max _{f} \sum_{u \in V} \log \mathrm{P} \left(N_{\mathrm{R}}(u) \mid \mathbf{z}_{u} \right) \]

or equivalently,

\[\mathcal{L} = \sum_{u \in V} \sum_{v \in N_{R}(u)}-\log \left(P\left(v \mid \mathbf{z}_{u}\right)\right) \]

Intuition : Optimize embeddings \(\boldsymbol{z}_u\) to maximize the likelihood of random walk co-occurrences.

Parameterize \(\mathrm{P}(v|\boldsymbol{z}_u)\) using softmax:

\[\Pr \left(v \mid \mathbf{z}_{u}\right) = \frac{\exp (\mathbf{z}_{u}^{\top} \mathbf{z}_{v})}{ \sum \limits_{n \in V} \exp (\mathbf{z}_{u}^{\top} \mathbf{z}_{n})} \]

Why softmax: We want node \(v\) to be most similar to node \(u\) (out of all nodes \(n\)). Intuition: \(\sum \limits_i \exp(x_i) \approx \max \limits_i \exp(x_i)\)

Putting it all together:

\[\mathcal{L} = \sum_{u \in V} \sum_{v \in N_{R}(u)} - \log \left[ \frac{\exp (\mathbf{z}_{u}^{\top} \mathbf{z}_{v})}{ \sum \limits_{n \in V} \exp (\mathbf{z}_{u}^{\top} \mathbf{z}_{n}) } \right] \]

  • where \(\mathrm{P} (v \mid \mathbf{z}_{u}) = \frac{\exp (\mathbf{z}_{u}^{\top} \mathbf{z}_{v})}{ \sum_{n \in V} \exp(\mathbf{z}_{u}^{\top} \mathbf{z}_{n})}\) is the predicted probability of \(v\) co-occurring on random walks starting from \(u\)

  • \(\sum \limits_{v \in N_{R}(u)}\) sum over nodes \(v\) seen on random walks starting from \(u\)

  • \(\sum \limits_{u \in V}\) sum over all nodes \(u \in V\)

\[\text{Optimizing random walk embeddings} = \text{Finding embeddings $\boldsymbol{z}_u$ that minimize $\mathcal{L}$} \]

Doing this naively is too expensive: nested sum over nodes gives \(\mathcal{O}(|V|^2)\) complexity.
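To see where the \(\mathcal{O}(|V|^2)\) cost comes from, here is a naive NumPy evaluation of the loss above (toy embeddings and neighborhoods, purely illustrative):

```python
import numpy as np

def naive_loss(Z, walk_neighborhoods):
    """L = sum_u sum_{v in N_R(u)} -log softmax(z_u . z_v).

    Z: (|V|, d) embedding matrix; walk_neighborhoods: dict u -> multiset N_R(u).
    The softmax normalization sums over every node n in V, so the total cost
    is O(|V|) per term and O(|V|^2) overall.
    """
    loss = 0.0
    for u, neighbors in walk_neighborhoods.items():
        scores = Z @ Z[u]                    # z_u^T z_n for all n in V
        log_norm = np.log(np.exp(scores).sum())
        for v in neighbors:
            loss += log_norm - Z[u] @ Z[v]   # -log P(v | z_u)
    return loss

rng = np.random.default_rng(0)
Z = rng.normal(size=(4, 2))
N = {0: [1, 1, 2], 1: [0, 2]}   # toy multisets from random walks
print(naive_loss(Z, N))          # a non-negative scalar
```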

3.4 Negative Sampling

The normalization term \(\sum_{n \in V} \exp (\mathbf{z}_{u}^{\top} \mathbf{z}_{n})\) from the softmax is the culprit ... can we approximate it?

Solution: Negative sampling

\[\log \left[ \frac{ \exp (\mathbf{z}_{u}^{\top} \mathbf{z}_{v}) }{ \sum_{n \in V} \exp (\mathbf{z}_{u}^{\top} \mathbf{z}_{n} ) } \right] \approx \log \left[ \sigma (\mathbf{z}_{u}^{\top} \mathbf{z}_{v} ) \right] - \sum_{i=1}^{k} \log \left[ \sigma (\mathbf{z}_{u}^{\top} \mathbf{z}_{n_{i}} ) \right], \quad \text{where} \quad n_{i} \sim \mathbb{P}_{V} \]

where:

  • \(\mathbb{P}_{V}\) is the random distribution over nodes

Instead of normalizing w.r.t. all nodes, just normalize against \(k\) random "negative samples" \(n_i\).

  • Negative sampling allows for quick likelihood calculation.

Sample \(k\) negative nodes each with probability proportional to its degree.

Two considerations for \(k\) (the number of negative samples):

  • Higher \(k\) gives more robust estimates

  • Higher \(k\) corresponds to higher bias on negative events

In practice, \(k = 5 \sim 20\).
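A sketch of the negative-sampling estimate for a single \((u, v)\) pair, with degree-proportional sampling of negatives. All names and the toy degree statistics are illustrative assumptions, not a reference implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(Z, u, v, neg_nodes):
    """-[ log sigma(z_u.z_v) - sum_i log sigma(z_u.z_{n_i}) ], following the
    approximation above; neg_nodes holds the k sampled negatives."""
    pos = np.log(sigmoid(Z[u] @ Z[v]))
    neg = sum(np.log(sigmoid(Z[u] @ Z[n])) for n in neg_nodes)
    return -(pos - neg)

def sample_negatives(degrees, k, rng):
    """Sample k nodes, each with probability proportional to its degree."""
    p = degrees / degrees.sum()
    return rng.choice(len(degrees), size=k, p=p)

rng = np.random.default_rng(1)
Z = rng.normal(size=(6, 3))
degrees = np.array([3.0, 2.0, 2.0, 1.0, 1.0, 1.0])   # toy degree sequence
negs = sample_negatives(degrees, k=5, rng=rng)
print(neg_sampling_loss(Z, 0, 1, negs))
```

Only \(k + 1\) dot products are needed per term, instead of \(|V|\) for the full softmax normalizer.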

3.4.1 More details about negative sampling

Why is the approximation valid? (Goldberg & Levy, 2014)

  • Technically, this is a different objective. But negative sampling is a form of Noise Contrastive Estimation (NCE), which approximately maximizes the log probability of the softmax.

  • The new formulation corresponds to using logistic regression (the sigmoid function) to distinguish the target node \(v\) from nodes \(n_i\) sampled from the background distribution \(\mathbb{P}_V\).

Can negative sample be any node or only the nodes not on the walk?

  • People often use any nodes (for efficiency). However, the most "correct" way is to use nodes not on the walk.

3.5 Stochastic Gradient Descent

3.5.1 Objective function

Minimizing the objective function:

\[\mathcal{L} = \sum_{u \in V} \sum_{v \in N_{R}(u)} - \log \left[ \frac{\exp (\mathbf{z}_{u}^{\top} \mathbf{z}_{v})}{ \sum \limits_{n \in V} \exp (\mathbf{z}_{u}^{\top} \mathbf{z}_{n}) } \right] \]

3.5.2 Gradient Descent

Gradient Descent : a simple way to minimize \(\mathcal{L}\)

  • Initialize \(\boldsymbol{z}_u\) at some randomized value for all nodes \(u\).

  • Iterate until convergence:

    • For all \(u\), compute the derivative \(\frac{\partial \mathcal{L}}{\partial \boldsymbol{z}_u}\)

    • For all \(u\), make a step in reverse direction of derivative:

    \[\boldsymbol{z}_u \leftarrow \boldsymbol{z}_u - \eta \frac{\partial \mathcal{L}}{\partial \boldsymbol{z}_u} \]

    where \(\eta\) is the learning rate

3.5.3 Stochastic Gradient Descent

Stochastic Gradient Descent : Instead of evaluating gradients over all examples, evaluate it for each individual training example

  • Initialize \(\boldsymbol{z}_u\) at some randomized value for all nodes \(u\).

  • Iterate until convergence:

    • Sample a node \(u\); its contribution to the objective is

    \[\mathcal{L}^{(u)} = \sum_{v \in N_{R}(u)} - \log \big[\mathrm{P} (v \mid \boldsymbol{z}_u) \big] \]

    • For each \(v\), compute the derivative \(\dfrac{\partial \mathcal{L}^{(u)}}{\partial \boldsymbol{z}_v}\) and update:

    \[\boldsymbol{z}_v \leftarrow \boldsymbol{z}_v - \eta \frac{\partial \mathcal{L}^{(u)}}{\partial \boldsymbol{z}_v} \]
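The SGD loop can be illustrated on a toy objective. For clarity this sketch uses finite-difference gradients; a real implementation would use analytic gradients or an autograd framework:

```python
import numpy as np

def loss_u(Z, u, neighbors):
    """L^(u) = sum_{v in N_R(u)} -log softmax(z_u . z_v)."""
    scores = Z @ Z[u]
    log_norm = np.log(np.exp(scores).sum())
    return sum(log_norm - Z[u] @ Z[v] for v in neighbors)

def numerical_grad(Z, u, neighbors, node, eps=1e-6):
    """Finite-difference gradient of L^(u) w.r.t. z_node (toy demo only)."""
    g = np.zeros(Z.shape[1])
    for i in range(Z.shape[1]):
        Zp, Zm = Z.copy(), Z.copy()
        Zp[node, i] += eps
        Zm[node, i] -= eps
        g[i] = (loss_u(Zp, u, neighbors) - loss_u(Zm, u, neighbors)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
Z = rng.normal(size=(4, 2))   # toy embeddings: 4 nodes, d = 2
u, N_u = 0, [1, 2]
eta = 0.05
before = loss_u(Z, u, N_u)
Z[u] -= eta * numerical_grad(Z, u, N_u, u)   # one SGD step on z_u
for v in N_u:                                 # ...and on each neighbor's z_v
    Z[v] -= eta * numerical_grad(Z, u, N_u, v)
print(before, loss_u(Z, u, N_u))  # loss typically decreases after the step
```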

3.6 Strategies of random walks

  • Simplest idea: just run fixed-length, unbiased random walks starting from each node (DeepWalk, Perozzi et al., 2014)

Issue : this notion of similarity is too constrained

How can we generalize it?

3.7 Random Walks: Summary

  • Run short fixed-length random walks starting from each node on the graph

  • For each node \(u\) collect \(N_R(u)\), the multiset of nodes visited on random walks starting from \(u\).

  • Optimize embeddings using Stochastic Gradient Descent:

    \[\mathcal{L} = \sum_{u \in V} \sum_{v \in N_{R}(u)} - \log \big[ P ( v \mid \mathbf{z}_{u} ) \big] \]

    We can efficiently approximate this using negative sampling

4. node2vec

4.1 Overview of node2vec

Goal : Embed nodes with similar network neighborhoods close in the feature space.

We frame this goal as a maximum likelihood optimization problem, independent of the downstream prediction task.

Key observation: Flexible notion of network neighborhood \(N_R(u)\) of node \(u\) leads to rich node embeddings

Develop biased 2nd order random walk \(R\) to generate network neighborhood \(N_R(u)\) of node \(u\).

4.2 node2vec: Biased Walks

Idea: use flexible, biased random walks that can trade off between local and global views of the network (Grover and Leskovec, 2016).

Two classic strategies to define a neighborhood \(N_R(u)\) of a given node \(u\)

  • Breadth First Search (BFS): Local microscopic view

  • Depth First Search (DFS): Global macroscopic view

4.2.1 Interpolating Between BFS and DFS

Two parameters:

  • Return parameter \(p\):

    • Return back to the previous node
  • In-out parameter \(q\):

    • Moving outwards (DFS) vs. inwards (BFS)

    • Intuitively, \(q\) is the "ratio" of BFS vs. DFS

4.2.2 Biased Random Walks

Example: Biased 2nd-order random walks explore network neighborhoods
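A minimal sketch of the biased 2nd-order step, using the \(1/p\), \(1\), \(1/q\) weighting described above. This is a toy version for illustration, not the optimized alias-sampling implementation of the reference code:

```python
import random

def node2vec_step(adj, prev, curr, p, q, rng):
    """One biased 2nd-order step from `curr`, given the previous node `prev`:
    weight 1/p to return to prev, 1 to neighbors of prev (BFS-like),
    1/q to move further away (DFS-like)."""
    neighbors = adj[curr]
    weights = []
    for x in neighbors:
        if x == prev:
            weights.append(1.0 / p)     # return to previous node
        elif x in adj[prev]:
            weights.append(1.0)         # same distance from prev (BFS-like)
        else:
            weights.append(1.0 / q)     # moving outward (DFS-like)
    return rng.choices(neighbors, weights=weights, k=1)[0]

def node2vec_walk(adj, start, length, p, q, seed=None):
    rng = random.Random(seed)
    walk = [start, rng.choice(adj[start])]   # the first step is uniform
    while len(walk) < length:
        walk.append(node2vec_step(adj, walk[-2], walk[-1], p, q, rng))
    return walk

# Toy graph, assumed for illustration.
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}
print(node2vec_walk(adj, start=0, length=6, p=1.0, q=0.5, seed=7))
```

With small \(q\) the walk is pushed outward (DFS-like); with small \(p\) it tends to backtrack and stay local (BFS-like).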

4.3 node2vec algorithm

  • Step 1: Compute random walk probabilities

  • Step 2: Simulate \(r\) random walks of length \(l\) starting from each node \(u\)

  • Step 3: Optimize the node2vec objective using Stochastic Gradient Descent

Linear-time complexity

All 3 steps are individually parallelizable

4.4 Other Random Walk Ideas

Different kinds of biased random walks:

  • Based on node attributes (Dong et al., 2017).

  • Based on learned weights (Abu-El-Haija et al., 2017)

Alternative optimization schemes:

  • Directly optimize based on 1-hop and 2-hop random walk probabilities (as in LINE from Tang et al. 2015).

Network preprocessing techniques:

  • Run random walks on modified versions of the original network (e.g., struc2vec from Ribeiro et al., 2017; HARP from Chen et al., 2016).

4.5 Summary

Core idea : Embed nodes so that distances in embedding space reflect node similarities in the original network

Different notions of node similarity:

  • Naïve: similar if two nodes are connected

  • Neighborhood overlap

  • Random walk approaches

No one method wins in all cases

  • E.g., node2vec performs better on node classification while alternative methods perform better on link prediction (Goyal and Ferrara, 2017 survey)

  • Random walk approaches are generally more efficient.

In general: Must choose definition of node similarity that matches your application.

5. Embedding Entire Graphs

Goal: Embed a subgraph or an entire graph \(G\). Graph embedding: \(\boldsymbol{z}_G\) (node2vec from Grover and Leskovec, 2016)

Tasks:

  • Classifying toxic vs. non-toxic molecules

  • Identifying anomalous graphs

4.1 Approach 1

Simple but effective

  • Run a standard graph embedding technique on the (sub)graph \(G\).

  • Then just sum (or average) the node embeddings in the (sub)graph \(G\).

\[\boldsymbol{z}_G = \sum_{v \in G} \boldsymbol{z}_v \]

Used by Duvenaud et al. (2016) to classify molecules based on their graph structure
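The sum/average readout is one line of NumPy (toy embeddings, assumed already trained by any node-level method):

```python
import numpy as np

rng = np.random.default_rng(0)
node_embeddings = rng.normal(size=(5, 8))   # z_v for 5 nodes, d = 8 (toy)

z_G_sum = node_embeddings.sum(axis=0)       # z_G = sum_v z_v
z_G_mean = node_embeddings.mean(axis=0)     # or the average variant

assert z_G_sum.shape == (8,)
assert np.allclose(z_G_sum, 5 * z_G_mean)   # sum and mean differ by |V|
```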

4.2 Approach 2

Introduce a "virtual node" to represent the (sub)graph and run a standard graph embedding technique.

Proposed by Li et al. (2016) as a general technique for subgraph embedding

4.3 Approach 3: Anonymous Walk Embedding

States in anonymous walks correspond to the index of the first time we visited the node in a random walk (Ivanov and Burnaev, 2018)
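Anonymization records only the order of first visits, so node identities drop out. A small sketch:

```python
def anonymize(walk):
    """Map each node to the index of its first appearance in the walk."""
    first_seen = {}
    out = []
    for node in walk:
        if node not in first_seen:
            first_seen[node] = len(first_seen) + 1
        out.append(first_seen[node])
    return out

# Two walks over different nodes with the same visit pattern
# map to the same anonymous walk:
assert anonymize(["A", "B", "C", "B", "C"]) == [1, 2, 3, 2, 3]
assert anonymize(["C", "D", "B", "D", "B"]) == [1, 2, 3, 2, 3]
```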

6. Summary

Approach 1: Embed nodes and sum/avg them

Approach 2: Create super-node that spans the (sub) graph and then embed that node.

Approach 3: Anonymous Walk Embeddings

  • Idea 1: Sample anonymous walks and represent the graph as the fraction of times each anonymous walk occurs

  • Idea 2: Learn graph embedding together with anonymous walk embeddings.

References

Yoav Goldberg and Omer Levy. word2vec Explained: Deriving Mikolov et al.'s Negative-Sampling Word-Embedding Method. arXiv, 2014.

posted @ 2022-07-10 19:35 veager