Machine Learning with Graphs : 2 Traditional Methods for ML on Graphs

Stanford CS224W : Machine Learning with Graphs, Fall 2021
1. Introduction

Traditional ML pipeline uses hand-designed features.

Traditional features for:

  • Node-level prediction

  • Link-level prediction

  • Graph-level prediction

Machine Learning in Graphs

Goal : Make predictions for a set of objects

Design choices :

  • Features : \(d\)-dimensional vectors

  • Objects : Nodes, edges, sets of nodes, entire graphs

  • Objective function : What task are we aiming to solve?

Example : node-level prediction

  • Given: \(G = (V, E)\)

  • Learn a function: \(f: V \to \mathbb{R}\)

2. Node-Level Tasks and Features

Goal: Characterize the structure and position of a node in the network:

  • Node degree

  • Node centrality

  • Clustering coefficient

  • Graphlets

2.1 Node Features: Node Degree

  • The degree \(k_v\) of node \(v\) is the number of edges (neighboring nodes) the node has.

  • Treats all neighboring nodes equally.

  • In a directed graph: in-degree and out-degree
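As a quick sketch (a toy graph, assumed here for illustration), degree features can be read off a networkx graph directly:

```python
import networkx as nx

# a small undirected example graph (assumed for illustration)
G = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3)])

k = dict(G.degree())   # degree k_v of every node v
print(k)               # {0: 2, 1: 2, 2: 3, 3: 1}

# a directed graph distinguishes in-degree and out-degree
D = nx.DiGraph([(0, 1), (2, 1)])
print(dict(D.in_degree()))    # {0: 0, 1: 2, 2: 0}
print(dict(D.out_degree()))   # {0: 1, 1: 0, 2: 1}
```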

2.2 Node Features: Node Centrality

Node degree counts the neighboring nodes without capturing their importance.

Node centrality \(c_v\) takes the node importance in a graph into account

  • Different ways to model importance:

  • Eigenvector centrality

  • Betweenness centrality

  • Closeness centrality

  • and many others ...

(1) Eigenvector centrality

A node \(v\) is important if surrounded by important neighboring nodes \(u \in N(v)\)

We model the centrality of node \(v\) as the sum of the centrality of neighboring nodes:

\[c_v = \frac{1}{\lambda} \sum_{u \in N(v)} c_u \]

where:

  • \(N(v)\) represents the neighboring node set of the node \(v\),

  • \(\lambda\) is a normalization constant (it will turn out to be the largest eigenvalue of the adjacency matrix \(\boldsymbol{A}\))

Notice that the above equation models centrality in a recursive manner.

Rewriting the recursive equation in matrix form:

\[\lambda \ \boldsymbol{c} = \boldsymbol{A c} \]

where:

  • \(\boldsymbol{A}\) is the adjacency matrix, \(\boldsymbol{A}_{u v} = 1\) if \(u \in N(v)\)

  • \(\boldsymbol{c}\) is centrality vector, \(\boldsymbol{c} = [c_1,c_2,\cdots,c_v, \cdots, c_N]^{\top}\)

  • \(\lambda\) is an eigenvalue of \(\boldsymbol{A}\)

We can see that:

  • the centrality vector \(\boldsymbol{c}\) is an eigenvector of the adjacency matrix \(\boldsymbol{A}\)

  • For a connected graph, the largest eigenvalue \(\lambda_{\max}\) is positive and simple, and its eigenvector can be chosen with all-positive entries (by the Perron-Frobenius theorem).

  • The eigenvector \(\boldsymbol{c}_{\max}\) corresponding to \(\lambda_{\max}\) is used for centrality.
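The computation can be sketched with power iteration (the toy graph below is assumed for illustration; `nx.eigenvector_centrality` is the library counterpart):

```python
import networkx as nx
import numpy as np

G = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3)])  # assumed toy graph
A = nx.to_numpy_array(G)

# power iteration: repeatedly apply A and renormalize; with a positive
# start vector this converges to the eigenvector of lambda_max
c = np.ones(A.shape[0])
for _ in range(200):
    c = A @ c
    c = c / np.linalg.norm(c)

ref = nx.eigenvector_centrality(G, max_iter=1000)
print(c)     # node 2 (in the triangle, with the pendant) scores highest
print(ref)   # networkx agrees up to numerical tolerance
```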

(2) Betweenness centrality

A node is important if it lies on many shortest paths between other nodes:

\[c_v = \sum_{s \ne v \ne t} \frac{ \#(\text{shortest paths between $s$ and $t$ that contain $v$}) }{ \#(\text{shortest paths between $s$ and $t$}) } = \sum_{s, t \in V \text{ and } s \ne t} \frac{\#(s, t|v)}{\#(s, t)} \]

Note that there may exist more than one shortest path between \(s\) and \(t\).

Code

nx.betweenness_centrality(G, k=None, normalized=True, weight=None, endpoints=False, seed=None)

Parameters:

  • normalized=True : the betweenness values are normalized by

    • \(2/((N-1)(N-2))\) for undirected graphs,

    • and \(1/((N-1)(N-2))\) for directed graphs

  • weight (None or string, optional, default=None): the weight used to calculate weighted shortest paths (i.e., distance).

    • If None, all edge weights are considered equal.

nx.betweenness_centrality(G)     # returns a dict keyed by node
nx.betweenness_centrality(G)[0]  # betweenness value of node 0

(3) Closeness centrality

A node is important if it has small shortest path lengths to all other nodes.

\[c_v = \frac{1}{\sum \limits_{u \ne v} \text{shortest path length between $u$ and $v$}} \]

Code

nx.closeness_centrality(G, u=None, distance=None, wf_improved=True), which computes:

\[c_v = \frac{N - 1}{\sum \limits_{u \in V, u \neq v} d(v, u)} \]

nx.closeness_centrality(G, 0)  # closeness of node 0 (a float)
nx.closeness_centrality(G)     # dict over all nodes

2.3 Node Features: clustering coefficient

Clustering coefficient: Measures how connected \(v\)'s neighboring nodes are:

\[e_v = \frac{ \# (\text{ edges among neighboring nodes}) }{ \binom{k_v}{2} } \in [0, 1] \]

where \(\binom{k_v}{2} = \frac{k_v (k_v - 1)}{2}\) is the number of node pairs among the \(k_v\) neighboring nodes (for an undirected graph, \(k_v\) is the degree of node \(v\)).

Observation: clustering coefficient counts the \(\#(\text{triangles})\) in the ego-network
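A hand-rolled version of the formula can be checked against networkx (the toy graph is assumed for illustration):

```python
import networkx as nx
from itertools import combinations

G = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3)])  # assumed toy graph

def clustering_(G, v):
    '''e_v = #(edges among v's neighbors) / C(k_v, 2).'''
    neigh = list(G.neighbors(v))
    k = len(neigh)
    if k < 2:
        return 0.0
    links = sum(1 for u, w in combinations(neigh, 2) if G.has_edge(u, w))
    return links / (k * (k - 1) / 2)

print({v: clustering_(G, v) for v in G})  # matches nx.clustering(G)
print(nx.clustering(G))
```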

2.4 Node Features: Graphlets

We can generalize clustering coefficient by counting \(\#( \text{pre-specified subgraphs})\) (i.e., graphlets).

Goal: describe network structure around node \(u\)

Graphlets are small subgraphs that describe the structure of node \(u\)'s network neighborhood

2.4.1 Analogy

  • Degree counts \(\#(\text{edges})\) that a node touches

  • Clustering coefficient counts \(\#(\text{triangles})\) that a node touches.

  • Graphlet Degree Vector (GDV): graphlet-based features for nodes; the GDV counts \(\#(\text{graphlets})\) that a node touches

Considering graphlets of size 2-5 nodes we get:

  • Vector of 73 coordinates is a signature of a node that describes the topology of the node's neighborhood

Graphlet degree vector provides a measure of a node's local network topology:

  • Comparing vectors of two nodes provides a more detailed measure of local topological similarity than node degrees or clustering coefficient.

2.4.2 Induced Subgraph & Isomorphism

An induced subgraph is another graph, formed from a subset of vertices and all of the edges connecting the vertices in that subset.

Graph Isomorphism : Two graphs that contain the same number of nodes connected in the same way are said to be isomorphic.

Graphlets : Rooted connected induced non-isomorphic subgraphs.

Graphlet Degree Vector (GDV) : A count vector of graphlets rooted at a given node.

2.5 Summary

Importance-based features : capture the importance of a node in a graph

  • Node degree: Simply counts the number of neighboring nodes

  • Different node centrality measures

    • Models importance of neighboring nodes in a graph

    • Different modeling choices: eigenvector centrality, betweenness centrality, closeness centrality

Structure-based features : Capture topological properties of local neighborhood around a node.

  • Node degree : Counts the number of neighboring nodes

  • Clustering coefficient : Measures how connected neighboring nodes are

  • Graphlet count vector : Counts the occurrences of different graphlets

3. Link-level Prediction Task and Features

Link-level prediction task is to predict new links based on the existing links.

At test time, node pairs (with no existing links) are ranked, and top \(K\) node pairs are predicted.

The key is to design features for a pair of nodes.

Two formulations of the link prediction task:

  • Links missing at random : Remove a random set of links and then aim to predict them

  • Links over time : Given \(G[t_0, t_0']\), a graph defined by edges up to time \(t_0'\), output a ranked list \(L\) of edges (not in \(G[t_0, t_0']\)) that are predicted to appear in \(G[t_1, t_1']\)

Evaluation :

  • \(n = |E_{\text{new}}|\): # new edges that appear during the test period \([t_1, t_1']\)

  • Take top \(n\) elements of \(L\) and count correct edges

Methodology:

  • For each pair of nodes \((x,y)\) compute score \(c(x,y)\)

    • For example, \(c(x,y)\) could be the # of common neighbors of \(x\) and \(y\)

  • Sort pairs \((x,y)\) by the decreasing score \(c(x,y)\)

  • Predict top \(n\) pairs as new links

  • See which of these links actually appear in \(G[t_1, t_1']\)
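The steps above can be sketched as follows (an assumed example on the karate-club graph, with the common-neighbor count standing in for the score \(c(x,y)\)):

```python
import networkx as nx
from itertools import combinations

G = nx.karate_club_graph()  # assumed example graph

# score every unconnected pair by its number of common neighbors
scores = {
    (x, y): len(list(nx.common_neighbors(G, x, y)))
    for x, y in combinations(G.nodes(), 2)
    if not G.has_edge(x, y)
}

# sort pairs by decreasing score and predict the top-n as new links
n = 5
top_n = sorted(scores, key=scores.get, reverse=True)[:n]
print(top_n)
```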

3.2 Link-Level Features

Three ways to design features for a pair of nodes:

  • Distance-based features

  • Local neighborhood overlap

  • Global neighborhood overlap

3.2.1 Distance-Based Features

Shortest-path distance between two nodes.

However, this does not capture the degree of neighborhood overlap

3.2.2 Local Neighborhood Overlap

Captures # neighboring nodes shared between two nodes \(v_1\) and \(v_2\)

Limitation of local neighborhood features

  • Metric is always zero if the two nodes do not have
    any neighbors in common

(1) Common neighbors

\[|N(v_1) \cap N(v_2)| \]

where \(N(v_1)\) represents the set of nodes adjacent to node \(v_1\).

Code

networkx.common_neighbors(G, u, v) returns an iterator over the common neighbors of u and v

networkx.common_neighbor_centrality(G, ebunch=None, alpha=0.8): the CCPA score, given by

\[\alpha \cdot \left( \left| N (u) \cap N (v) \right| \right) + (1 - \alpha ) \cdot \frac{N}{{d}_{uv}} \]

  • where \(N (u) \cap N (v)\) denotes the intersection of \(u\)'s neighbors and \(v\)'s neighbors

  • \(\alpha\) is a parameter varying in \([0,1]\)

  • \(N\) is the total number of nodes

  • \(d_{uv}\) is the shortest distance between node \(u\) and \(v\)

Example: Common neighbors

import networkx as nx

def common_neighbor_(G):
    '''
    Params:
    -------
    G : graph

    Returns:
    --------
    cn : dict
         key is a 2-tuple (u, v); value is the number of common neighbors
    '''
    cn = {}
    for edge in G.edges:
        u, v = edge
        neigh = nx.common_neighbors(G, u, v)
        cn[(u, v)] = len(list(neigh))
    return cn
# ==================================================================

# compute the CN index of edge (0, 1)
cn = nx.common_neighbor_centrality(G, alpha=1., ebunch=[(0, 1)])
print(list(cn))
# compute the CN index of all non-adjacent node pairs (default ebunch)
cn1 = nx.common_neighbor_centrality(G, alpha=1.)
print(list(sorted(cn1)))

cn2 = common_neighbor_(G)
print(cn2)

# test
total_error = 0.
for edge in G.edges:
    cn1 = nx.common_neighbor_centrality(G, alpha=1., ebunch=[edge])
    cn1 = list(cn1)[0][2]
    delta = abs(cn2[edge] - cn1)
    total_error = total_error + delta
print(total_error)

(2) Jaccard's coefficient

\[\frac{ \left| N (v_{1}) \cap N (v_{2}) \right| }{ \left |N (v_{1}) \cup N (v_{2}) \right|} \]

代码 networkx.jaccard_coefficient(G, ebunch=None)

Example: Jaccard's coefficient

def jaccard_coefficient_(G):
    '''
    Params:
    -------
    G : graph

    Returns:
    --------
    jc : dict
         key is a 2-tuple (u, v); value is the Jaccard's coefficient
    '''
    jc = {}
    for edge in G.edges:
        u, v = edge
        neigh_u = set(G.neighbors(u))
        neigh_v = set(G.neighbors(v))
        jc[(u, v)] = len(neigh_u.intersection(neigh_v)) / len(neigh_u.union(neigh_v))
    return jc
# ============================================================================

# compute the JC index of edge (0, 1)
jc = nx.jaccard_coefficient(G, ebunch=[(0, 1)])
print(list(jc))
# compute the JC index of all non-adjacent node pairs (default ebunch)
jc1 = nx.jaccard_coefficient(G)
print(list(jc1))

jc2 = jaccard_coefficient_(G)
print(jc2)

# test
total_error = 0.
for edge in G.edges:
    jc1 = nx.jaccard_coefficient(G, ebunch=[edge])
    jc1 = list(jc1)[0][2]
    delta = abs(jc2[edge] - jc1)
    total_error = total_error + delta
print(total_error)

(3) Adamic-Adar index

\[\sum_{u \in N\left(v_{1}\right) \cap N\left(v_{2}\right)} \frac{1}{\log \left(k_{u}\right)} = \sum_{u \in N\left(v_{1}\right) \cap N\left(v_{2}\right)} \frac{1}{\log \left( |N(u)| \right)} \]

where \(k_u\) represents the number of neighbors of node \(u\), i.e. \(k_u = |N(u)|\)

代码 networkx.adamic_adar_index(G, ebunch=None)

Example: Adamic-Adar index

import numpy as np

def adamic_adar_index_(G):
    '''
    Params:
    -------
    G : graph

    Returns:
    --------
    aai : dict
          key is a 2-tuple (u, v); value is the Adamic-Adar index
    '''
    aai = {}
    for edge in G.edges:
        u, v = edge
        common_neigh = list(nx.common_neighbors(G, u, v))
        if len(common_neigh) == 0:
            aai[(u, v)] = 0.
        else:
            aai_node = [1 / np.log(len(list(G.neighbors(node)))) for node in common_neigh]
            aai[(u, v)] = sum(aai_node)
    return aai
# ================================================================================

# compute the AA index of edge (0, 1)
aai = nx.adamic_adar_index(G, ebunch=[(0, 1)])
print(list(aai))
# compute the AA index of all non-adjacent node pairs (default ebunch)
aai1 = nx.adamic_adar_index(G)
print(list(aai1))

aai2 = adamic_adar_index_(G)
print(aai2)

total_error = 0.
for edge in G.edges:
    aai1 = nx.adamic_adar_index(G, ebunch=[edge])
    aai1 = list(aai1)[0][2]
    delta = abs(aai2[edge] - aai1)
    total_error = total_error + delta
print(total_error)

3.2.3 Global Neighborhood Overlap

Global neighborhood overlap metrics resolve this limitation by considering the entire graph.

(1) Katz index

Katz index : count the number of walks (\(\#\text{walk}\)) of all lengths between a given pair of nodes.

Powers of the graph adjacency matrix : Computing #walks between two nodes

  • \(A_{uv} = 1\) if \(u \in N(v)\)

  • Let \(P_{uv}^{(K)} = \# \text{walks}\) of length \(K\) between \(u\) and \(v\), i.e., the number of \(K\)-step walks from node \(u\) to node \(v\) (cf. the Floyd-Warshall-style dynamic program).

  • \(P^{(K)} = A^K\)

Compute \(P_{uv}^{(2)}\):

  • Step 1 : Compute \(\# \text{walk}\) of length 1 between each of \(u\)'s neighbor and \(v\)

  • Step 2 : Sum up these \(\# \text{walks}\) across \(u\)'s neighbors

    • \(P^{(2)}_{uv} = \sum \limits_i A_{ui} \cdot P_{iv}^{(1)} = \sum \limits_i A_{ui} \cdot A_{iv} = A^2_{u v}\)

Compute \(\#\text{walk}\) between two nodes: use adjacency matrix powers

  • \(A_{uv}\) specifies \(\#\text{walk}\) of length \(1\) (direct neighborhood) between \(u\) and \(v\).

  • \(A_{uv}^{2}\) specifies \(\#\text{walk}\) of length \(2\) (neighbor of neighbor) between \(u\) and \(v\)

  • \(A_{uv}^{l}\) specifies \(\#\text{walk}\) of length \(l\)
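The walk-counting property of adjacency powers can be verified numerically on a small assumed graph:

```python
import networkx as nx

G = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3)])  # assumed toy graph
A = nx.to_numpy_array(G)

A2 = A @ A
print(A2[0, 2])  # 1.0 : the single length-2 walk 0-1-2
print(A2[0, 3])  # 1.0 : the single length-2 walk 0-2-3
print(A2[0, 0])  # 2.0 : length-2 walks returning to 0, i.e. the degree of 0
```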

Katz index between \(v_1\) and \(v_2\) (from node \(v_1\) to node \(v_2\)) is calculated as sum over all walk lengths

\[S_{v_1 v_2} = \sum_{l=1}^{\infty} \beta^l \, A_{v_1 v_2}^{l} \]

where \(A_{v_1 v_2}^{l}\) is the \(\#(\text{walk})\) of length \(l\) between \(v_1\) and \(v_2\); \(0 < \beta < 1\) is the discount factor.

Katz index matrix is computed in closed-form:

\[S = \sum_{i=1}^{\infty} \beta^i A^i = (I - \beta A)^{-1} - I \]

where \((I - \beta A)^{-1} = \sum \limits_{i=0}^{\infty} \beta^i A^i\) by the geometric series of matrices, which converges when \(\beta < 1 / \lambda_{\max}\) (Hamilton, 2020, p. 18):

\[\sum_{i=0}^{\infty} \beta^i A^i = I + \beta A + \beta^2 A^2 + \cdots = (I - \beta A)^{-1} \]

Katz index of nodes (a.k.a. Katz centrality)

  • For a directed graph, it aggregates walks along incoming edges (i.e., from all other nodes to node \(i\)):

\[x_i = \sum_{l=1}^{\infty} \sum_{j=1}^{N} \beta^l (A^l)_{ji} = \sum_{j=1}^{N} S_{ji} \]

matrix form

\[\boldsymbol{x} = ((I - \beta A^\top)^{-1} - I) \boldsymbol{1} = (I - \beta A^\top)^{-1} \boldsymbol{1} - \boldsymbol{1} \]

where \(\boldsymbol{1}\) is the all-one vector.
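A sketch checking the closed form against a truncated series, and computing the Katz centrality vector (the toy graph and \(\beta\) are assumed; \(\beta\) must satisfy \(\beta < 1/\lambda_{\max}\)):

```python
import networkx as nx
import numpy as np

G = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3)])  # assumed toy graph
A = nx.to_numpy_array(G)
N = A.shape[0]
beta = 0.1  # assumed; must satisfy beta < 1 / lambda_max for convergence

# closed form: S = (I - beta A)^{-1} - I
S = np.linalg.inv(np.eye(N) - beta * A) - np.eye(N)

# truncated series: sum_{l=1}^{L} beta^l A^l approaches S as L grows
S_trunc = sum(beta ** l * np.linalg.matrix_power(A, l) for l in range(1, 30))
print(np.allclose(S, S_trunc, atol=1e-10))  # True

# Katz centrality: column sums of S (here A = A^T, so axis choice is moot)
x = S.sum(axis=0)
print(x)
```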

3.3 Summary

Distance-based features:

  • Uses the shortest path length between two nodes but does not capture how neighborhoods overlap.

Local neighborhood overlap:

  • Captures how many neighboring nodes are shared by two nodes.

  • Becomes zero when no neighbor nodes are shared.

Global neighborhood overlap:

  • Uses global graph structure to score two nodes.

  • Katz index counts \(\#\text{walks}\) of all lengths between two nodes

4. Graph-Level Features and Graph Kernels

4.1 Kernel Methods

Kernel methods are widely-used for traditional ML for graph-level prediction.

Idea : Design kernels instead of feature vectors.

A quick introduction to Kernels:

  • A kernel \(\kappa(G, G') \in \mathbb{R}\) measures the similarity between two graphs

  • The kernel matrix \(\mathbf{K} = [\kappa(G, G')]_{G, G'}\) must always be positive semidefinite (i.e., have non-negative eigenvalues)

  • There exists a feature representation \(\phi(\cdot)\) such that \(\kappa(G, G') = \phi(G)^{\top} \phi(G')\)

  • Once the kernel is defined, off-the-shelf ML models, such as the kernel SVM, can be used to make predictions.
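A minimal numeric illustration of the positive-semidefinite property: any kernel of the form \(\kappa(G, G') = \phi(G)^{\top}\phi(G')\) yields a PSD kernel matrix (the random rows below are placeholders for real graph feature vectors, assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.random((5, 8))  # rows play the role of feature vectors phi(G)

K = Phi @ Phi.T           # K[i, j] = phi(G_i)^T phi(G_j)

# K is symmetric positive semidefinite: all eigenvalues are >= 0
print(np.allclose(K, K.T))
print(np.linalg.eigvalsh(K).min())  # non-negative up to round-off
```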

4.2 Graph-Level Features: Graph Kernel

Goal : We want features that characterize the structure of an entire graph

4.2.1 Overview

Graph Kernels : Measure similarity between two graphs:

  • Graphlet Kernel: Shervashidze, Nino, et al. "Efficient graphlet kernels for large graph comparison." Artificial Intelligence and Statistics. 2009

  • Weisfeiler-Lehman Kernel: Shervashidze, Nino, et al. "Weisfeiler-Lehman graph kernels." Journal of Machine Learning Research 12 (2011).

  • Other kernels

    • Random-walk kernel

    • Shortest-path graph kernel

    • And many more ...

4.2.2 Graph Kernel: Key Idea

Goal : Design graph feature vector \(\phi(G)\)

Key idea : Bag-of-Words (BoW) for a graph

  • Bag-of-Words (BoW): simply uses the word counts as features for documents (no ordering considered).

  • Naïve extension to a graph: Regard nodes as words.

  • Since both graphs have 4 red nodes, we get the same feature vector for two different graphs.

A better alternative, Bag of node degrees : count the nodes of each degree, rather than the nodes themselves.

Both Graphlet Kernel and Weisfeiler-Lehman (WL) Kernel use Bag-of-* representation of graph, where * is more sophisticated than node degrees!

4.3 Graphlet Features

Key idea : Count the number of different graphlets in a graph.

Graphlet : Note that the definition of graphlets here is slightly different from the one used for node-level features.

  • The two differences are:

    • Nodes in graphlets here do not need to be connected (allows for isolated nodes)

    • The graphlets here are not rooted.

  • Examples: Let \(\mathcal{G}_{k} = (g_1, g_2, \cdots, g_{n_k})\) be a list of graphlets of size \(k\)

    • For \(k=3\), there are 4 graphlets

    • For \(k=4\), there are 11 graphlets

Graphlet count vector : Given graph \(G\), and a graphlet list \(\mathcal{G}_{k} = (g_1, g_2, \cdots, g_{n_k})\), define the graphlet count vector \(\boldsymbol{f}_G \in \mathbb{R}^{n_k}\) as:

\[(\boldsymbol{f}_G)_i = \#(g_i \subseteq G), \quad \forall i = 1,2,\cdots, n_k \]


4.3.1 Graphlet kernel

Graphlet kernel : Given two graphs, \(G\) and \(G'\), graphlet kernel is computed as:

\[\kappa(G, G') = \boldsymbol{f}_G^{\top} \, \boldsymbol{f}_{G'} \]

Problem: if \(G\) and \(G'\) have different sizes, that will greatly skew the value.

Solution: normalize each feature vector

\[\kappa(G, G') = \boldsymbol{h}_G^{\top} \, \boldsymbol{h}_{G'}, \qquad \text{where} \ \boldsymbol{h}_G = \frac{\boldsymbol{f}_G}{\text{Sum}(\boldsymbol{f}_G)} \]
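A sketch of the normalized graphlet kernel for size-3 graphlets (for three nodes, the number of induced edges uniquely determines the graphlet; the toy graphs are assumed):

```python
import networkx as nx
import numpy as np
from itertools import combinations

def graphlet_count_3(G):
    '''Induced size-3 graphlet counts: [no edge, 1 edge, 2-edge path, triangle].'''
    f = np.zeros(4)
    for trio in combinations(G.nodes(), 3):
        m = G.subgraph(trio).number_of_edges()  # 0..3 edges fix the graphlet type
        f[m] += 1
    return f

G1 = nx.cycle_graph(4)  # assumed toy graphs
G2 = nx.path_graph(4)

f1, f2 = graphlet_count_3(G1), graphlet_count_3(G2)
h1, h2 = f1 / f1.sum(), f2 / f2.sum()   # normalize before taking the kernel
print(f1, f2)            # [0. 0. 4. 0.] [0. 2. 2. 0.]
print(float(h1 @ h2))    # 0.5
```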

Limitations: Counting graphlets is expensive! Counting size-\(k\) graphlets of a graph with \(n\) nodes by enumeration takes \(O(n^k)\) time.

4.3.2 Weisfeiler-Lehman Kernel

Goal : Design an efficient graph feature descriptor \(\phi(G)\)

Idea : Use neighborhood structure to iteratively enrich the node vocabulary.

  • Generalized version of Bag of node degrees since node degrees are one-hop neighborhood information.

Algorithm to achieve this: Color refinement

4.3.2.1 Color refinement

Given : A graph \(G\) with a set of nodes \(V\).

  • Assign an initial color \(c^{(0)}(v)\) to each node \(v\)

  • Iteratively refine node colors by

    \[c^{(k+1)}(v) = \text{HASH} \left(\left\{c^{(k)}(v),\left\{c^{(k)}(u) \right\}_{u \in N(v)}\right\}\right) = \text{HASH} \left(\left\{c^{(k)}(u) \right\}_{u \in \{ v \} \, \cup \, N(v) } \right) \]

    where \(\text{HASH}\) maps different inputs to different colors

  • After \(K\) steps of color refinement, \(c^{(K)}(v)\) summarizes the structure of the \(K\)-hop neighborhood.
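A minimal sketch of color refinement (the integer relabeling plays the role of HASH; the toy graph is assumed for illustration):

```python
import networkx as nx

def color_refinement(G, K=3):
    '''Run K rounds of WL color refinement; return the final node colors.'''
    colors = {v: 0 for v in G}  # initial color c^(0)(v), identical for all nodes
    for _ in range(K):
        # signature = own color plus the sorted multiset of neighbor colors
        sig = {v: (colors[v], tuple(sorted(colors[u] for u in G.neighbors(v))))
               for v in G}
        # relabel distinct signatures with small integers (the HASH step)
        table = {s: i for i, s in enumerate(sorted(set(sig.values())))}
        colors = {v: table[sig[v]] for v in G}
    return colors

G = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3)])  # assumed toy graph
print(color_refinement(G))  # nodes 0 and 1 share a color; 2 and 3 each differ
```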


4.3.2.2 Weisfeiler-Lehman Graph Features

After color refinement, WL kernel counts number of nodes with a given color.

  • WL kernel is computationally efficient

    • The time complexity for color refinement at each step is linear in \(\#(\text{edges})\), since it involves aggregating neighboring colors.

  • When computing a kernel value, only colors that appear in the two graphs need to be tracked.

    • Thus, \(\#(\text{colors})\) is at most the total number of nodes.

  • Counting colors takes linear time w.r.t. \(\#(\text{nodes})\).

  • In total, time complexity is linear in \(\#(\text{edges})\).

4.4 Summary

Graphlet Kernel

  • Graph is represented as Bag-of-graphlets

  • Computationally expensive

Weisfeiler-Lehman Kernel

  • Apply \(K\)-step color refinement algorithm to enrich node colors

  • Different colors capture different \(K\)-hop neighborhood structures

  • Graph is represented as Bag-of-colors

  • Computationally efficient

  • Closely related to Graph Neural Networks

5. Summary

Traditional ML Pipeline

  • Hand-crafted feature + ML model

Hand-crafted features for graph data

  • Node-level:

    • Node degree, centrality, clustering coefficient, graphlets

  • Link-level:

    • Distance-based feature

    • local/global neighborhood overlap

  • Graph-level:

    • Graphlet kernel, WL kernel

References:

Katz centrality, Wikipedia

posted @ 2022-07-08 20:20 veager