Machine Learning with Graphs : 2 Traditional Methods for ML on Graphs

Stanford CS224W : Machine Learning with Graphs, Fall 2021
1. Introduction

Traditional ML pipeline uses hand-designed features.

Traditional features for:

  • Node-level prediction

  • Link-level prediction

  • Graph-level prediction

Machine Learning in Graphs

Goal : Make predictions for a set of objects

Design choices :

  • Features : \(d\)-dimensional vectors

  • Objects : Nodes, edges, sets of nodes, entire graphs

  • Objective function : What task are we aiming to solve?

Example : node-level prediction

  • Given: \(G = (V, E)\)

  • Learn a function: \(f: V \to \mathbb{R}\)

2. Node-Level Tasks and Features

Goal: Characterize the structure and position of a node in the network:

  • Node degree

  • Node centrality

  • Clustering coefficient

  • Graphlets

2.1 Node Features: Node Degree

  • The degree \(k_v\) of node \(v\) is the number of edges (neighboring nodes) the node has.

  • Treats all neighboring nodes equally.

  • In a directed graph: in-degree and out-degree
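As a quick sketch (a toy graph, assumed here for illustration), degree features can be read off a networkx graph directly:

```python
import networkx as nx

# a small undirected example graph (assumed for illustration)
G = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3)])

k = dict(G.degree())   # degree k_v of every node v
print(k)               # {0: 2, 1: 2, 2: 3, 3: 1}

# a directed graph distinguishes in-degree and out-degree
D = nx.DiGraph([(0, 1), (2, 1)])
print(dict(D.in_degree()))    # {0: 0, 1: 2, 2: 0}
print(dict(D.out_degree()))   # {0: 1, 1: 0, 2: 1}
```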

2.2 Node Features: Node Centrality

Node degree counts the neighboring nodes without capturing their importance.

Node centrality \(c_v\) takes the node importance in a graph into account

  • Different ways to model importance:

  • Eigenvector centrality

  • Betweenness centrality

  • Closeness centrality

  • and many others ...

(1) Eigenvector centrality

A node \(v\) is important if surrounded by important neighboring nodes \(u \in N(v)\)

We model the centrality of node \(v\) as the sum of the centrality of neighboring nodes:

\[c_v = \frac{1}{\lambda} \sum_{u \in N(v)} c_u \]

where:

  • \(N(v)\) represents the neighboring node set of the node \(v\),

  • \(\lambda\) is a normalization constant (it will turn out to be the largest eigenvalue of the adjacency matrix \(\boldsymbol{A}\))

Notice that the above equation models centrality in a recursive manner.

Rewriting the recursive equation in matrix form:

\[\lambda \ \boldsymbol{c} = \boldsymbol{A c} \]

where:

  • \(\boldsymbol{A}\) is the adjacency matrix, \(\boldsymbol{A}_{u v} = 1\) if \(u \in N(v)\)

  • \(\boldsymbol{c}\) is centrality vector, \(\boldsymbol{c} = [c_1,c_2,\cdots,c_v, \cdots, c_N]^{\top}\)

  • \(\lambda\) is an eigenvalue of \(\boldsymbol{A}\)

We can see that:

  • the centrality vector \(\boldsymbol{c}\) is an eigenvector of the adjacency matrix \(\boldsymbol{A}\)

  • For a connected graph, the largest eigenvalue \(\lambda_{\max}\) is positive and simple, and its eigenvector can be chosen with all-positive entries (by the Perron-Frobenius theorem).

  • The eigenvector \(\boldsymbol{c}_{\max}\) corresponding to \(\lambda_{\max}\) is used for centrality.
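The computation can be sketched with power iteration (the toy graph below is assumed for illustration; `nx.eigenvector_centrality` is the library counterpart):

```python
import networkx as nx
import numpy as np

G = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3)])  # assumed toy graph
A = nx.to_numpy_array(G)

# power iteration: repeatedly apply A and renormalize; with a positive
# start vector this converges to the eigenvector of lambda_max
c = np.ones(A.shape[0])
for _ in range(200):
    c = A @ c
    c = c / np.linalg.norm(c)

ref = nx.eigenvector_centrality(G, max_iter=1000)
print(c)     # node 2 (in the triangle, with the pendant) scores highest
print(ref)   # networkx agrees up to numerical tolerance
```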

(2) Betweenness centrality

A node is important if it lies on many shortest paths between other nodes:

\[c_v = \sum_{s \ne v \ne t} \frac{ \#(\text{shortest paths between $s$ and $t$ that contain $v$}) }{ \#(\text{shortest paths between $s$ and $t$}) } = \sum_{s, t \in V \text{ and } s \ne t} \frac{\#(s, t|v)}{\#(s, t)} \]

Note that there may exist more than one shortest path between \(s\) and \(t\).

Code

nx.betweenness_centrality(G, k=None, normalized=True, weight=None, endpoints=False, seed=None)

Parameters:

  • normalized=True : the betweenness values are normalized by

    • \(2/((N-1)(N-2))\) for undirected graphs,

    • and \(1/((N-1)(N-2))\) for directed graphs

  • weight (None or string, optional, default=None): the weight used to calculate weighted shortest paths (i.e., distance).

    • If None, all edge weights are considered equal.

nx.betweenness_centrality(G)     # returns a dict keyed by node
nx.betweenness_centrality(G)[0]  # betweenness value of node 0

(3) Closeness centrality

A node is important if it has small shortest path lengths to all other nodes.

\[c_v = \frac{1}{\sum \limits_{u \ne v} \text{shortest path length between $u$ and $v$}} \]

Code

nx.closeness_centrality(G, u=None, distance=None, wf_improved=True), which computes:

\[c_v = \frac{N - 1}{\sum \limits_{u \in V, u \neq v} d(v, u)} \]

nx.closeness_centrality(G, 0)  # closeness of node 0 (a float)
nx.closeness_centrality(G)     # dict over all nodes

2.3 Node Features: clustering coefficient

Clustering coefficient: Measures how connected \(v\)'s neighboring nodes are:

\[e_v = \frac{ \# (\text{ edges among neighboring nodes}) }{ \binom{k_v}{2} } \in [0, 1] \]

where \(\binom{k_v}{2} = \frac{k_v (k_v - 1)}{2}\) is the number of node pairs among the \(k_v\) neighboring nodes (for an undirected graph, \(k_v\) is the degree of node \(v\)).

Observation: clustering coefficient counts the \(\#(\text{triangles})\) in the ego-network
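A hand-rolled version of the formula can be checked against networkx (the toy graph is assumed for illustration):

```python
import networkx as nx
from itertools import combinations

G = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3)])  # assumed toy graph

def clustering_(G, v):
    '''e_v = #(edges among v's neighbors) / C(k_v, 2).'''
    neigh = list(G.neighbors(v))
    k = len(neigh)
    if k < 2:
        return 0.0
    links = sum(1 for u, w in combinations(neigh, 2) if G.has_edge(u, w))
    return links / (k * (k - 1) / 2)

print({v: clustering_(G, v) for v in G})  # matches nx.clustering(G)
print(nx.clustering(G))
```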

2.4 Node Features: Graphlets

We can generalize clustering coefficient by counting \(\#( \text{pre-specified subgraphs})\) (i.e., graphlets).

Goal: describe network structure around node \(u\)

Graphlets are small subgraphs that describe the structure of node \(u\)'s network neighborhood

2.4.1 Analogy

  • Degree counts \(\#(\text{edges})\) that a node touches

  • Clustering coefficient counts \(\#(\text{triangles})\) that a node touches.

  • Graphlet Degree Vector (GDV): graphlet-based features for nodes; the GDV counts \(\#(\text{graphlets})\) that a node touches

Considering graphlets of size 2-5 nodes we get:

  • Vector of 73 coordinates is a signature of a node that describes the topology of the node's neighborhood

Graphlet degree vector provides a measure of a node's local network topology:

  • Comparing vectors of two nodes provides a more detailed measure of local topological similarity than node degrees or clustering coefficient.

2.4.2 Induced Subgraph & Isomorphism

An induced subgraph is another graph, formed from a subset of vertices and all of the edges connecting the vertices in that subset.

Graph Isomorphism : Two graphs that contain the same number of nodes connected in the same way are said to be isomorphic.

Graphlets : Rooted connected induced non-isomorphic subgraphs.

Graphlet Degree Vector (GDV) : A count vector of graphlets rooted at a given node.

2.5 Summary

Importance-based features : capture the importance of a node in a graph

  • Node degree: Simply counts the number of neighboring nodes

  • Different node centrality measures

    • Models importance of neighboring nodes in a graph

    • Different modeling choices: eigenvector centrality, betweenness centrality, closeness centrality

Structure-based features : Capture topological properties of local neighborhood around a node.

  • Node degree : Counts the number of neighboring nodes

  • Clustering coefficient : Measures how connected neighboring nodes are

  • Graphlet count vector : Counts the occurrences of different graphlets

3. Link-level Prediction Task and Features

Link-level prediction task is to predict new links based on the existing links.

At test time, node pairs (with no existing links) are ranked, and top \(K\) node pairs are predicted.

The key is to design features for a pair of nodes.

Two formulations of the link prediction task:

  • Links missing at random : Remove a random set of links and then aim to predict them

  • Links over time : Given \(G[t_0, t_0']\), a graph defined by edges up to time \(t_0'\), output a ranked list \(L\) of edges (not in \(G[t_0, t_0']\)) that are predicted to appear in \(G[t_1, t_1']\)

Evaluation :

  • \(n = |E_{\text{new}}|\): # new edges that appear during the test period \([t_1, t_1']\)

  • Take top \(n\) elements of \(L\) and count correct edges

Methodology:

  • For each pair of nodes \((x,y)\) compute score \(c(x,y)\)

    • For example, \(c(x,y)\) could be the # of common neighbors of \(x\) and \(y\)

  • Sort pairs \((x,y)\) by the decreasing score \(c(x,y)\)

  • Predict top \(n\) pairs as new links

  • See which of these links actually appear in \(G[t_1, t_1']\)
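The steps above can be sketched as follows (an assumed example on the karate-club graph, with the common-neighbor count standing in for the score \(c(x,y)\)):

```python
import networkx as nx
from itertools import combinations

G = nx.karate_club_graph()  # assumed example graph

# score every unconnected pair by its number of common neighbors
scores = {
    (x, y): len(list(nx.common_neighbors(G, x, y)))
    for x, y in combinations(G.nodes(), 2)
    if not G.has_edge(x, y)
}

# sort pairs by decreasing score and predict the top-n as new links
n = 5
top_n = sorted(scores, key=scores.get, reverse=True)[:n]
print(top_n)
```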

3.2 Link-Level Features

Three ways to design features for a pair of nodes:

  • Distance-based features

  • Local neighborhood overlap

  • Global neighborhood overlap

3.2.1 Distance-Based Features

Shortest-path distance between two nodes.

However, this does not capture the degree of neighborhood overlap

3.2.2 Local Neighborhood Overlap

Captures # neighboring nodes shared between two nodes \(v_1\) and \(v_2\)

Limitation of local neighborhood features

  • Metric is always zero if the two nodes do not have
    any neighbors in common

(1) Common neighbors

\[|N(v_1) \cap N(v_2)| \]

where \(N(v_1)\) represents the set of nodes adjacent to node \(v_1\).

Code

networkx.common_neighbors(G, u, v) returns an iterator over the common neighbors of u and v

networkx.common_neighbor_centrality(G, ebunch=None, alpha=0.8): the CCPA score, given by

\[\alpha \cdot \left( \left| N (u) \cap N (v) \right| \right) + (1 - \alpha ) \cdot \frac{N}{{d}_{uv}} \]

  • where \(N (u) \cap N (v)\) denotes the intersection of \(u\)'s neighbors and \(v\)'s neighbors

  • \(\alpha\) is a parameter varying in \([0,1]\)

  • \(N\) is the total number of nodes

  • \(d_{uv}\) is the shortest distance between node \(u\) and \(v\)

Example: Common neighbors

import networkx as nx

def common_neighbor_(G):
    '''
    Params:
    -------
    G : graph

    Returns:
    --------
    cn : dict
         key is a 2-tuple (u, v); value is the number of common neighbors
    '''
    cn = {}
    for edge in G.edges:
        u, v = edge
        neigh = nx.common_neighbors(G, u, v)
        cn[(u, v)] = len(list(neigh))
    return cn
# ==================================================================

# compute the CN index of edge (0, 1)
cn = nx.common_neighbor_centrality(G, alpha=1., ebunch=[(0, 1)])
print(list(cn))
# compute the CN index of all non-adjacent node pairs (default ebunch)
cn1 = nx.common_neighbor_centrality(G, alpha=1.)
print(list(sorted(cn1)))

cn2 = common_neighbor_(G)
print(cn2)

# test
total_error = 0.
for edge in G.edges:
    cn1 = nx.common_neighbor_centrality(G, alpha=1., ebunch=[edge])
    cn1 = list(cn1)[0][2]
    delta = abs(cn2[edge] - cn1)
    total_error = total_error + delta
print(total_error)

(2) Jaccard's coefficient

\[\frac{ \left| N (v_{1}) \cap N (v_{2}) \right| }{ \left |N (v_{1}) \cup N (v_{2}) \right|} \]

代码 networkx.jaccard_coefficient(G, ebunch=None)

Example: Jaccard's coefficient

def jaccard_coefficient_(G):
    '''
    Params:
    -------
    G : graph

    Returns:
    --------
    jc : dict
         key is a 2-tuple (u, v); value is the Jaccard's coefficient
    '''
    jc = {}
    for edge in G.edges:
        u, v = edge
        neigh_u = set(G.neighbors(u))
        neigh_v = set(G.neighbors(v))
        jc[(u, v)] = len(neigh_u.intersection(neigh_v)) / len(neigh_u.union(neigh_v))
    return jc
# ============================================================================

# compute the JC index of edge (0, 1)
jc = nx.jaccard_coefficient(G, ebunch=[(0, 1)])
print(list(jc))
# compute the JC index of all non-adjacent node pairs (default ebunch)
jc1 = nx.jaccard_coefficient(G)
print(list(jc1))

jc2 = jaccard_coefficient_(G)
print(jc2)

# test
total_error = 0.
for edge in G.edges:
    jc1 = nx.jaccard_coefficient(G, ebunch=[edge])
    jc1 = list(jc1)[0][2]
    delta = abs(jc2[edge] - jc1)
    total_error = total_error + delta
print(total_error)

(3) Adamic-Adar index

\[\sum_{u \in N\left(v_{1}\right) \cap N\left(v_{2}\right)} \frac{1}{\log \left(k_{u}\right)} = \sum_{u \in N\left(v_{1}\right) \cap N\left(v_{2}\right)} \frac{1}{\log \left( |N(u)| \right)} \]

where \(k_u\) represents the number of neighbors of node \(u\), i.e. \(k_u = |N(u)|\)

代码 networkx.adamic_adar_index(G, ebunch=None)

Example: Adamic-Adar index

import numpy as np

def adamic_adar_index_(G):
    '''
    Params:
    -------
    G : graph

    Returns:
    --------
    aai : dict
          key is a 2-tuple (u, v); value is the Adamic-Adar index
    '''
    aai = {}
    for edge in G.edges:
        u, v = edge
        common_neigh = list(nx.common_neighbors(G, u, v))
        if len(common_neigh) == 0:
            aai[(u, v)] = 0.
        else:
            aai_node = [1 / np.log(len(list(G.neighbors(node)))) for node in common_neigh]
            aai[(u, v)] = sum(aai_node)
    return aai
# ================================================================================

# compute the AA index of edge (0, 1)
aai = nx.adamic_adar_index(G, ebunch=[(0, 1)])
print(list(aai))
# compute the AA index of all non-adjacent node pairs (default ebunch)
aai1 = nx.adamic_adar_index(G)
print(list(aai1))

aai2 = adamic_adar_index_(G)
print(aai2)

total_error = 0.
for edge in G.edges:
    aai1 = nx.adamic_adar_index(G, ebunch=[edge])
    aai1 = list(aai1)[0][2]
    delta = abs(aai2[edge] - aai1)
    total_error = total_error + delta
print(total_error)

3.2.3 Global Neighborhood Overlap

Global neighborhood overlap metrics resolve this limitation by considering the entire graph.

(1) Katz index

Katz index : count the number of walks (\(\#\text{walk}\)) of all lengths between a given pair of nodes.

Powers of the graph adjacency matrix : Computing #walks between two nodes

  • \(A_{uv} = 1\) if \(u \in N(v)\)

  • Let \(P_{uv}^{(K)} = \# \text{walks}\) of length \(K\) between \(u\) and \(v\), i.e., the number of \(K\)-step walks from node \(u\) to node \(v\) (cf. the Floyd-Warshall-style dynamic program).

  • \(P^{(K)} = A^K\)

Compute \(P_{uv}^{(2)}\):

  • Step 1 : Compute \(\# \text{walk}\) of length 1 between each of \(u\)'s neighbor and \(v\)

  • Step 2 : Sum up these \(\# \text{walks}\) across \(u\)'s neighbors

    • \(P^{(2)}_{uv} = \sum \limits_i A_{ui} \cdot P_{iv}^{(1)} = \sum \limits_i A_{ui} \cdot A_{iv} = A^2_{u v}\)

Compute \(\#\text{walk}\) between two nodes: use adjacency matrix powers

  • \(A_{uv}\) specifies \(\#\text{walk}\) of length \(1\) (direct neighborhood) between \(u\) and \(v\).

  • \(A_{uv}^{2}\) specifies \(\#\text{walk}\) of length \(2\) (neighbor of neighbor) between \(u\) and \(v\)

  • \(A_{uv}^{l}\) specifies \(\#\text{walk}\) of length \(l\)
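The walk-counting property of adjacency powers can be verified numerically on a small assumed graph:

```python
import networkx as nx

G = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3)])  # assumed toy graph
A = nx.to_numpy_array(G)

A2 = A @ A
print(A2[0, 2])  # 1.0 : the single length-2 walk 0-1-2
print(A2[0, 3])  # 1.0 : the single length-2 walk 0-2-3
print(A2[0, 0])  # 2.0 : length-2 walks returning to 0, i.e. the degree of 0
```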

Katz index between \(v_1\) and \(v_2\) (from node \(v_1\) to node \(v_2\)) is calculated as sum over all walk lengths

\[S_{v_1 v_2} = \sum_{l=1}^{\infty} \beta^l \, A_{v_1 v_2}^{l} \]

where \(A_{v_1 v_2}^{l}\) is the \(\#(\text{walk})\) of length \(l\) between \(v_1\) and \(v_2\); \(0 < \beta < 1\) is the discount factor.

Katz index matrix is computed in closed-form:

\[S = \sum_{i=1}^{\infty} \beta^i A^i = (I - \beta A)^{-1} - I \]

where \((I - \beta A)^{-1} = \sum \limits_{i=0}^{\infty} \beta^i A^i\) by the geometric series of matrices, which converges when \(\beta < 1 / \lambda_{\max}\) (Hamilton, 2020, p. 18):

\[\sum_{i=0}^{\infty} \beta^i A^i = I + \beta A + \beta^2 A^2 + \cdots = (I - \beta A)^{-1} \]

Katz index of nodes (a.k.a. Katz centrality)

  • For a directed graph, it aggregates walks along incoming edges (i.e., from all other nodes to node \(i\)):

\[x_i = \sum_{l=1}^{\infty} \sum_{j=1}^{N} \beta^l (A^l)_{ji} = \sum_{j=1}^{N} S_{ji} \]

matrix form

\[\boldsymbol{x} = ((I - \beta A^\top)^{-1} - I) \boldsymbol{1} = (I - \beta A^\top)^{-1} \boldsymbol{1} - \boldsymbol{1} \]

where \(\boldsymbol{1}\) is the all-one vector.
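A sketch checking the closed form against a truncated series, and computing the Katz centrality vector (the toy graph and \(\beta\) are assumed; \(\beta\) must satisfy \(\beta < 1/\lambda_{\max}\)):

```python
import networkx as nx
import numpy as np

G = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3)])  # assumed toy graph
A = nx.to_numpy_array(G)
N = A.shape[0]
beta = 0.1  # assumed; must satisfy beta < 1 / lambda_max for convergence

# closed form: S = (I - beta A)^{-1} - I
S = np.linalg.inv(np.eye(N) - beta * A) - np.eye(N)

# truncated series: sum_{l=1}^{L} beta^l A^l approaches S as L grows
S_trunc = sum(beta ** l * np.linalg.matrix_power(A, l) for l in range(1, 30))
print(np.allclose(S, S_trunc, atol=1e-10))  # True

# Katz centrality: column sums of S (here A = A^T, so axis choice is moot)
x = S.sum(axis=0)
print(x)
```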

3.3 Summary

Distance-based features:

  • Uses the shortest path length between two nodes but does not capture how neighborhoods overlap.

Local neighborhood overlap:

  • Captures how many neighboring nodes are shared by two nodes.

  • Becomes zero when no neighbor nodes are shared.

Global neighborhood overlap:

  • Uses global graph structure to score two nodes.

  • Katz index counts \(\#\text{walks}\) of all lengths between two nodes

4. Graph-Level Features and Graph Kernels

4.1 Kernel Methods

Kernel methods are widely-used for traditional ML for graph-level prediction.

Idea : Design kernels instead of feature vectors.

A quick introduction to Kernels:

  • A kernel \(\kappa(G, G') \in \mathbb{R}\) measures the similarity between two graphs

  • The kernel matrix \(\mathbf{K} = [\kappa(G, G')]_{G, G'}\) must always be positive semidefinite (i.e., have non-negative eigenvalues)

  • There exists a feature representation \(\phi(\cdot)\) such that \(\kappa(G, G') = \phi(G)^{\top} \phi(G')\)

  • Once the kernel is defined, off-the-shelf ML models, such as the kernel SVM, can be used to make predictions.
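A minimal numeric illustration of the positive-semidefinite property: any kernel of the form \(\kappa(G, G') = \phi(G)^{\top}\phi(G')\) yields a PSD kernel matrix (the random rows below are placeholders for real graph feature vectors, assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.random((5, 8))  # rows play the role of feature vectors phi(G)

K = Phi @ Phi.T           # K[i, j] = phi(G_i)^T phi(G_j)

# K is symmetric positive semidefinite: all eigenvalues are >= 0
print(np.allclose(K, K.T))
print(np.linalg.eigvalsh(K).min())  # non-negative up to round-off
```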

4.2 Graph-Level Features: Graph Kernel

Goal : We want features that characterize the structure of an entire graph

4.2.1 Overview

Graph Kernels : Measure similarity between two graphs:

  • Graphlet Kernel: Shervashidze, Nino, et al. "Efficient graphlet kernels for large graph comparison." Artificial Intelligence and Statistics. 2009

  • Weisfeiler-Lehman Kernel: Shervashidze, Nino, et al. "Weisfeiler-Lehman graph kernels." Journal of Machine Learning Research 12 (2011).

  • Other kernels

    • Random-walk kernel

    • Shortest-path graph kernel

    • And many more ...

4.2.2 Graph Kernel: Key Idea

Goal : Design graph feature vector \(\phi(G)\)

Key idea : Bag-of-Words (BoW) for a graph

  • Bag-of-Words (BoW): simply uses the word counts as features for documents (no ordering considered).

  • Naïve extension to a graph: Regard nodes as words.

  • Since both graphs have 4 red nodes, we get the same feature vector for two different graphs.

A better alternative, Bag of node degrees : count the nodes of each degree, rather than the nodes themselves.

Both Graphlet Kernel and Weisfeiler-Lehman (WL) Kernel use Bag-of-* representation of graph, where * is more sophisticated than node degrees!

4.3 Graphlet Features

Key idea : Count the number of different graphlets in a graph.

Graphlet : Note that the definition of graphlets here is slightly different from the one used for node-level features.

  • The two differences are:

    • Nodes in graphlets here do not need to be connected (allows for isolated nodes)

    • The graphlets here are not rooted.

  • Examples: Let \(\mathcal{G}_{k} = (g_1, g_2, \cdots, g_{n_k})\) be a list of graphlets of size \(k\)

    • For \(k=3\), there are 4 graphlets

    • For \(k=4\), there are 11 graphlets

Graphlet count vector : Given graph \(G\), and a graphlet list \(\mathcal{G}_{k} = (g_1, g_2, \cdots, g_{n_k})\), define the graphlet count vector \(\boldsymbol{f}_G \in \mathbb{R}^{n_k}\) as:

\[(\boldsymbol{f}_G)_i = \#(g_i \subseteq G), \quad \forall i = 1,2,\cdots, n_k \]


4.3.1 Graphlet kernel

Graphlet kernel : Given two graphs, \(G\) and \(G'\), graphlet kernel is computed as:

\[\kappa(G, G') = \boldsymbol{f}_G^{\top} \, \boldsymbol{f}_{G'} \]

Problem: if \(G\) and \(G'\) have different sizes, that will greatly skew the value.

Solution: normalize each feature vector

\[\kappa(G, G') = \boldsymbol{h}_G^{\top} \, \boldsymbol{h}_{G'}, \qquad \text{where} \ \boldsymbol{h}_G = \frac{\boldsymbol{f}_G}{\text{Sum}(\boldsymbol{f}_G)} \]
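A sketch of the normalized graphlet kernel for size-3 graphlets (for three nodes, the number of induced edges uniquely determines the graphlet; the toy graphs are assumed):

```python
import networkx as nx
import numpy as np
from itertools import combinations

def graphlet_count_3(G):
    '''Induced size-3 graphlet counts: [no edge, 1 edge, 2-edge path, triangle].'''
    f = np.zeros(4)
    for trio in combinations(G.nodes(), 3):
        m = G.subgraph(trio).number_of_edges()  # 0..3 edges fix the graphlet type
        f[m] += 1
    return f

G1 = nx.cycle_graph(4)  # assumed toy graphs
G2 = nx.path_graph(4)

f1, f2 = graphlet_count_3(G1), graphlet_count_3(G2)
h1, h2 = f1 / f1.sum(), f2 / f2.sum()   # normalize before taking the kernel
print(f1, f2)            # [0. 0. 4. 0.] [0. 2. 2. 0.]
print(float(h1 @ h2))    # 0.5
```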

Limitations: Counting graphlets is expensive! Counting size-\(k\) graphlets of a graph with \(n\) nodes by enumeration takes \(O(n^k)\) time.

4.3.2 Weisfeiler-Lehman Kernel

Goal : Design an efficient graph feature descriptor \(\phi(G)\)

Idea : Use neighborhood structure to iteratively enrich the node vocabulary.

  • Generalized version of Bag of node degrees since node degrees are one-hop neighborhood information.

Algorithm to achieve this: Color refinement

4.3.2.1 Color refinement

Given : A graph \(G\) with a set of nodes \(V\).

  • Assign an initial color \(c^{(0)}(v)\) to each node \(v\)

  • Iteratively refine node colors by

    \[c^{(k+1)}(v) = \text{HASH} \left(\left\{c^{(k)}(v),\left\{c^{(k)}(u) \right\}_{u \in N(v)}\right\}\right) = \text{HASH} \left(\left\{c^{(k)}(u) \right\}_{u \in \{ v \} \, \cup \, N(v) } \right) \]

    where \(\text{HASH}\) maps different inputs to different colors

  • After \(K\) steps of color refinement, \(c^{(K)}(v)\) summarizes the structure of the \(K\)-hop neighborhood.
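A minimal sketch of color refinement (the integer relabeling plays the role of HASH; the toy graph is assumed for illustration):

```python
import networkx as nx

def color_refinement(G, K=3):
    '''Run K rounds of WL color refinement; return the final node colors.'''
    colors = {v: 0 for v in G}  # initial color c^(0)(v), identical for all nodes
    for _ in range(K):
        # signature = own color plus the sorted multiset of neighbor colors
        sig = {v: (colors[v], tuple(sorted(colors[u] for u in G.neighbors(v))))
               for v in G}
        # relabel distinct signatures with small integers (the HASH step)
        table = {s: i for i, s in enumerate(sorted(set(sig.values())))}
        colors = {v: table[sig[v]] for v in G}
    return colors

G = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3)])  # assumed toy graph
print(color_refinement(G))  # nodes 0 and 1 share a color; 2 and 3 each differ
```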


4.3.2.2 Weisfeiler-Lehman Graph Features

After color refinement, WL kernel counts number of nodes with a given color.

  • WL kernel is computationally efficient

    • The time complexity for color refinement at each step is linear in \(\#(\text{edges})\), since it involves aggregating neighboring colors.

  • When computing a kernel value, only colors that appear in the two graphs need to be tracked.

    • Thus, \(\#(\text{colors})\) is at most the total number of nodes.

  • Counting colors takes linear time w.r.t. \(\#(\text{nodes})\).

  • In total, time complexity is linear in \(\#(\text{edges})\).

4.4 Summary

Graphlet Kernel

  • Graph is represented as Bag-of-graphlets

  • Computationally expensive

Weisfeiler-Lehman Kernel

  • Apply \(K\)-step color refinement algorithm to enrich node colors

  • Different colors capture different \(K\)-hop neighborhood structures

  • Graph is represented as Bag-of-colors

  • Computationally efficient

  • Closely related to Graph Neural Networks

5. Summary

Traditional ML Pipeline

  • Hand-crafted feature + ML model

Hand-crafted features for graph data

  • Node-level:

    • Node degree, centrality, clustering coefficient, graphlets

  • Link-level:

    • Distance-based feature

    • local/global neighborhood overlap

  • Graph-level:

    • Graphlet kernel, WL kernel

References:

Katz centrality, Wikipedia

posted @ 2022-07-08 20:20 veager