Machine Learning with Graphs: 2 Traditional Methods for ML on Graphs
Stanford CS224W: Machine Learning with Graphs, Fall 2021
1. Introduction
The traditional ML pipeline uses hand-designed features. Traditional features for:
- Node-level prediction
- Link-level prediction
- Graph-level prediction
Machine Learning in Graphs
Goal: Make predictions for a set of objects
Design choices:
- Features: \(d\)-dimensional vectors
- Objects: Nodes, edges, sets of nodes, entire graphs
- Objective function: What task are we aiming to solve?
Example: node-level prediction
- Given: \(G = (V, E)\)
- Learn a function: \(f: V \to \mathbb{R}\)
2. Node-Level Tasks and Features
Goal: Characterize the structure and position of a node in the network:
- Node degree
- Node centrality
- Clustering coefficient
- Graphlets
2.1 Node Features: Node Degree
- The degree \(k_v\) of node \(v\) is the number of edges (equivalently, neighboring nodes) the node has.
- Node degree treats all neighboring nodes equally.
- For a directed graph, we distinguish in-degree and out-degree.
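As a quick illustration, node degrees can be read off directly with networkx (the toy graph below is an assumption for illustration):

```python
import networkx as nx

# Assumed toy graph: a 4-node path plus one chord
G = nx.Graph([(0, 1), (1, 2), (2, 3), (1, 3)])

# Node degree k_v: the number of edges incident to v
degrees = dict(G.degree())
print(degrees)  # {0: 1, 1: 3, 2: 2, 3: 2}

# For directed graphs, in- and out-degree are separate features
D = nx.DiGraph([(0, 1), (2, 1)])
print(D.in_degree(1), D.out_degree(1))  # 2 0
```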
2.2 Node Features: Node Centrality
Node degree counts the neighboring nodes without capturing their importance. Node centrality \(c_v\) takes the importance of a node in a graph into account.
Different ways to model importance:
- Eigenvector centrality
- Betweenness centrality
- Closeness centrality
- and many others ...
(1) Eigenvector centrality
A node \(v\) is important if it is surrounded by important neighboring nodes \(u \in N(v)\). We model the centrality of node \(v\) as the (normalized) sum of the centralities of its neighboring nodes:
\[c_v = \frac{1}{\lambda} \sum_{u \in N(v)} c_u\]
where:
- \(N(v)\) is the set of neighbors of node \(v\),
- \(\lambda\) is a normalization constant (it will turn out to be the largest eigenvalue of \(A\)).
Notice that the above equation models centrality in a recursive manner. Rewriting the recursive equation in matrix form:
\[\lambda \boldsymbol{c} = \boldsymbol{A} \boldsymbol{c}\]
where:
- \(\boldsymbol{A}\) is the adjacency matrix, with \(\boldsymbol{A}_{uv} = 1\) if \(u \in N(v)\),
- \(\boldsymbol{c}\) is the centrality vector, \(\boldsymbol{c} = [c_1, c_2, \cdots, c_v, \cdots, c_N]^{\top}\),
- \(\lambda\) is an eigenvalue of \(\boldsymbol{A}\).
We can see that:
- The centrality vector \(\boldsymbol{c}\) is an eigenvector of \(\boldsymbol{A}\).
- For a connected graph, the largest eigenvalue \(\lambda_{\max}\) is always positive and unique (by the Perron-Frobenius theorem).
- The eigenvector \(\boldsymbol{c}_{\max}\) corresponding to \(\lambda_{\max}\) is used for centrality; its entries can be taken to be nonnegative.
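The eigenvector computation above can be sketched with numpy and cross-checked against networkx's built-in routine (the toy graph is an assumption for illustration):

```python
import numpy as np
import networkx as nx

G = nx.Graph([(0, 1), (1, 2), (2, 3), (1, 3)])  # assumed toy graph
A = nx.to_numpy_array(G)

# c is the eigenvector of A for the largest eigenvalue lambda_max
eigvals, eigvecs = np.linalg.eigh(A)        # A is symmetric for an undirected graph
c = np.abs(eigvecs[:, np.argmax(eigvals)])  # Perron-Frobenius: entries one-signed
c = c / np.linalg.norm(c)

# networkx's power-iteration implementation agrees (also unit Euclidean norm)
c_nx = nx.eigenvector_centrality(G, max_iter=1000)
print(np.allclose(c, [c_nx[v] for v in G.nodes()], atol=1e-4))  # True
```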
(2) Betweenness centrality
A node is important if it lies on many shortest paths between other nodes:
\[c_v = \sum_{s \neq v \neq t} \frac{\#(\text{shortest paths between } s \text{ and } t \text{ that contain } v)}{\#(\text{shortest paths between } s \text{ and } t)}\]
Note that there may exist more than one shortest path between \(s\) and \(t\).
Code
nx.betweenness_centrality(G, k=None, normalized=True, weight=None, endpoints=False, seed=None)
Parameters:
- normalized (bool, default=True): if True, the betweenness values are normalized by \(2/((N-1)(N-2))\) for undirected graphs and by \(1/((N-1)(N-2))\) for directed graphs.
- weight (None or string, optional, default=None): the edge attribute used as weight when computing weighted shortest paths (i.e., distances). If None, all edge weights are considered equal.
nx.betweenness_centrality(G)      # returns a dict: node -> betweenness value
nx.betweenness_centrality(G)[0]   # betweenness value of node 0
(3) Closeness centrality
A node is important if it has small shortest-path distances to all other nodes:
\[c_v = \frac{1}{\sum_{u \neq v} \text{shortest-path distance between } u \text{ and } v}\]
Code
nx.closeness_centrality(G, u=None, distance=None, wf_improved=True), which computes the normalized variant \(C(u) = \frac{n-1}{\sum_{v=1}^{n-1} d(v, u)}\), where \(d(v, u)\) is the shortest-path distance between \(v\) and \(u\), and \(n\) is the number of nodes that can reach \(u\).
nx.closeness_centrality(G, 0)  # closeness of node 0 (a float)
nx.closeness_centrality(G)     # closeness of all nodes (a dict)
2.3 Node Features: Clustering Coefficient
Clustering coefficient: measures how connected \(v\)'s neighboring nodes are:
\[e_v = \frac{\#(\text{edges among neighboring nodes})}{\binom{k_v}{2}} \in [0, 1]\]
where \(\binom{k_v}{2} = \frac{k_v (k_v - 1)}{2}\) is the number of node pairs among the \(k_v\) neighboring nodes (for an undirected graph, \(k_v\) is the degree of node \(v\)).
Observation: the clustering coefficient counts the \(\#(\text{triangles})\) in the ego-network of \(v\) (the subgraph induced by \(v\) and its neighbors).
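A minimal sketch of this definition, checked against nx.clustering (the graph is an assumption for illustration):

```python
import networkx as nx
from itertools import combinations

G = nx.Graph([(0, 1), (0, 2), (1, 2), (1, 3)])  # one triangle {0, 1, 2}

def clustering_(G, v):
    """e_v = #(edges among v's neighbors) / C(k_v, 2)."""
    neigh = list(G.neighbors(v))
    k = len(neigh)
    if k < 2:
        return 0.0
    # count edges among the k_v neighbors
    links = sum(1 for u, w in combinations(neigh, 2) if G.has_edge(u, w))
    return links / (k * (k - 1) / 2)

# node 1 has neighbors {0, 2, 3}; only the pair (0, 2) is connected -> 1/3
print(clustering_(G, 1), nx.clustering(G, 1))
```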
2.4 Node Features: Graphlets
We can generalize clustering coefficient by counting \(\#( \text{pre-specified subgraphs})\) (i.e., graphlets).
Goal: describe network structure around node \(u\)
Graphlets are small subgraphs that describe the structure of node \(u\)'s network neighborhood
2.4.1 Analogy
- Degree counts \(\#(\text{edges})\) that a node touches.
- Clustering coefficient counts \(\#(\text{triangles})\) that a node touches.
- Graphlet Degree Vector (GDV): graphlet-based features for nodes; GDV counts \(\#(\text{graphlets})\) that a node touches.
Considering graphlets of size 2-5 nodes we get:
- A vector of 73 coordinates, a signature of a node that describes the topology of the node's neighborhood.
The graphlet degree vector provides a measure of a node's local network topology:
- Comparing the vectors of two nodes provides a more detailed measure of local topological similarity than node degrees or the clustering coefficient.
2.4.2 Induced Subgraph & Isomorphism
Induced subgraph: a graph formed from a subset of the vertices together with all of the edges connecting the vertices in that subset.
Graph isomorphism: two graphs that contain the same number of nodes connected in the same way are said to be isomorphic.
Graphlets: rooted, connected, induced, non-isomorphic subgraphs.
Graphlet Degree Vector (GDV): a count vector of graphlets rooted at a given node.
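These definitions can be illustrated with networkx (the toy graph is an assumption for illustration):

```python
import networkx as nx

G = nx.Graph([(0, 1), (1, 2), (2, 0), (2, 3)])  # a triangle with a pendant node

# Induced subgraph on {0, 1, 2}: all edges among these nodes are kept (a triangle)
H = G.subgraph([0, 1, 2])
print(H.number_of_edges())  # 3

# Graph isomorphism: the induced subgraph on {1, 2, 3} is the path 1-2-3
print(nx.is_isomorphic(G.subgraph([1, 2, 3]), nx.path_graph(3)))  # True
```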
2.5 Summary
Importance-based features: capture the importance of a node in a graph
- Node degree: simply counts the number of neighboring nodes
- Different node centrality measures:
  - Model the importance of neighboring nodes in a graph
  - Different modeling choices: eigenvector centrality, betweenness centrality, closeness centrality
Structure-based features: capture topological properties of the local neighborhood around a node
- Node degree: counts the number of neighboring nodes
- Clustering coefficient: measures how connected neighboring nodes are
- Graphlet count vector: counts the occurrences of different graphlets
3. Link-level Prediction Task and Features
3.1 Link Prediction: Overview
Link-level prediction task is to predict new links based on the existing links.
At test time, node pairs (with no existing links) are ranked, and top \(K\) node pairs are predicted.
The key is to design features for a pair of nodes.
3.1.1 Link Prediction as a Task
Two formulations of the link prediction task:
- Links missing at random: remove a random set of links and then aim to predict them.
- Links over time: given \(G[t_0, t_0']\), a graph defined by edges up to time \(t_0'\), output a ranked list \(L\) of edges (not in \(G[t_0, t_0']\)) that are predicted to appear in \(G[t_1, t_1']\).
Evaluation:
- \(n = |E_{\text{new}}|\): the number of new edges that appear during the test period \([t_1, t_1']\)
- Take the top \(n\) elements of \(L\) and count the correct edges.
3.1.2 Link Prediction via Proximity
Methodology:
- For each pair of nodes \((x, y)\), compute a score \(c(x, y)\). For example, \(c(x, y)\) could be the number of common neighbors of \(x\) and \(y\).
- Sort pairs \((x, y)\) by decreasing score \(c(x, y)\).
- Predict the top \(n\) pairs as new links.
- See which of these links actually appear in \(G[t_1, t_1']\).
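The methodology above can be sketched end-to-end with a common-neighbor score (the graph and the choice of \(n\) are assumptions for illustration; in practice the graph would be \(G[t_0, t_0']\) from data):

```python
import networkx as nx

G = nx.Graph([(0, 1), (0, 2), (1, 2), (1, 3), (2, 4)])  # assumed toy graph

# 1. Score every currently non-adjacent node pair by its number of common neighbors
scores = {(u, v): len(list(nx.common_neighbors(G, u, v)))
          for u, v in nx.non_edges(G)}

# 2. Sort pairs by decreasing score and predict the top-n as new links
n = 2
ranked = sorted(scores, key=scores.get, reverse=True)
predicted = ranked[:n]
print(predicted)
```

In a temporal evaluation, the predicted pairs would then be compared against the edges that actually appear in \(G[t_1, t_1']\).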
3.2 Link-Level Features: Overview
- Distance-based features
- Local neighborhood overlap
- Global neighborhood overlap
3.2.1 Distance-Based Features
Shortest-path distance between two nodes.
However, this does not capture the degree of neighborhood overlap
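A minimal example of the shortest-path distance feature (the toy graph is an assumption for illustration):

```python
import networkx as nx

G = nx.Graph([(0, 1), (1, 2), (2, 3), (0, 2)])

# Pairwise feature: shortest-path distance between two nodes
print(nx.shortest_path_length(G, 0, 3))  # 2, via the path 0-2-3

# All-pairs shortest-path lengths as a dict of dicts
dist = dict(nx.shortest_path_length(G))
print(dist[1][3])  # 2, via the path 1-2-3
```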
3.2.2 Local Neighborhood Overlap
Captures the number of neighboring nodes shared between two nodes \(v_1\) and \(v_2\).
Limitation of local neighborhood features:
- The metric is always zero if the two nodes have no neighbors in common.
(1) Common neighbors
\[c(v_1, v_2) = |N(v_1) \cap N(v_2)|\]
where \(N(v_1)\) represents the set of nodes adjacent to node \(v_1\).
Code
networkx.common_neighbors(G, u, v): returns the common neighbors of \(u\) and \(v\).
networkx.common_neighbor_centrality(G, ebunch=None, alpha=0.8): the CCPA score,
\[\text{CCPA}(u, v) = \alpha \cdot |N(u) \cap N(v)| + (1 - \alpha) \cdot \frac{N}{d_{uv}}\]
where:
- \(N(u) \cap N(v)\) denotes the intersection of \(u\)'s neighbors and \(v\)'s neighbors,
- \(\alpha\) is a parameter that varies in \([0, 1]\),
- \(N\) is the total number of nodes,
- \(d_{uv}\) is the shortest-path distance between nodes \(u\) and \(v\).
Example: Common neighbors
import networkx as nx

def common_neighbor_(G):
    '''
    Params:
    -------
    G : graph
    Returns:
    --------
    cn : dict
        key is a 2-tuple (u, v); value is the number of common neighbors
    '''
    cn = {}
    for edge in G.edges:
        u, v = edge
        neigh = nx.common_neighbors(G, u, v)
        cn[(u, v)] = len(list(neigh))
    return cn
# ==================================================================
# compute the CN index of edge (0, 1)
cn = nx.common_neighbor_centrality(G, alpha=1., ebunch=[(0, 1)])
print(list(cn))
# compute the CN index of all edges
cn1 = nx.common_neighbor_centrality(G, alpha=1.)
print(list(sorted(cn1)))
cn2 = common_neighbor_(G)
print(cn2)
# test: with alpha=1, the CCPA score reduces to the common-neighbor count, so total_error should be 0
total_error = 0.
for edge in G.edges:
cn1 = nx.common_neighbor_centrality(G, alpha=1., ebunch=[edge])
cn1 = list(cn1)[0][2]
delta = abs(cn2[edge] - cn1)
total_error = total_error + delta
print(total_error)
(2) Jaccard's coefficient
\[c(v_1, v_2) = \frac{|N(v_1) \cap N(v_2)|}{|N(v_1) \cup N(v_2)|}\]
Code: networkx.jaccard_coefficient(G, ebunch=None)
Example: Jaccard's coefficient
def jaccard_coefficient_(G):
'''
Params:
-------
G : graph
Returns:
--------
jc : dict
key is a 2-tuple (u, v); value is the Jaccard coefficient
'''
jc = {}
for edge in G.edges:
u, v = edge
neigh_u = set(G.neighbors(u))
neigh_v = set(G.neighbors(v))
jc[(u, v)] = len(neigh_u.intersection(neigh_v)) / len(neigh_u.union(neigh_v))
return jc
# ============================================================================
# compute the JC index of edge (0, 1)
jc = nx.jaccard_coefficient(G, ebunch=[(0, 1)])
print(list(jc))
# compute the JC index of all edges
jc1 = nx.jaccard_coefficient(G)
print(list(jc1))
jc2 = jaccard_coefficient_(G)
print(jc2)
# test
total_error = 0.
for edge in G.edges:
jc1 = nx.jaccard_coefficient(G, ebunch=[edge])
jc1 = list(jc1)[0][2]
delta = abs(jc2[edge] - jc1)
total_error = total_error + delta
print(total_error)
(3) Adamic-Adar index
\[c(v_1, v_2) = \sum_{u \in N(v_1) \cap N(v_2)} \frac{1}{\log(k_u)}\]
where \(k_u\) represents the number of neighbors of node \(u\), i.e. \(k_u = |N(u)|\).
Code: networkx.adamic_adar_index(G, ebunch=None)
Example: Adamic-Adar index
import numpy as np

def adamic_adar_index_(G):
    '''
    Params:
    -------
    G : graph
    Returns:
    --------
    aai : dict
        key is a 2-tuple (u, v); value is the Adamic-Adar index
    '''
    aai = {}
    for edge in G.edges:
        u, v = edge
        common_neigh = list(nx.common_neighbors(G, u, v))
        if len(common_neigh) == 0:
            aai[(u, v)] = 0.
        else:
            # 1 / log(degree) for each common neighbor (natural log, as in networkx)
            aai_node = [1 / np.log(len(list(G.neighbors(node)))) for node in common_neigh]
            aai[(u, v)] = sum(aai_node)
    return aai
# ================================================================================
# compute the AA index of edge (0, 1)
aai = nx.adamic_adar_index(G, ebunch=[(0, 1)])
print(list(aai))
# compute the AA index of all edges
aai1 = nx.adamic_adar_index(G)
print(list(aai1))
aai2 = adamic_adar_index_(G)
print(aai2)
total_error = 0.
for edge in G.edges:
aai1 = nx.adamic_adar_index(G, ebunch=[edge])
aai1 = list(aai1)[0][2]
delta = abs(aai2[edge] - aai1)
total_error = total_error + delta
print(total_error)
3.2.3 Global Neighborhood Overlap
Global neighborhood overlap metrics resolve the limitation by considering the entire graph.
(1) Katz index
Katz index: counts the number of walks (\(\#\text{walks}\)) of all lengths between a given pair of nodes.
Powers of the graph adjacency matrix: computing \(\#\text{walks}\) between two nodes.
- \(A_{uv} = 1\) if \(u \in N(v)\)
- Let \(P_{uv}^{(K)} = \#\text{walks}\) of length \(K\) between \(u\) and \(v\), i.e., the number of ways to reach \(v\) from \(u\) in exactly \(K\) steps (a dynamic program over path lengths, in the spirit of the Floyd-Warshall algorithm).
- \(P^{(K)} = A^K\)
Compute \(P_{uv}^{(2)}\):
- Step 1: compute \(\#\text{walks}\) of length 1 between each of \(u\)'s neighbors and \(v\).
- Step 2: sum up these \(\#\text{walks}\) across \(u\)'s neighbors:
\[P^{(2)}_{uv} = \sum \limits_i A_{ui} \cdot P_{iv}^{(1)} = \sum \limits_i A_{ui} \cdot A_{iv} = A^2_{uv}\]
Compute \(\#\text{walks}\) between two nodes: use adjacency matrix powers.
- \(A_{uv}\) specifies \(\#\text{walks}\) of length \(1\) (direct neighborhood) between \(u\) and \(v\).
- \(A_{uv}^{2}\) specifies \(\#\text{walks}\) of length \(2\) (neighbor of neighbor) between \(u\) and \(v\).
- \(A_{uv}^{l}\) specifies \(\#\text{walks}\) of length \(l\).
The Katz index between \(v_1\) and \(v_2\) (from node \(v_1\) to node \(v_2\)) is calculated as a sum over all walk lengths:
\[S_{v_1 v_2} = \sum_{l=1}^{\infty} \beta^l A_{v_1 v_2}^{l}\]
where \(A_{v_1 v_2}^{l}\) is the \(\#(\text{walks})\) of length \(l\) between \(v_1\) and \(v_2\), and \(0 < \beta < 1\) is the discount factor.
The Katz index matrix is computed in closed form:
\[S = \sum_{l=1}^{\infty} \beta^l A^l = (I - \beta A)^{-1} - I\]
where \((I - \beta A)^{-1} = \sum \limits_{i=0}^{\infty} \beta^i A^i\) by the geometric series of matrices (Hamilton, 2020, p. 18).
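The closed form can be verified numerically against a truncated power series (the toy graph and \(\beta\) are assumptions; \(\beta\) must satisfy \(\beta < 1/\lambda_{\max}\) for the series to converge):

```python
import numpy as np
import networkx as nx

G = nx.Graph([(0, 1), (1, 2), (2, 0), (2, 3)])  # assumed toy graph
A = nx.to_numpy_array(G)
beta = 0.1  # discount factor; here beta * lambda_max < 1, so the series converges

# Closed form: S = sum_{l>=1} beta^l A^l = (I - beta A)^{-1} - I
I = np.eye(len(A))
S = np.linalg.inv(I - beta * A) - I

# Sanity check against a truncated power series
S_trunc = sum(beta**l * np.linalg.matrix_power(A, l) for l in range(1, 50))
print(np.allclose(S, S_trunc))  # True
```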
Katz index of nodes (a.k.a. Katz centrality):
- For a directed graph, it aggregates the incoming walks (i.e., from all other nodes to node \(i\)):
\[c_{\text{Katz}, i} = \sum_{l=1}^{\infty} \sum_{j=1}^{N} \beta^l \left(A^l\right)_{ji}\]
In matrix form:
\[\boldsymbol{c}_{\text{Katz}} = \left( (I - \beta A^{\top})^{-1} - I \right) \boldsymbol{1}\]
where \(\boldsymbol{1}\) is the all-one vector.
3.3 Summary: Link-Level Features
Distance-based features:
- Use the shortest path length between two nodes but do not capture how neighborhoods overlap.
Local neighborhood overlap:
- Captures how many neighboring nodes are shared by two nodes.
- Becomes zero when no neighbor nodes are shared.
Global neighborhood overlap:
- Uses the global graph structure to score two nodes.
- The Katz index counts \(\#\text{walks}\) of all lengths between two nodes.
4. Graph-Level Features and Graph Kernels
4.1 Kernel Methods
Kernel methods are widely used in traditional ML for graph-level prediction.
Idea: design kernels instead of feature vectors.
A quick introduction to kernels:
- A kernel \(\kappa(G, G') \in \mathbb{R}\) measures similarity between data points.
- The kernel matrix \(\mathbf{K} = [\kappa(G, G')]_{G, G'}\) must always be positive semidefinite (i.e., have nonnegative eigenvalues).
- There exists a feature representation \(\phi(\cdot)\) such that \(\kappa(G, G') = \phi(G)^{\top} \phi(G')\).
- Once the kernel is defined, off-the-shelf ML models, such as kernel SVMs, can be used to make predictions.
4.2 Graph-Level Features: Graph Kernel
Goal: we want features that characterize the structure of an entire graph.
4.2.1 Overview
Graph kernels measure similarity between two graphs:
- Graphlet kernel: Shervashidze, Nino, et al. "Efficient graphlet kernels for large graph comparison." Artificial Intelligence and Statistics, 2009.
- Weisfeiler-Lehman kernel: Shervashidze, Nino, et al. "Weisfeiler-Lehman graph kernels." Journal of Machine Learning Research 12.9 (2011).
- Other kernels: random-walk kernel, shortest-path graph kernel, and many more ...
4.2.2 Graph Kernel: Key Idea
Goal: design a graph feature vector \(\phi(G)\).
Key idea: Bag-of-Words (BoW) for a graph.
- Bag-of-Words (BoW) simply uses word counts as features for documents (no ordering considered).
- Naive extension to a graph: regard nodes as words. But this is too coarse: in the lecture example, both graphs have 4 red nodes, so we get the same feature vector for two different graphs.
- A slightly more refined idea: Bag of node degrees, i.e., count the nodes of each degree.
Both the graphlet kernel and the Weisfeiler-Lehman (WL) kernel use a Bag-of-* representation of a graph, where * is more sophisticated than node degrees!
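A minimal sketch of the Bag-of-node-degrees feature vector (the graphs and the max_deg cutoff are assumptions for illustration):

```python
import networkx as nx
import numpy as np

def degree_histogram_feature(G, max_deg=4):
    """Bag of node degrees: phi(G)[d] = #nodes of degree d."""
    phi = np.zeros(max_deg + 1)
    for _, d in G.degree():
        phi[d] += 1
    return phi

G1 = nx.path_graph(4)   # degrees 1, 2, 2, 1
G2 = nx.cycle_graph(4)  # degrees 2, 2, 2, 2
print(degree_histogram_feature(G1))  # [0. 2. 2. 0. 0.]
print(degree_histogram_feature(G2))  # [0. 0. 4. 0. 0.]
```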
4.3 Graphlet Features
Key idea: count the number of different graphlets in a graph.
Note that the definition of graphlets here is slightly different from the node-level features. The two differences are:
- Nodes in graphlets here do not need to be connected (isolated nodes are allowed).
- The graphlets here are not rooted.
Example: let \(\mathcal{G}_{k} = (g_1, g_2, \cdots, g_{n_k})\) be the list of graphlets of size \(k\):
- For \(k = 3\), there are 4 graphlets.
- For \(k = 4\), there are 11 graphlets.
Graphlet count vector: given a graph \(G\) and a graphlet list \(\mathcal{G}_{k} = (g_1, g_2, \cdots, g_{n_k})\), define the graphlet count vector \(\boldsymbol{f}_G \in \mathbb{R}^{n_k}\) as:
\[(\boldsymbol{f}_G)_i = \#(g_i \subseteq G), \quad i = 1, 2, \ldots, n_k\]
4.3.1 Graphlet kernel
Graphlet kernel: given two graphs \(G\) and \(G'\), the graphlet kernel is computed as:
\[K(G, G') = \boldsymbol{f}_G^{\top} \boldsymbol{f}_{G'}\]
Problem: if \(G\) and \(G'\) have different sizes, that will greatly skew the value.
Solution: normalize each feature vector:
\[\boldsymbol{h}_G = \frac{\boldsymbol{f}_G}{\text{Sum}(\boldsymbol{f}_G)}, \qquad K(G, G') = \boldsymbol{h}_G^{\top} \boldsymbol{h}_{G'}\]
Limitation: counting graphlets is expensive! Counting size-\(k\) graphlets in an \(n\)-node graph by enumeration takes \(O(n^k)\) time.
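For \(k = 3\), the four graphlets (empty triple, single edge, path, triangle) are distinguished by their edge count alone, so a brute-force graphlet kernel can be sketched as follows (this is the \(O(n^3)\) enumeration, not the efficient algorithm from the paper; the graphs are assumed examples):

```python
import networkx as nx
from itertools import combinations
import numpy as np

def graphlet_count_k3(G):
    """f_G[i] = #(3-node induced subgraphs with i edges): the 4 size-3
    graphlets (empty, single edge, path, triangle); isolated nodes allowed."""
    f = np.zeros(4)
    for trio in combinations(G.nodes(), 3):
        m = G.subgraph(trio).number_of_edges()  # 0..3 identifies the graphlet
        f[m] += 1
    return f

def graphlet_kernel(G1, G2):
    # normalize each count vector before taking the inner product
    h1 = graphlet_count_k3(G1); h1 = h1 / h1.sum()
    h2 = graphlet_count_k3(G2); h2 = h2 / h2.sum()
    return float(h1 @ h2)

print(graphlet_count_k3(nx.complete_graph(4)))  # [0. 0. 0. 4.]
print(graphlet_count_k3(nx.path_graph(4)))      # [0. 2. 2. 0.]
```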
4.3.2 Weisfeiler-Lehman Kernel
Goal: design an efficient graph feature descriptor \(\phi(G)\).
Idea: use neighborhood structure to iteratively enrich the node vocabulary.
- This is a generalized version of Bag of node degrees, since node degrees are one-hop neighborhood information.
Algorithm to achieve this: color refinement.
4.3.2.1 Color refinement
Given: a graph \(G\) with a set of nodes \(V\).
- Assign an initial color \(c^{(0)}(v)\) to each node \(v\).
- Iteratively refine node colors by
\[c^{(k+1)}(v) = \text{HASH} \left(\left\{c^{(k)}(v),\left\{c^{(k)}(u) \right\}_{u \in N(v)}\right\}\right)\]
where \(\text{HASH}\) maps different inputs to different colors.
- After \(K\) steps of color refinement, \(c^{(K)}(v)\) summarizes the structure of the \(K\)-hop neighborhood.
Example: see pages 61-66 of lecture notes 2.
4.3.2.2 Weisfeiler-Lehman Graph Features
After color refinement, the WL kernel counts the number of nodes with a given color, and the kernel value is the inner product of the two color-count vectors.
The WL kernel is computationally efficient:
- The time complexity of color refinement at each step is linear in \(\#(\text{edges})\), since it involves aggregating neighboring colors.
- When computing a kernel value, only colors that appear in the two graphs need to be tracked. Thus, \(\#(\text{colors})\) is at most the total number of nodes.
- Counting colors takes linear time w.r.t. \(\#(\text{nodes})\).
- In total, the time complexity is linear in \(\#(\text{edges})\).
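A minimal sketch of color refinement and the resulting WL kernel. HASH is simulated by relabeling signatures with a palette shared across the graphs, so their color ids stay comparable (the graphs are assumed examples):

```python
import networkx as nx
from collections import Counter

def wl_color_bags(graphs, K=3):
    """Run K rounds of color refinement jointly over several graphs (shared
    palette = injective 'HASH'), then return one bag of colors per graph."""
    colors = [{v: 0 for v in G.nodes()} for G in graphs]  # uniform initial color
    for _ in range(K):
        # signature of v: (own color, sorted multiset of neighbor colors)
        sigs = [{v: (col[v], tuple(sorted(col[u] for u in G.neighbors(v))))
                 for v in G.nodes()}
                for G, col in zip(graphs, colors)]
        # shared palette: maps each distinct signature to a new color id
        palette = {s: i for i, s in
                   enumerate(sorted({s for d in sigs for s in d.values()}))}
        colors = [{v: palette[d[v]] for v in d} for d in sigs]
    return [Counter(c.values()) for c in colors]

def wl_kernel(G1, G2, K=3):
    # kernel value: inner product of the two bags of colors
    b1, b2 = wl_color_bags([G1, G2], K)
    return sum(b1[c] * b2[c] for c in set(b1) | set(b2))

print(wl_kernel(nx.path_graph(4), nx.path_graph(4)))    # 8
print(wl_kernel(nx.cycle_graph(6), nx.cycle_graph(3)))  # 18: WL can't tell cycles apart
```

Note that WL gives all nodes of any cycle the same color at every round, so it cannot distinguish cycles of different lengths; this is a known limitation shared with message-passing GNNs.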
4.4 Summary
Graphlet kernel:
- The graph is represented as a Bag-of-graphlets.
- Computationally expensive.
Weisfeiler-Lehman kernel:
- Applies the \(K\)-step color refinement algorithm to enrich node colors.
- Different colors capture different \(K\)-hop neighborhood structures.
- The graph is represented as a Bag-of-colors.
- Computationally efficient.
- Closely related to Graph Neural Networks.
5. Summary
Traditional ML pipeline:
- Hand-crafted features + ML model
Hand-crafted features for graph data:
- Node-level: node degree, centrality, clustering coefficient, graphlets
- Link-level: distance-based features, local/global neighborhood overlap
- Graph-level: graphlet kernel, WL kernel
References:
- Katz centrality, Wikipedia.
