Cluster Analysis - Similarities and Measurements
1. Similarity Measurement
1.1 Definitions
1.1.1 Similarity and distance
- Similarity: measures how alike two samples are.
- Distance (or dissimilarity): the smaller the distance, the greater the similarity.
1.1.2 Notation
- Samples \(\boldsymbol{x}_i, \boldsymbol{x}_j \in \mathcal{X} \subseteq \mathbb{R}^K\), where \(\boldsymbol{x}_i = \left[x_{i,1}, \cdots, x_{i,k}, \cdots, x_{i,K} \right]^{\top}\)
- Sample attributes (features): \(k = 1, 2, \cdots, K\)
1.2 Distance Measures
1.2.1 Basic properties of a distance measure
A well-defined distance measure should satisfy the following basic properties:
- Non-negativity: \(d(\boldsymbol{x}_i, \boldsymbol{x}_j) \geq 0\)
- Identity: \(d(\boldsymbol{x}_i, \boldsymbol{x}_j) = 0\) if and only if \(\boldsymbol{x}_i = \boldsymbol{x}_j\)
- Symmetry: \(d(\boldsymbol{x}_i, \boldsymbol{x}_j) = d(\boldsymbol{x}_j, \boldsymbol{x}_i)\)
- Triangle inequality: \(d(\boldsymbol{x}_i, \boldsymbol{x}_j) \leq d(\boldsymbol{x}_i, \boldsymbol{x}_k) + d(\boldsymbol{x}_k, \boldsymbol{x}_j)\)
1.2.2 Minkowski Distance
The Minkowski distance between samples \(\boldsymbol{x}_i\) and \(\boldsymbol{x}_j\) is defined as:
\[d_{ij} = \left( \sum_{k=1}^{K} \left| x_{i,k} - x_{j,k} \right|^{p} \right)^{1/p}\]
where \(p \geq 1\).
- When \(p = 1\), it is called the Manhattan distance.
- When \(p = 2\), it is called the Euclidean distance.
- When \(p = +\infty\), it is called the Chebyshev distance: \(d_{ij} = \max \limits_k |x_{i,k} - x_{j,k}|\)
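As a quick illustration (a minimal NumPy sketch with made-up vectors, not part of the original notes), the three special cases can be computed directly:

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance between two vectors, for p >= 1."""
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 2.0, 1.0])

d_manhattan = minkowski(x, y, 1)            # |1-4| + |2-2| + |3-1| = 5
d_euclidean = minkowski(x, y, 2)            # sqrt(9 + 0 + 4) = sqrt(13)
d_chebyshev = float(np.max(np.abs(x - y)))  # max(3, 0, 2) = 3
```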
1.2.3 Mahalanobis Distance
The Mahalanobis distance is a scale-invariant distance measure. Given a sample set \(\boldsymbol{X} = [x_{ik}]\) with covariance matrix \(\boldsymbol{S}\), the distance \(d_{ij}\) between two samples \(\boldsymbol{x}_i\) and \(\boldsymbol{x}_j\) is defined as:
\[d_{ij} = \left[ (\boldsymbol{x}_i - \boldsymbol{x}_j)^{\top} \boldsymbol{S}^{-1} (\boldsymbol{x}_i - \boldsymbol{x}_j) \right]^{1/2}\]
When \(\boldsymbol{S}\) is the identity matrix, i.e., when the features (attributes) of the dataset are mutually independent and each has variance 1, the Mahalanobis distance reduces to the Euclidean distance.
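A minimal NumPy sketch of this definition (the sample vectors are illustrative); with the identity matrix in place of \(\boldsymbol{S}\), it reduces to the Euclidean distance:

```python
import numpy as np

def mahalanobis(xi, xj, S):
    """Mahalanobis distance between two samples, given a covariance matrix S."""
    diff = xi - xj
    return float(np.sqrt(diff @ np.linalg.inv(S) @ diff))

xi = np.array([1.0, 2.0])
xj = np.array([4.0, 6.0])

# With S = I, the Mahalanobis distance equals the Euclidean distance (= 5 here).
d = mahalanobis(xi, xj, np.eye(2))
```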
1.2.4 Haversine Distance
The Haversine distance, or great-circle distance, between two points on a sphere is:
\[d = 2 r \arcsin \left( \sqrt{ \sin^{2} \left( \frac{x_1 - y_1}{2} \right) + \cos(x_1) \cos(y_1) \sin^{2} \left( \frac{x_2 - y_2}{2} \right) } \right)\]
- where \(x_1\) and \(y_1\) denote the latitudes and \(x_2\) and \(y_2\) denote the longitudes of the two points, in radians
- \(r\) is the radius of the sphere (the Earth)
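A pure-NumPy sketch of the same computation (sklearn's haversine_distances implements this formula; the Earth radius constant and the sample points below are assumptions for illustration):

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius in km

def haversine(lat1, lon1, lat2, lon2, r=EARTH_RADIUS_KM):
    """Great-circle distance between two (latitude, longitude) points in radians."""
    h = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return float(2 * r * np.arcsin(np.sqrt(h)))

# Two antipodal points on the equator are half a great circle apart: pi * r.
d = haversine(0.0, 0.0, 0.0, np.pi)
```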
1.3 Similarity Measures
1.3.1 Correlation Coefficient
The correlation coefficient \(r_{ij}\) between two samples \(\boldsymbol{x}_i\) and \(\boldsymbol{x}_j\) is defined as:
\[r_{ij} = \frac{\sum_{k=1}^{K} (x_{i,k} - \bar{x}_i)(x_{j,k} - \bar{x}_j)} {\left[ \sum_{k=1}^{K} (x_{i,k} - \bar{x}_i)^2 \sum_{k=1}^{K} (x_{j,k} - \bar{x}_j)^2 \right]^{1/2}}\]
where \(\bar{x}_i = \frac{1}{K} \sum_{k=1}^{K} x_{i,k}\) and \(\bar{x}_j = \frac{1}{K} \sum_{k=1}^{K} x_{j,k}\).
- The correlation coefficient ranges over \(-1 \leq r \leq 1\).
- The closer \(|r|\) is to 1, the more similar the two samples; the closer it is to 0, the less similar they are.
- For more on correlation coefficients, see the blog post: Correlation Analysis
1.3.2 Cosine Similarity
The cosine similarity \(s_{ij}\) between two samples \(\boldsymbol{x}_i\) and \(\boldsymbol{x}_j\) is defined as:
\[s_{ij} = \frac{\langle \boldsymbol{x}_i, \boldsymbol{x}_j \rangle}{\| \boldsymbol{x}_i \| \, \| \boldsymbol{x}_j \|} = \frac{\sum_{k=1}^{K} x_{i,k} \, x_{j,k}} {\left[ \sum_{k=1}^{K} x_{i,k}^2 \sum_{k=1}^{K} x_{j,k}^2 \right]^{1/2}}\]
where \(\langle \boldsymbol{x}_i, \boldsymbol{x}_j \rangle\) denotes the inner product of \(\boldsymbol{x}_i\) and \(\boldsymbol{x}_j\), i.e., \(\langle \boldsymbol{x}_i, \boldsymbol{x}_j \rangle = \boldsymbol{x}_i^{\top} \boldsymbol{x}_j\)
- The cosine similarity ranges over \(-1 \leq s \leq 1\) in general (for non-negative data, \(0 \leq s \leq 1\)).
- The closer \(s\) is to 1, the more similar the two samples; the closer it is to 0, the less similar they are.
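Both measures fit in a few lines of NumPy (the vectors below are illustrative); note that the correlation coefficient is just the cosine similarity of the mean-centred vectors:

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def correlation(x, y):
    """Pearson correlation: cosine similarity after centring each vector."""
    return cosine_similarity(x - x.mean(), y - y.mean())

x = np.array([1.0, 2.0, 3.0])

r = correlation(x, 2 * x + 1)    # a positive linear transform keeps r = 1
s = cosine_similarity(x, 3 * x)  # parallel vectors have s = 1
```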
1.4 Distance Measures for Categorical Variables
- Continuous variable
- Categorical (discrete) variable
- Ordinal variable
- Non-ordinal (or nominal) variable
1.4.1 Ordinal categorical variables
Examples of ordinal categorical variables:
- Satisfaction ratings: very dissatisfied, somewhat dissatisfied, neutral, somewhat satisfied, very satisfied
- Grades: A, B, C, D, E, F
Suppose the \(k\)-th attribute of dataset \(\boldsymbol{X}\) is an ordinal categorical variable. One can use the rank of each attribute value as the variable value and then treat the attribute as a continuous variable, i.e.:
\[x_{i,k} \rightarrow R(x_{i,k})\]
where \(R(x_{i,k})\) denotes the rank of the value \(x_{i,k}\) of the \(k\)-th attribute of sample \(\boldsymbol{x}_i\) within the dataset.
Alternatively, compute (Hastie et al., 2009):
\[x_{i,k} \rightarrow \frac{R(x_{i,k}) - 1/2}{L}\]
where \(L\) is the total number of levels of attribute \(k\) in dataset \(\boldsymbol{X}\).
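A small sketch of the \((r - 1/2)/L\) encoding, where the level names and sample values are made up for illustration:

```python
def encode_ordinal(values, levels):
    """Map ordered category values to (rank - 1/2) / L, with ranks starting at 1."""
    L = len(levels)
    rank = {level: r + 1 for r, level in enumerate(levels)}
    return [(rank[v] - 0.5) / L for v in values]

levels = ["very dissatisfied", "neutral", "very satisfied"]
encoded = encode_ordinal(["very dissatisfied", "very satisfied"], levels)
# the encoded values fall evenly inside (0, 1): [0.5/3, 2.5/3]
```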
1.4.2 Nominal variables
VDM (Value Difference Metric):
Suppose the \(k\)-th attribute of dataset \(\mathbf{X}\) is a nominal variable, and two samples \(\boldsymbol{x}_i\) and \(\boldsymbol{x}_j\) take values \(a\) and \(b\) on attribute \(k\), i.e., \(x_{i,k} = a\), \(x_{j,k} = b\). Define:
- \(m_{k,a}\): the number of samples taking value \(a\) on attribute \(k\)
- \(m_{k,a,l}\): the number of samples in the \(l\)-th cluster taking value \(a\) on attribute \(k\)
- \(L\): the number of clusters
The VDM distance between the two values \(a\) and \(b\) on attribute \(k\) is then:
\[\mathrm{VDM}_p (a, b) = \sum_{l=1}^{L} \left| \frac{m_{k,a,l}}{m_{k,a}} - \frac{m_{k,b,l}}{m_{k,b}} \right|^{p}\]
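Using these counts, \(\mathrm{VDM}_p(a,b) = \sum_l | m_{k,a,l}/m_{k,a} - m_{k,b,l}/m_{k,b} |^p\). A plain-Python sketch over a single attribute (the toy data are illustrative):

```python
def vdm(values, clusters, a, b, p=2):
    """Value Difference Metric between attribute values a and b.

    values[i]   -- the attribute value of sample i
    clusters[i] -- the cluster label of sample i
    """
    m_a = sum(1 for v in values if v == a)
    m_b = sum(1 for v in values if v == b)
    total = 0.0
    for l in set(clusters):
        m_al = sum(1 for v, c in zip(values, clusters) if v == a and c == l)
        m_bl = sum(1 for v, c in zip(values, clusters) if v == b and c == l)
        total += abs(m_al / m_a - m_bl / m_b) ** p
    return total

# 'a' appears only in cluster 0 and 'b' only in cluster 1: maximal difference
d_far = vdm(["a", "a", "b", "b"], [0, 0, 1, 1], "a", "b")
# identical per-cluster distributions give distance 0
d_same = vdm(["a", "b", "a", "b"], [0, 0, 1, 1], "a", "b")
```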
1.5 Mixed Variables
The Minkowski distance and the VDM can be combined. Suppose the dataset has \(K_c\) continuous variables (ordered first) and \(K - K_c\) discrete variables; then:
\[\mathrm{MinkovDM}_p (\boldsymbol{x}_i, \boldsymbol{x}_j) = \left( \sum_{k=1}^{K_c} \left| x_{i,k} - x_{j,k} \right|^{p} + \sum_{k=K_c+1}^{K} \mathrm{VDM}_p (x_{i,k}, x_{j,k}) \right)^{1/p}\]
Other methods: k-modes and k-prototypes: github
2. Classes and Clusters
2.1 Definition and Characteristics of a Cluster
Notation:
- \(G_i\): the \(i\)-th class or cluster
- \(n_{G_i} = |G_i|\): the number of samples in cluster \(G_i\)
2.1.1 Definition of a class or cluster
Class or cluster: let \(T\) be a given positive threshold. If any two samples \(\boldsymbol{x}_i\) and \(\boldsymbol{x}_j\) in a set \(G\) satisfy:
\[d_{ij} \leq T\]
then \(G\) is called a class or cluster.
2.1.2 Common characteristics of a cluster
Mean, or centroid, of cluster \(G\), \(\bar{\boldsymbol{x}}_G\):
\[\bar{\boldsymbol{x}}_G = \frac{1}{n_G} \sum_{\boldsymbol{x}_i \in G} \boldsymbol{x}_i\]
where \(n_G\) is the number of samples in cluster \(G\).
Diameter \(\mathrm{diam}(G)\): the maximum distance between any two samples in \(G\):
\[\mathrm{diam}(G) = \max_{\boldsymbol{x}_i, \boldsymbol{x}_j \in G} d_{ij}\]
Average within-cluster distance \(\mathrm{ave}(G)\):
\[\mathrm{ave}(G) = \frac{2}{n_G (n_G - 1)} \sum_{\boldsymbol{x}_i, \boldsymbol{x}_j \in G, \, i < j} d_{ij}\]
Scatter matrix \(A_G\):
\[A_G = \sum_{\boldsymbol{x}_i \in G} (\boldsymbol{x}_i - \bar{\boldsymbol{x}}_G)(\boldsymbol{x}_i - \bar{\boldsymbol{x}}_G)^{\top}\]
Covariance matrix \(S_G\):
\[S_G = \frac{1}{K - 1} A_G\]
where \(K\) denotes the sample dimension (i.e., the number of attributes).
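These characteristics can be computed directly; a NumPy sketch with a toy three-point cluster (following the convention above, the covariance divides the scatter matrix by \(K - 1\), where \(K\) is the number of attributes):

```python
import numpy as np
from itertools import combinations

G = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])  # a toy cluster

centroid = G.mean(axis=0)

# diameter: the largest pairwise distance inside the cluster
diameter = max(float(np.linalg.norm(xi - xj)) for xi, xj in combinations(G, 2))

# scatter matrix A_G and covariance S_G = A_G / (K - 1), K = number of attributes
A = (G - centroid).T @ (G - centroid)
S = A / (G.shape[1] - 1)
```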
2.2 Distance Measures Between Clusters
Definitions:
- The distance \(D(p, q)\) between two clusters \(G_p\) and \(G_q\) is also called a linkage.
- \(G_p\) contains \(n_p\) samples with centroid (mean) \(\bar{\boldsymbol{x}}_p\)
- \(G_q\) contains \(n_q\) samples with centroid (mean) \(\bar{\boldsymbol{x}}_q\)
Single (minimum) linkage:
\[D_{\min}(p, q) = \min \left\{ d_{ij} \mid \boldsymbol{x}_i \in G_p, \boldsymbol{x}_j \in G_q \right\}\]
Complete (maximum) linkage:
\[D_{\max}(p, q) = \max \left\{ d_{ij} \mid \boldsymbol{x}_i \in G_p, \boldsymbol{x}_j \in G_q \right\}\]
Centroid linkage:
\[D_{\mathrm{cen}}(p, q) = d(\bar{\boldsymbol{x}}_p, \bar{\boldsymbol{x}}_q)\]
Average (mean) linkage:
\[D_{\mathrm{ave}}(p, q) = \frac{1}{n_p n_q} \sum_{\boldsymbol{x}_i \in G_p} \sum_{\boldsymbol{x}_j \in G_q} d_{ij}\]
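The four linkages can be sketched with plain NumPy (the two toy clusters are illustrative):

```python
import numpy as np

def pairwise(Gp, Gq):
    """All Euclidean distances between samples of two clusters."""
    return np.linalg.norm(Gp[:, None, :] - Gq[None, :, :], axis=-1)

def single_linkage(Gp, Gq):
    return float(pairwise(Gp, Gq).min())

def complete_linkage(Gp, Gq):
    return float(pairwise(Gp, Gq).max())

def average_linkage(Gp, Gq):
    return float(pairwise(Gp, Gq).mean())

def centroid_linkage(Gp, Gq):
    return float(np.linalg.norm(Gp.mean(axis=0) - Gq.mean(axis=0)))

Gp = np.array([[0.0, 0.0], [0.0, 1.0]])
Gq = np.array([[3.0, 0.0], [4.0, 0.0]])
```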
3. Performance Evaluation
A good clustering result has high intra-cluster similarity and low inter-cluster similarity.
- External index: compare the clustering result against a reference model.
- Internal index: evaluate the clustering result directly, without any reference model.
Notation:
- \(\mathcal{D} = \{\boldsymbol{x}_1, \boldsymbol{x}_2, \cdots, \boldsymbol{x}_N\}\): the dataset; \(N\) is the total number of samples
- \(\mathcal{C} = \{C_1, C_2, \cdots, C_K \}\): the cluster partition produced by the clustering; \(K\) is the number of clusters
- \(\mathcal{C}^* = \{C_1^*, C_2^*, \cdots, C_S^* \}\): the partition given by the reference model; \(S\) is its number of clusters
- Let \(\lambda\) and \(\lambda^*\) denote the cluster label vectors corresponding to \(\mathcal{C}\) and \(\mathcal{C}^*\), respectively.
- Considering all pairs of samples, define:
\[\begin{aligned} a=|S S|, \quad & S S=\left\{\left(\boldsymbol{x}_{i}, \boldsymbol{x}_{j} \right) \mid \lambda_{i}=\lambda_{j}, \lambda_{i}^{*}=\lambda_{j}^{*}, i<j \right\} \\ b=|S D|, \quad & S D=\left\{\left(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}\right) \mid \lambda_{i}=\lambda_{j}, \lambda_{i}^{*} \neq \lambda_{j}^{*}, i<j \right\} \\ c=|D S|, \quad & D S=\left\{\left(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}\right) \mid \lambda_{i} \neq \lambda_{j}, \lambda_{i}^{*}=\lambda_{j}^{*}, i<j \right\} \\ d=|D D|, \quad & D D=\left\{\left(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}\right) \mid \lambda_{i} \neq \lambda_{j}, \lambda_{i}^{*} \neq \lambda_{j}^{*}, i<j \right\} \end{aligned} \]
- \(SS\) contains the sample pairs that belong to the same cluster in \(\mathcal{C}\) and to the same cluster in \(\mathcal{C}^*\)
- \(SD\) contains the sample pairs that belong to the same cluster in \(\mathcal{C}\) but to different clusters in \(\mathcal{C}^*\)
- \(DS\) contains the sample pairs that belong to different clusters in \(\mathcal{C}\) but to the same cluster in \(\mathcal{C}^*\)
- \(DD\) contains the sample pairs that belong to different clusters in \(\mathcal{C}\) and to different clusters in \(\mathcal{C}^*\)
- Since each sample pair \((\boldsymbol{x}_{i}, \boldsymbol{x}_{j})\) can appear in exactly one of these sets:
\[a + b + c + d = \frac{N(N-1)}{2}\]
- Alternatively, the counts can be obtained from a contingency table \([n_{ij}]\) with \(n_{ij} = |C_i \cap C_j^*|\), row sums \(a_i\), and column sums \(b_j\) (Rand index, wikipedia, 2023), where \(N = \sum_{i} a_i = \sum_{j} b_j\)
3.0 Summary
- External indices: Jaccard coefficient, Fowlkes-Mallows index, Rand index, adjusted Rand index
- Internal indices: Calinski-Harabasz index, Davies-Bouldin index, Dunn index, Silhouette index
3.1 External Indices
(1) Jaccard Coefficient (JC)
\[\mathrm{JC} = \frac{a}{a + b + c}\]
- \(\mathrm{JC} \in [0,1]\); larger values are better.
(2) Fowlkes-Mallows Index (FMI)
\[\mathrm{FMI} = \sqrt{\frac{a}{a+b} \cdot \frac{a}{a+c}}\]
- Random (uniform) label assignments have an FMI score close to 0.
- Values close to zero indicate two label assignments that are largely independent, while values close to one indicate significant agreement.
- \(\mathrm{FMI} \in [0, 1]\)
(3) Rand Index (RI)
\[\mathrm{RI} = \frac{2 (a + d)}{N (N - 1)}\]
Adjusted Rand Index (ARI) (wikipedia, 2023):
\[\mathrm{ARI} = \frac{\mathrm{RI} - E[\mathrm{RI}]}{\max(\mathrm{RI}) - E[\mathrm{RI}]}\]
- Random (uniform) label assignments have an ARI score close to 0.
- RI does not guarantee that random label assignments will get a value close to zero (especially if the number of clusters is of the same order of magnitude as the number of samples).
- Lower values indicate different labelings; similar clusterings have a high RI or ARI, and 1 is the perfect match score.
- \(\mathrm{RI} \in [0, 1]\) and \(\mathrm{ARI} \in [-1, 1]\)
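A plain-Python sketch of the pair counting and the closed-form indices above (for production use, sklearn provides adjusted_rand_score and fowlkes_mallows_score; the label vectors below are illustrative):

```python
from itertools import combinations

def pair_counts(pred, true):
    """Count a=|SS|, b=|SD|, c=|DS|, d=|DD| over all pairs i < j."""
    a = b = c = d = 0
    for i, j in combinations(range(len(pred)), 2):
        same_pred, same_true = pred[i] == pred[j], true[i] == true[j]
        if same_pred and same_true:
            a += 1
        elif same_pred:
            b += 1
        elif same_true:
            c += 1
        else:
            d += 1
    return a, b, c, d

def jaccard(a, b, c, d):
    return a / (a + b + c)

def fmi(a, b, c, d):
    return a / ((a + b) * (a + c)) ** 0.5

def rand_index(a, b, c, d):
    return (a + d) / (a + b + c + d)

# A clustering matching the reference partition gives JC = FMI = RI = 1,
# even though the cluster labels themselves are permuted.
counts = pair_counts([0, 0, 1, 1], [1, 1, 0, 0])
```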
3.2 Internal Indices
3.2.1 CH Index
Calinski-Harabasz Index (CHI): defined as the ratio of the mean between-cluster dispersion to the mean within-cluster dispersion:
\[\mathrm{CHI} = \frac{\mathrm{tr}(\mathbf{B})}{\mathrm{tr}(\mathbf{W})} \times \frac{N - K}{K - 1}\]
- where \(\mathrm{tr}(\mathbf{B})\) is the trace of the between-cluster dispersion matrix:
\[\mathbf{B} = \sum_{k=1}^{K} N_k (\boldsymbol{c}_k - \boldsymbol{c}_0)(\boldsymbol{c}_k - \boldsymbol{c}_0)^{\top}\]
- and \(\mathrm{tr}(\mathbf{W})\) is the trace of the within-cluster dispersion matrix:
\[\mathbf{W} = \sum_{k=1}^{K} \sum_{\boldsymbol{x} \in \mathcal{C}_k} (\boldsymbol{x} - \boldsymbol{c}_k)(\boldsymbol{x} - \boldsymbol{c}_k)^{\top}\]
- where \(\mathcal{C}_k\) is the set of samples in cluster \(k\)
- \(\boldsymbol{c}_k\) is the center (centroid) of cluster \(k\)
- \(\boldsymbol{c}_0\) is the center of the whole dataset
- \(N_k\) is the number of samples in cluster \(k\)
Properties:
- The score is higher when clusters are dense and well separated, which relates to a standard concept of a cluster.
- \(\mathrm{CHI} \in [0, +\infty)\)
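A NumPy sketch intended to match this definition (and sklearn.metrics.calinski_harabasz_score); the toy data are illustrative:

```python
import numpy as np

def calinski_harabasz(X, labels):
    """CHI = [tr(B) / (K - 1)] / [tr(W) / (N - K)]."""
    N, ks = len(X), np.unique(labels)
    K, c0 = len(ks), X.mean(axis=0)
    # trace of B: weighted squared distances of cluster centroids to the center
    tr_B = sum(np.sum(labels == k) * np.sum((X[labels == k].mean(axis=0) - c0) ** 2)
               for k in ks)
    # trace of W: squared distances of samples to their own cluster centroid
    tr_W = sum(np.sum((X[labels == k] - X[labels == k].mean(axis=0)) ** 2)
               for k in ks)
    return float((tr_B / (K - 1)) / (tr_W / (N - K)))

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])
chi = calinski_harabasz(X, labels)  # dense, well-separated clusters -> large CHI
```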
3.2.2 DB Index
Davies-Bouldin Index (DBI) (周志华, 2016):
\[\mathrm{DBI} = \frac{1}{K} \sum_{k=1}^{K} \max_{s \neq k} \left( \frac{\mathrm{avg}(\mathcal{C}_k) + \mathrm{avg}(\mathcal{C}_s)}{d_{\mathrm{cen}}(\mathcal{C}_k, \mathcal{C}_s)} \right)\]
An alternative formulation (Wikipedia, 2022; scikit-learn, 2022):
\[\mathrm{DBI} = \frac{1}{K} \sum_{k=1}^{K} \max_{s \neq k} R(\mathcal{C}_k, \mathcal{C}_s)\]
- where \(R(\mathcal{C}_k, \mathcal{C}_s)\) measures the relation between the two clusters \(\mathcal{C}_k\) and \(\mathcal{C}_s\); it is non-negative and symmetric, i.e., \(R(\mathcal{C}_k, \mathcal{C}_s) = R(\mathcal{C}_s, \mathcal{C}_k) \geq 0\):
\[R(\mathcal{C}_k, \mathcal{C}_s) = \frac{S(\mathcal{C}_k) + S(\mathcal{C}_s)} {d_{\text{cen}}(\mathcal{C}_k, \mathcal{C}_s)} \]
- where \(S(\mathcal{C}_k)\) is the average distance of all samples in cluster \(\mathcal{C}_k\) to its centroid \(\mathrm{cen}(\mathcal{C}_k)\):
\[S(\mathcal{C}_k) = \frac{1}{|\mathcal{C}_k|} \sum_{\boldsymbol{x}_i \in \mathcal{C}_k} \text{dist} \big( \boldsymbol{x}_i, \mathrm{cen} (\mathcal{C}_k) \big) \]
- \(d_{\text{cen}}(\mathcal{C}_k, \mathcal{C}_s)\) is the distance between the two cluster centroids:
\[d_{\mathrm{cen}} (\mathcal{C}_k, \mathcal{C}_s) = d_{\mathrm{cen}} (\bar{\boldsymbol{x}}_k, \bar{\boldsymbol{x}}_s) \]
Properties:
- Lower DBI values indicate better-separated clusters; zero is the lowest possible score.
- The DBI is generally higher for convex clusters than for other concepts of clusters, such as density-based clusters like those obtained from DBSCAN.
- The use of centroid distance limits the distance metric to Euclidean space.
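A sketch of the centroid-based formulation (intended to match sklearn.metrics.davies_bouldin_score; the toy data are illustrative):

```python
import numpy as np

def davies_bouldin(X, labels):
    """DBI = (1/K) * sum_k max_{s != k} (S_k + S_s) / d(cen_k, cen_s)."""
    ks = np.unique(labels)
    cen = {k: X[labels == k].mean(axis=0) for k in ks}
    # S_k: mean distance of the samples in cluster k to its centroid
    S = {k: float(np.mean(np.linalg.norm(X[labels == k] - cen[k], axis=1)))
         for k in ks}
    total = 0.0
    for k in ks:
        total += max((S[k] + S[s]) / float(np.linalg.norm(cen[k] - cen[s]))
                     for s in ks if s != k)
    return total / len(ks)

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])
dbi = davies_bouldin(X, labels)  # compact, distant clusters -> small DBI
```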
3.2.3 Dunn Index
Dunn Index (DI) for cluster \(\mathcal{C}_k\) (wikipedia, Dunn index, 2023):
\[\mathrm{DI}_k = \min_{j \neq k} \left( \frac{\delta(\mathcal{C}_k, \mathcal{C}_j)}{\max_{1 \leq l \leq K} \Delta(\mathcal{C}_l)} \right)\]
where \(\delta( \mathcal{C}_{i}, \mathcal{C}_{j})\) indicates the inter-cluster distance between \(\mathcal{C}_{i}\) and \(\mathcal{C}_{j}\), and \(\Delta (\mathcal{C}_k)\) indicates the intra-cluster distance of cluster \(\mathcal{C}_k\).
The DI over all clusters (周志华, 2016, pp. 199 & jqmviegas, 2023):
\[\mathrm{DI} = \min_{1 \leq k \leq K} \mathrm{DI}_k = \min_{1 \leq k \leq K} \left\{ \min_{j \neq k} \left( \frac{\delta(\mathcal{C}_k, \mathcal{C}_j)}{\max_{1 \leq l \leq K} \Delta(\mathcal{C}_l)} \right) \right\}\]
- Larger DI values are better.
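A NumPy sketch using the minimum point-to-point distance for \(\delta\) and the cluster diameter for \(\Delta\) (one common choice among several; the toy data are illustrative):

```python
import numpy as np

def dunn_index(X, labels):
    """Minimum inter-cluster distance divided by the maximum cluster diameter."""
    ks = np.unique(labels)

    def delta(k, s):
        # inter-cluster distance: closest pair of points across the two clusters
        Dk, Ds = X[labels == k], X[labels == s]
        return np.linalg.norm(Dk[:, None] - Ds[None, :], axis=-1).min()

    def diam(k):
        # intra-cluster distance: the cluster diameter
        Dk = X[labels == k]
        return np.linalg.norm(Dk[:, None] - Dk[None, :], axis=-1).max()

    max_diam = max(diam(k) for k in ks)
    return float(min(delta(k, s) for k in ks for s in ks if s < k) / max_diam)

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])
di = dunn_index(X, labels)  # well-separated, compact clusters -> large DI
```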
3.2.4 Silhouette Index
For each sample \(\boldsymbol{x}_i\), the silhouette value \(s(i)\) is computed as:
\[s(i) = \frac{b(i) - a(i)}{\max \{ a(i), b(i) \}}\]
- where \(a(i)\) is the mean distance between the sample \(\boldsymbol{x}_i\) and all other points in the same cluster (denoted as \(\mathcal{C}_I\));
- \(b(i)\) is the mean distance between the sample \(\boldsymbol{x}_i\) and all points in the next nearest cluster.
The overall Silhouette index for the whole dataset:
- Simple mean over all samples, implemented by sklearn.metrics.silhouette_score:
\[\mathrm{SI} = \frac{1}{N} \sum_{i} s(i) \]
- Silhouette coefficient (Kaufman et al.): the maximum of the mean \(s(i)\) over the entire dataset, taken across candidate numbers of clusters:
\[\mathrm{SC} = \max_{k} \bar{s}(k), \qquad \bar{s}(k) = \frac{1}{N} \sum_{i=1}^{N} s(i)\]
- where \(\bar{s}(k)\) represents the mean \(s(i)\) over all data of the entire dataset for a specific number of clusters \(k\).
Properties:
- A higher SI score relates to a model with better-defined clusters.
- \(s(i) \in [-1, 1]\)
- \(-1\) indicates incorrect clustering and \(+1\) indicates highly dense clustering.
- Scores around \(0\) indicate overlapping clusters.
- The score is higher when clusters are dense and well separated, which relates to a standard concept of a cluster.
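A per-sample sketch of \(s(i)\), assuming every cluster has at least two points (sklearn.metrics.silhouette_score returns the mean of these values; the toy data are illustrative):

```python
import numpy as np

def silhouette_values(X, labels):
    """s(i) = (b(i) - a(i)) / max(a(i), b(i)) for every sample."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # pairwise distances
    s = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        own[i] = False  # exclude the point itself from a(i)
        a = D[i, own].mean()
        b = min(D[i, labels == k].mean()
                for k in np.unique(labels) if k != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])
s = silhouette_values(X, labels)  # values near 1: dense, well-separated clusters
```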
4. Implementation in Python
4.1 Distance measures
4.1.2 The sklearn library
Functions in the sklearn.metrics.pairwise module, site:
- `cosine_similarity(X, Y=None, dense_output=True)`
- `cosine_distances(X, Y=None, *)` and `paired_cosine_distances(X, Y)`: equal to 1 minus the cosine similarity
- `euclidean_distances(X, Y=None, *)` and `paired_euclidean_distances(X, Y)`
- `manhattan_distances(X, Y=None, *)` and `paired_manhattan_distances(X, Y)`
- `haversine_distances(X, Y=None)`: Haversine (or great-circle) distance
  - Parameters: X, Y array-like of shape (n_samples_X, 2); the first coordinate of each point is the latitude, the second is the longitude, given in radians
- `nan_euclidean_distances(X, Y=None, *)`
  - Calculates the Euclidean distance in the presence of missing values:
  - `dist(x,y) = sqrt(weight * sq. distance from present coordinates)`, where `weight = Total # of coordinates / # of present coordinates`
  - Example: \(x_1=[3, \mathrm{NA}, \mathrm{NA}, 6]\), \(x_2=[1, \mathrm{NA}, 4, 5]\), \(\mathrm{dist}(x_1, x_2)=\sqrt{\dfrac{4}{2}[(3-1)^2 + (6-5)^2]}\)
- General forms:
  - `metrics.pairwise_distances(X, Y=None, metric='euclidean')`: computes the distance matrix between X and Y
  - `metrics.pairwise.paired_distances(X, Y, *, metric='euclidean')`: computes the distances between corresponding pairs of samples in X and Y; X and Y must have the same shape
- The `metric` parameter:
  - From scikit-learn: `['cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan']`; these metrics support sparse matrix inputs. Also `['nan_euclidean']`, which does not yet support sparse matrices.
  - From scipy.spatial.distance: `['braycurtis', 'canberra', 'chebyshev', 'correlation', 'dice', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule']`
- Others: kernel distance
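The weighting rule used by `nan_euclidean_distances` can be sketched in a few lines, reproducing the worked example above:

```python
import numpy as np

def nan_euclidean(x, y):
    """Euclidean distance that skips coordinates where either value is NaN,
    rescaled by (total coordinates / present coordinates)."""
    present = ~(np.isnan(x) | np.isnan(y))
    weight = len(x) / present.sum()
    return float(np.sqrt(weight * np.sum((x[present] - y[present]) ** 2)))

x1 = np.array([3.0, np.nan, np.nan, 6.0])
x2 = np.array([1.0, np.nan, 4.0, 5.0])
d = nan_euclidean(x1, x2)  # sqrt((4/2) * ((3-1)^2 + (6-5)^2)) = sqrt(10)
```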
4.1.3 The scipy library
The `cdist()` function in the scipy `spatial.distance` module:
`dist = scipy.spatial.distance.cdist(XA, XB, metric='euclidean', *)`
References
Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). "Section 14.3 Cluster Analysis" in The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer.
周志华, "第 9 章 聚类", in 机器学习, 2016.
李航, "14.1 聚类的基本概念", in 统计学习方法 (第二版).
Wilson, D. R., & Martinez, T. R. (1997). Improved heterogeneous distance functions. Journal of Artificial Intelligence Research, 6, 1-34.
Payne, T. R., & Edwards, P. (1998). Implicit feature selection with the value difference metric. In Proceedings of the European Conference on Artificial Intelligence (ECAI-98). Wiley.
scikit-learn documentation
- Clustering performance evaluation, site
Wikipedia: Rand index; Dunn index; Davies-Bouldin index; Silhouette (clustering)
