Cluster Analysis - Similarities and Measurements

1. Sample Similarity Measurement

1.1 Definitions

1.1.1 Similarity and Distance

  • Similarity: a measure of how alike two samples are

  • Distance (or dissimilarity): the smaller the distance, the greater the similarity

1.1.2 Notation

  • Samples \(\boldsymbol{x}_i, \boldsymbol{x}_j \in \mathcal{X} \subseteq \mathbb{R}^K\)

    \(\boldsymbol{x}_i = \left[x_{i,1}, \cdots, x_{i,k}, \cdots, x_{i,K} \right]^{\top}\)

  • Sample attributes (features): \(k=1,2,\cdots,K\)

1.2 Distance Measures

1.2.1 Basic Properties of a Distance Measure

A well-defined distance measure should satisfy the following basic properties:

  • Non-negativity: \(d(\boldsymbol{x}_i, \boldsymbol{x}_j) \geq 0\)

  • Identity of indiscernibles: \(d(\boldsymbol{x}_i, \boldsymbol{x}_j) = 0\) if and only if \(\boldsymbol{x}_i = \boldsymbol{x}_j\)

  • Symmetry: \(d(\boldsymbol{x}_i, \boldsymbol{x}_j) = d(\boldsymbol{x}_j, \boldsymbol{x}_i)\)

  • Triangle inequality: \(d(\boldsymbol{x}_i, \boldsymbol{x}_j) \leq d(\boldsymbol{x}_i, \boldsymbol{x}_k) + d(\boldsymbol{x}_k, \boldsymbol{x}_j)\)

1.2.2 Minkowski Distance

The Minkowski distance between samples \(\boldsymbol{x}_i\) and \(\boldsymbol{x}_j\) is defined as:

\[d_{i j} = \left(\sum_{k=1}^{K}\left|x_{i, k}-x_{j, k}\right|^{p}\right)^{\frac{1}{p}} \]

where \(p \geq 1\).

  • \(p=1\): the Manhattan distance

  • \(p=2\): the Euclidean distance

  • \(p=+\infty\): the Chebyshev distance, \(d_{ij} = \max \limits_k |x_{i,k} - x_{j,k}|\)
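As a quick check of the three special cases above, a minimal NumPy sketch (the two vectors are made up for illustration):

```python
import numpy as np

# Two made-up sample vectors (illustrative only).
x_i = np.array([1.0, 2.0, 3.0])
x_j = np.array([4.0, 0.0, 3.0])

def minkowski(x, y, p):
    """Minkowski distance: (sum_k |x_k - y_k|^p)^(1/p)."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

manhattan = minkowski(x_i, x_j, p=1)       # p = 1
euclidean = minkowski(x_i, x_j, p=2)       # p = 2
chebyshev = np.abs(x_i - x_j).max()        # limit as p -> infinity
```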

1.2.3 Mahalanobis Distance

The Mahalanobis distance is a scale-invariant distance measure. Given a sample set \(\boldsymbol{X} = [x_{ik}]\) whose covariance matrix is denoted \(\boldsymbol{S}\), the distance \(d_{ij}\) between two samples \(\boldsymbol{x}_i\) and \(\boldsymbol{x}_j\) is defined as:

\[d_{i j}=\left[\left(\boldsymbol{x}_{i}-\boldsymbol{x}_{j}\right)^{\top} \boldsymbol{S}^{-1}\left(\boldsymbol{x}_{i}-\boldsymbol{x}_{j}\right)\right]^{\frac{1}{2}} \]

When \(\boldsymbol{S}\) is the identity matrix, i.e. the features (attributes) of the dataset are mutually independent and each has unit variance, the Mahalanobis distance reduces to the Euclidean distance.
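A minimal NumPy sketch of the definition (the dataset is made up; `np.cov` estimates \(\boldsymbol{S}\) from the data):

```python
import numpy as np

# Small made-up dataset: rows are samples, columns are features.
X = np.array([[2.0, 2.0],
              [2.0, 5.0],
              [6.0, 5.0],
              [7.0, 3.0],
              [4.0, 7.0]])

S = np.cov(X, rowvar=False)        # estimated covariance matrix of the dataset
S_inv = np.linalg.inv(S)

def mahalanobis(x, y, S_inv):
    diff = x - y
    return float(np.sqrt(diff @ S_inv @ diff))

d_mah = mahalanobis(X[0], X[1], S_inv)

# With S equal to the identity, the distance reduces to the Euclidean distance.
d_euc = mahalanobis(X[0], X[1], np.eye(2))
```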

1.2.4 Haversine Distance

The haversine distance, also known as the great-circle distance:

\[\mathrm{dist}(x, y) = 2 r \arcsin \left[ \sqrt{ \sin^2 \left( \frac{x_1 - y_1}{2} \right) + \cos(x_1) \cos(y_1) \sin^2 \left( \frac{x_2 - y_2}{2} \right)} \right] \]

  • where \(x_1\) and \(y_1\) are latitudes, and \(x_2\) and \(y_2\) are longitudes, all given in radians

  • \(r\) is the radius of the sphere (e.g. the Earth)
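A direct transcription of the formula in plain Python (a sketch; the Earth radius of 6371 km is an approximation):

```python
import math

def haversine(lat1, lon1, lat2, lon2, r=6371.0):
    """Great-circle distance between two points given as (lat, lon) in radians;
    r is the sphere radius (6371 km roughly approximates the Earth)."""
    return 2 * r * math.asin(math.sqrt(
        math.sin((lat1 - lat2) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon1 - lon2) / 2) ** 2))

# Sanity check: from the equator to the north pole is a quarter great circle.
d = haversine(0.0, 0.0, math.pi / 2, 0.0)
```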

1.3 Similarity Measures

1.3.1 Correlation Coefficient

The correlation coefficient \(r_{ij}\) between two samples \(\boldsymbol{x}_i\) and \(\boldsymbol{x}_j\) is defined as:

\[r_{i j} =\frac{\sum \limits_{k=1}^{K}\left(x_{i, k}-\bar{\boldsymbol{x}}_{i}\right)\left(x_{j, k}-\bar{\boldsymbol{x}}_{j}\right)}{\left[\sum \limits_{k=1}^{K}\left(x_{i, k}-\bar{\boldsymbol{x}}_{i}\right)^{2} \cdot \sum \limits_{k=1}^{K}\left(x_{j, k}-\bar{\boldsymbol{x}}_{j}\right)^{2}\right]^{\frac{1}{2}}} \]

where:

\[\bar{\boldsymbol{x}}_{i} =\frac{1}{K} \sum_{k=1}^{K} x_{i, k}, \quad \bar{\boldsymbol{x}}_{j} =\frac{1}{K} \sum_{k=1}^{K} x_{j, k} \]

  • The correlation coefficient \(r\) ranges over \(-1 \leq r \leq 1\)

  • The closer \(|r|\) is to 1, the more similar the two samples; the closer it is to 0, the less similar they are.

  • For more on correlation coefficients, see the blog post: Correlation Analysis
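The definition can be transcribed directly with NumPy and cross-checked against `np.corrcoef` (the sample vectors are made up):

```python
import numpy as np

# Two made-up samples over K = 4 attributes.
x_i = np.array([1.0, 2.0, 3.0, 4.0])
x_j = np.array([2.0, 4.0, 6.0, 8.0])   # perfectly linearly related to x_i

def corr(x, y):
    """Correlation coefficient r_ij per the definition above."""
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))

r = corr(x_i, x_j)
r_np = np.corrcoef(x_i, x_j)[0, 1]     # cross-check with NumPy's estimator
```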

1.3.2 Cosine Similarity

The cosine similarity \(s_{ij}\) between two samples \(\boldsymbol{x}_i\) and \(\boldsymbol{x}_j\) is defined as:

\[s_{i j} = \frac{\langle \boldsymbol{x}_i, \boldsymbol{x}_j \rangle}{\| \boldsymbol{x}_i \| \cdot \| \boldsymbol{x}_j \|} \\ = \frac{\sum \limits_{k=1}^{K} x_{i, k} \cdot x_{j, k }}{\sqrt{\sum \limits_{k=1}^{K} (x_{i, k})^{2} \cdot \sum \limits_{k=1}^{K} (x_{j, k})^{2}}} \]

where \(\langle \boldsymbol{x}_i, \boldsymbol{x}_j \rangle\) denotes the inner product of vectors \(\boldsymbol{x}_i\) and \(\boldsymbol{x}_j\), i.e. \(\langle \boldsymbol{x}_i, \boldsymbol{x}_j \rangle = \boldsymbol{x}_i^{\top} \boldsymbol{x}_j\)

  • The cosine similarity \(s\) ranges over \(-1 \leq s \leq 1\) (and over \([0, 1]\) when all attribute values are non-negative)

  • The closer \(|s|\) is to 1, the more similar the two samples; the closer it is to 0, the less similar they are.
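A minimal sketch of the two extremes of the range (the vectors are made up):

```python
import numpy as np

def cosine_sim(x, y):
    """Cosine similarity s_ij = <x, y> / (||x|| * ||y||)."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

x_i = np.array([1.0, 0.0, 1.0])
s_same = cosine_sim(x_i, np.array([0.5, 0.0, 0.5]))    # same direction
s_opp  = cosine_sim(x_i, np.array([-1.0, 0.0, -1.0]))  # opposite direction
```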

1.4 Distance Measures for Categorical Variables

  • Continuous variables

  • Categorical variables

    • Ordinal variables

    • Non-ordinal (nominal) variables

1.4.1 Ordinal Categorical Variables

Examples of ordinal categorical variables:

  • Satisfaction rating: very dissatisfied, somewhat dissatisfied, neutral, somewhat satisfied, very satisfied

  • Grades: A, B, C, D, E, F

Suppose the \(k\)-th attribute of dataset \(\boldsymbol{X}\) is an ordinal categorical variable. Its rank can be used as the variable value, which is then treated as a continuous variable, i.e.:

\[x_{i,k}' = R(x_{i,k}), \quad \forall \ i \]

where \(R(x_{i,k})\) is the rank of the value \(x_{i,k}\) of the \(k\)-th attribute of sample \(\boldsymbol{x}_i\) within the dataset.

Alternatively (Hastie et al., 2009):

\[x_{i,k}' = \frac{R(x_{i,k})-\frac{1}{2}}{L}, \quad \forall \ i \]

where \(L\) is the number of distinct categories of attribute \(k\) in dataset \(\boldsymbol{X}\).
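A minimal sketch of both encodings for a made-up grade attribute (the category order and sample values are assumptions for illustration):

```python
# Ordered categories of an ordinal attribute (best to worst).
grades = ["A", "B", "C", "D", "E", "F"]
rank = {g: i + 1 for i, g in enumerate(grades)}   # R(x) in 1..L
L = len(grades)

samples = ["B", "F", "A"]
ranks = [rank[g] for g in samples]                # treated as continuous values
scaled = [(r - 0.5) / L for r in ranks]           # Hastie et al. (2009) transform
```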

1.4.2 Nominal Variables

VDM (Value Difference Metric):

Suppose the \(k\)-th attribute of dataset \(\mathbf{X}\) is a nominal variable, and that two samples \(\boldsymbol{x}_i\) and \(\boldsymbol{x}_j\) take the values \(a\) and \(b\) on attribute \(k\), i.e. \(x_{i,k}=a\), \(x_{j,k}=b\). Define the following quantities:

  • \(m_{k,a}\): the number of samples taking value \(a\) on attribute \(k\)

  • \(m_{k,a,l}\): the number of samples in the \(l\)-th cluster taking value \(a\) on attribute \(k\)

  • \(L\): the number of sample clusters

\[\mathrm{VDM}_p(x_{i,k}, x_{j,k}) = \mathrm{VDM}_p(a, b) = \sum_{l=1}^{L} \left| \frac{m_{k,a,l}}{m_{k,a}} - \frac{m_{k,b,l}}{m_{k,b}} \right|^{p} \]
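The VDM above can be sketched in NumPy for a made-up nominal attribute and cluster assignment (the attribute values and labels are assumptions for illustration):

```python
import numpy as np

# Made-up nominal attribute values for 8 samples and their cluster labels.
attr    = np.array(["red", "red", "red", "blue", "red", "blue", "blue", "blue"])
cluster = np.array([0, 0, 0, 0, 1, 1, 1, 1])

def vdm_p(a, b, attr, cluster, p=2):
    """Value Difference Metric between attribute values a and b."""
    m_a = np.sum(attr == a)
    m_b = np.sum(attr == b)
    total = 0.0
    for l in np.unique(cluster):
        m_al = np.sum((attr == a) & (cluster == l))
        m_bl = np.sum((attr == b) & (cluster == l))
        total += abs(m_al / m_a - m_bl / m_b) ** p
    return total

d = vdm_p("red", "blue", attr, cluster)
```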

1.5 Mixed Variables

The Minkowski distance and the VDM can be combined. Suppose the dataset has \(K_c\) continuous variables and \(K - K_c\) categorical variables; then:

\[\text{MinkovDM}_{p} (\boldsymbol{x}_{i}, \boldsymbol{x}_{j}) = \left[\sum_{k=1}^{K_{c}} \left|x_{i, k}-x_{j, k} \right|^{p} + \sum_{k=K_{c}+1}^{K} \text{VDM}_{p} \left(x_{i, k}, x_{j, k} \right) \right]^{\frac{1}{p}} \]

Other approaches: k-modes and k-prototypes: github

2. Classes and Clusters

2.1 Definition and Properties of a Cluster

Notation:

  • \(G_i\): the \(i\)-th class or cluster

  • \(n_{G_i} = |G_i|\): the number of samples in cluster \(G_i\)

2.1.1 Definition of a Class or Cluster

Class or cluster: let \(T\) be a given positive threshold. If any two samples \(\boldsymbol{x}_i\) and \(\boldsymbol{x}_j\) in a set \(G\) satisfy:

\[d_{ij} \leq T \]

then \(G\) is called a class or cluster.

2.1.2 Common Cluster Statistics

The mean, or centroid, of cluster \(G\), \(\bar{\boldsymbol{x}}_G\):

\[\text{cen}(G) = \mu_G = \bar{\boldsymbol{x}}_G = \frac{1}{n_G} \sum_{i=1}^{n_G} \boldsymbol{x}_i = \frac{1}{|G|} \sum_{i=1}^{|G|} \boldsymbol{x}_i \]

where \(n_G\) is the number of samples in cluster \(G\).

The diameter \(\mathrm{diam}(G)\) of a cluster: the maximum distance between any two samples in \(G\):

\[\mathrm{diam}(G) = \max_{\boldsymbol{x}_i, \ \boldsymbol{x}_j \ \in \ G} \{ d_{ij} \} = \max_{1 \leq i < j \leq |G|} \{ d_{ij} \} \]

The average pairwise distance \(\mathrm{ave}(G)\) between samples in the cluster:

\[\mathrm{ave} (G) = \frac{2}{|G|(|G|-1)} \sum_{1 \leq i < j \leq |G|} d_{ij} \]

The scatter matrix \(A_G\) of the cluster samples:

\[A_G = \sum_{i=1}^{n_G} (\boldsymbol{x}_i - \bar{\boldsymbol{x}}_G)(\boldsymbol{x}_i - \bar{\boldsymbol{x}}_G)^{\top} \]

The sample covariance matrix \(S_G\) of the cluster:

\[S_G = \frac{1}{n_G - 1} A_G = \frac{1}{n_G - 1} \sum_{i=1}^{n_G} (\boldsymbol{x}_i - \bar{\boldsymbol{x}}_G)(\boldsymbol{x}_i - \bar{\boldsymbol{x}}_G)^{\top} \]

where \(n_G\) is the number of samples in the cluster.
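The statistics above can be computed for a small made-up cluster as follows (a sketch; the covariance uses the \(n-1\) normalization, one common convention):

```python
import numpy as np
from itertools import combinations

# A made-up cluster of three 2-D samples.
G = np.array([[0.0, 0.0],
              [2.0, 0.0],
              [0.0, 2.0]])

centroid = G.mean(axis=0)

pair_dists = [np.linalg.norm(a - b) for a, b in combinations(G, 2)]
n = len(G)
diam = max(pair_dists)                        # diameter diam(G)
ave = 2.0 / (n * (n - 1)) * sum(pair_dists)   # average pairwise distance ave(G)

D = G - centroid
A = D.T @ D                                   # scatter matrix A_G
S = A / (n - 1)                               # covariance (n - 1 normalization)
```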

2.2 Distance Measures Between Clusters

Definitions:

  • \(D(p,q)\): the distance between two clusters \(G_p\) and \(G_q\), also called the linkage

  • \(G_p\) contains \(n_p\) samples, with centroid (mean) \(\bar{\boldsymbol{x}}_p\)

  • \(G_q\) contains \(n_q\) samples, with centroid (mean) \(\bar{\boldsymbol{x}}_q\)

Minimum distance (single linkage):

\[d_{\min}(G_p, G_q) = \min_{\boldsymbol{x}_i \in G_p \ , \ \boldsymbol{x}_j \in G_q} \ \{ d_{ij} \} \]

Maximum distance (complete linkage):

\[d_{\max}(G_p, G_q) = \max_{\boldsymbol{x}_i \in G_p \ , \ \boldsymbol{x}_j \in G_q} \ \{ d_{ij} \} \]

Centroid distance (centroid linkage):

\[d_{\mathrm{cen}} (G_p, G_q) = d_{\mathrm{cen}} (\boldsymbol{\mu}_{p}, \boldsymbol{\mu}_{q}) = d_{\bar{\boldsymbol{x}}_p \ \bar{\boldsymbol{x}}_q} \]

Average distance (average or mean linkage):

\[d_{\mathrm{ave}}(G_p, G_q) = \frac{1}{n_p \ n_q} \sum_{\boldsymbol{x}_i \in G_p} \sum_{\boldsymbol{x}_j \in G_q} d_{ij} \]
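The four linkages can be sketched for two made-up clusters as follows:

```python
import numpy as np

# Two made-up clusters of 2-D points.
Gp = np.array([[0.0, 0.0], [1.0, 0.0]])
Gq = np.array([[4.0, 0.0], [6.0, 0.0]])

# Matrix of all pairwise distances between the two clusters.
pair_d = np.array([[np.linalg.norm(a - b) for b in Gq] for a in Gp])

d_min = pair_d.min()    # single (minimum) linkage
d_max = pair_d.max()    # complete (maximum) linkage
d_ave = pair_d.mean()   # average linkage
d_cen = np.linalg.norm(Gp.mean(axis=0) - Gq.mean(axis=0))  # centroid linkage
```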

3. Performance Evaluation

A good clustering result has high "intra-cluster similarity" and low "inter-cluster similarity".

  • External index: compares the clustering result against a "reference model".

  • Internal index: evaluates the clustering result directly, without using any reference model.

Notation:

  • \(\mathcal{D} = \{\boldsymbol{x}_1, \boldsymbol{x}_2, \cdots, \boldsymbol{x}_N\}\): the dataset; \(N\) is the total number of samples

  • \(\mathcal{C} = \{C_1, C_2, \cdots, C_K \}\): the cluster partition produced by the clustering algorithm; \(K\) is its number of clusters

  • \(\mathcal{C}^* = \{C_1^*, C_2^*, \cdots, C_S^* \}\): the cluster partition given by the reference model; \(S\) is its number of clusters

  • \(\lambda\) and \(\lambda^*\): the cluster label vectors corresponding to \(\mathcal{C}\) and \(\mathcal{C}^*\), respectively.

  • Pairing the samples, define:

    \[\begin{aligned} a=|S S|, \quad & S S=\left\{\left(\boldsymbol{x}_{i}, \boldsymbol{x}_{j} \right) \mid \lambda_{i}=\lambda_{j}, \lambda_{i}^{*}=\lambda_{j}^{*}, i<j \right\} \\ b=|S D|, \quad & S D=\left\{\left(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}\right) \mid \lambda_{i}=\lambda_{j}, \lambda_{i}^{*} \neq \lambda_{j}^{*}, i<j \right\} \\ c=|D S|, \quad & D S=\left\{\left(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}\right) \mid \lambda_{i} \neq \lambda_{j}, \lambda_{i}^{*}=\lambda_{j}^{*}, i<j \right\} \\ d=|D D|, \quad & D D=\left\{\left(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}\right) \mid \lambda_{i} \neq \lambda_{j}, \lambda_{i}^{*} \neq \lambda_{j}^{*}, i<j \right\} \end{aligned} \]

    • The set \(SS\) contains the sample pairs that belong to the same cluster in \(\mathcal{C}\) and also to the same cluster in \(\mathcal{C}^*\)

    • The set \(SD\) contains the sample pairs in the same cluster in \(\mathcal{C}\) but in different clusters in \(\mathcal{C}^*\)

    • The set \(DS\) contains the sample pairs in different clusters in \(\mathcal{C}\) but in the same cluster in \(\mathcal{C}^*\)

    • The set \(DD\) contains the sample pairs in different clusters in \(\mathcal{C}\) and also in different clusters in \(\mathcal{C}^*\)

    • Since each sample pair \((\boldsymbol{x}_{i}, \boldsymbol{x}_{j})\) can appear in only one of these sets:

\[a+b+c+d = \binom {N}{2} = \frac{N(N-1)}{2} \]

Using a contingency table (Rand index, Wikipedia, 2023):

\[\begin{array}{c|cccc|c} {{} \atop \mathcal{C}} \! \diagdown \!^{\mathcal{C}^*} & \mathcal{C}^*_{1} & \mathcal{C}^*_{2} & \cdots & \mathcal{C}^*_{S} &{\text{sums}} \\ \hline \mathcal{C}_{1} & n_{11} & n_{12} & \cdots & n_{1S} & a_{1} \\ \mathcal{C}_{2} & n_{21} & n_{22} & \cdots & n_{2S} & a_{2} \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ \mathcal{C}_{K} & n_{K1} & n_{K2} & \cdots & n_{KS} & a_{K} \\ \hline {\text{sums}} & b_{1} & b_{2} & \cdots & b_{S} & N \end{array} \]

其中 \(N = \sum_{i} a_i = \sum_{j} b_j\)

3.0 Summary

  • External indices (Section 3.1)

  • Internal indices (Section 3.2)

3.1 External Indices for Clustering Performance

(1) Jaccard Coefficient (JC)

\[\mathrm{JC} = \frac{a}{a+b+c} \]

  • \(\mathrm{JC} \in [0,1]\); larger values are better.

(2) Fowlkes–Mallows Index (FMI)

\[\text{FMI} = \sqrt{\frac{a}{a+b} \cdot \frac{a}{a+c}} \]

  • Random (uniform) label assignments have an FMI score close to 0

  • Values close to zero indicate two label assignments that are largely independent, while values close to one indicate significant agreement

    • \(\mathrm{FMI} \in [0, 1]\)

(3) Rand Index (RI)

\[\text{RI} = \frac{(a+d)}{N(N-1) / 2} = \frac{2 (a+d)}{N(N-1)} = \frac{a+d}{a+b+c+d} \]

Adjusted Rand Index (ARI) (Wikipedia, 2023):

\[\mathrm{ARI} = \frac{ \text{RI} - \mathbb{E}[\text{RI}] }{ \max(\text{RI}) - \mathbb{E}[\text{RI}] } \]

  • Random (uniform) label assignments have an ARI score close to 0

  • RI does not guarantee that random label assignments will get a value close to zero (especially if the number of clusters is in the same order of magnitude as the number of samples)

  • Lower values indicate dissimilar labelings; similar clusterings have a high RI or ARI, and 1 is the perfect match score.

    • \(\mathrm{RI} \in [0, 1]\) and \(\mathrm{ARI} \in [-1, 1]\)
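The pair counts \(a, b, c, d\) and the external indices above can be computed directly from two label vectors (a sketch with made-up labels):

```python
from itertools import combinations

# Made-up clustering labels and reference labels for N = 6 samples.
labels = [0, 0, 0, 1, 1, 1]   # lambda
ref    = [0, 0, 1, 1, 1, 1]   # lambda*

a = b = c = d = 0
for i, j in combinations(range(len(labels)), 2):
    same_c = labels[i] == labels[j]
    same_r = ref[i] == ref[j]
    if same_c and same_r:
        a += 1      # SS
    elif same_c:
        b += 1      # SD
    elif same_r:
        c += 1      # DS
    else:
        d += 1      # DD

n_pairs = a + b + c + d                       # N(N-1)/2
JC  = a / (a + b + c)                         # Jaccard coefficient
FMI = (a / (a + b) * a / (a + c)) ** 0.5      # Fowlkes-Mallows index
RI  = (a + d) / n_pairs                       # Rand index
```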

3.2 Internal Indices for Clustering Performance

3.2.1 CH Index

The Calinski–Harabasz Index (CHI) is defined as the ratio of the mean between-cluster dispersion to the within-cluster dispersion:

\[\mathrm{CHI} = \frac{\mathrm{tr}(\mathbf{B}) / (K-1)}{\mathrm{tr}(\mathbf{W}) / (N-K)} = \frac{\mathrm{tr}(\mathbf{B})}{\mathrm{tr}(\mathbf{W})} \cdot \frac{N-K}{K-1} \]

  • where \(\mathrm{tr}(\mathbf{B})\) is the trace of the between-cluster dispersion matrix

  • and \(\mathrm{tr}(\mathbf{W})\) is the trace of the within-cluster dispersion matrix

\[\begin{aligned} \mathbf{W} &= \sum_{k=1}^K \mathbf{W}_k = \sum_{k=1}^K \sum_{ \boldsymbol{x}_i \in \mathcal{C}_k} (\boldsymbol{x}_i - \boldsymbol{c}_k) (\boldsymbol{x}_i - \boldsymbol{c}_k)^{\top} \\ \mathbf{B} &= \sum_{k=1}^K \mathbf{B}_k = \sum_{k=1}^K N_k (\boldsymbol{c}_k - \boldsymbol{c}_0) (\boldsymbol{c}_k - \boldsymbol{c}_0)^{\top} \end{aligned} \]

  • where \(\mathcal{C}_k\) is the set of samples in cluster \(k\)

  • \(\boldsymbol{c}_k\) is the center (centroid) of cluster \(k\)

  • \(\boldsymbol{c}_0\) is the center of the whole dataset

  • \(N_k\) is the number of samples in cluster \(k\)

Properties

  • The score is higher when clusters are dense and well separated, which relates to a standard concept of a cluster.

  • \(\mathrm{CHI} \in [0, +\infty)\)
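A from-scratch sketch of the CHI for two made-up, well-separated clusters (scikit-learn also provides `metrics.calinski_harabasz_score`):

```python
import numpy as np

# Two well-separated made-up clusters.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])

N = len(X)
ks = np.unique(labels)
K = len(ks)
c0 = X.mean(axis=0)                            # center of the whole dataset

tr_W = tr_B = 0.0
for k in ks:
    Xk = X[labels == k]
    ck = Xk.mean(axis=0)
    tr_W += np.sum((Xk - ck) ** 2)             # tr(W_k)
    tr_B += len(Xk) * np.sum((ck - c0) ** 2)   # tr(B_k)

CHI = (tr_B / (K - 1)) / (tr_W / (N - K))
```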

3.2.2 DB Index

The Davies–Bouldin Index (DBI) (周志华, 2016):

\[\mathrm{DBI} = \frac{1}{K} \sum_{p=1}^{K} \ \max _{q, \ q \neq p} \left\{ \frac{\mathrm{avg}(C_{p}) + \mathrm{avg} (C_{q})}{d_{\mathrm{cen}} (C_{p}, C_{q})} \right\} \]

An equivalent formulation (Wikipedia, 2022; scikit-learn, 2022):

\[\mathrm{DBI} = \frac{1}{K} \sum_{k=1}^K \max_{k \neq s} R(\mathcal{C}_k, \mathcal{C}_s) \]

  • where \(R(\mathcal{C}_k, \mathcal{C}_s)\) measures the closeness of the two clusters \(\mathcal{C}_k\) and \(\mathcal{C}_s\); it is non-negative and symmetric, i.e. \(R(\mathcal{C}_k, \mathcal{C}_s) = R(\mathcal{C}_s, \mathcal{C}_k) \geq 0\):

    \[R(\mathcal{C}_k, \mathcal{C}_s) = \frac{S(\mathcal{C}_k) + S(\mathcal{C}_s)} {d_{\text{cen}}(\mathcal{C}_k, \mathcal{C}_s)} \]

  • where \(S(\mathcal{C}_k)\) is the average distance from all samples in cluster \(\mathcal{C}_k\) to its centroid \(\mathrm{cen} (\mathcal{C}_k)\):

    \[S(\mathcal{C}_k) = \frac{1}{|\mathcal{C}_k|} \sum_{\boldsymbol{x}_i \in \mathcal{C}_k} \text{dist} \big( \boldsymbol{x}_i, \mathrm{cen} (\mathcal{C}_k) \big) \]

  • \(d_{\text{cen}}(\mathcal{C}_k, \mathcal{C}_s)\) is the distance between the two cluster centroids:

    \[d_{\mathrm{cen}} (\mathcal{C}_k, \mathcal{C}_s) = d_{\mathrm{cen}} (\bar{\boldsymbol{x}}_k, \bar{\boldsymbol{x}}_s) \]

  • The DBI is generally higher for convex clusters than for other cluster concepts, such as density-based clusters like those obtained from DBSCAN.

  • The usage of centroid distance limits the distance metric to Euclidean space.
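A from-scratch sketch of the second formulation for made-up data (scikit-learn also provides `metrics.davies_bouldin_score`):

```python
import numpy as np

# Two made-up clusters.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])

ks = np.unique(labels)
cents = np.array([X[labels == k].mean(axis=0) for k in ks])
# S(C_k): average distance of cluster members to their centroid.
S = np.array([np.mean(np.linalg.norm(X[labels == k] - cents[i], axis=1))
              for i, k in enumerate(ks)])

K = len(ks)
DBI = np.mean([max((S[i] + S[j]) / np.linalg.norm(cents[i] - cents[j])
                   for j in range(K) if j != i)
               for i in range(K)])
```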

3.2.3 Dunn Index

The Dunn Index (DI) for a partition into \(k\) clusters (Wikipedia, Dunn index, 2023):

\[\mathrm{DI}_{k} = \frac{ \min \limits_{1 \leq p < q \leq k} \delta( \mathcal{C}_{p}, \mathcal{C}_{q}) }{ \max \limits_{1 \leq p \leq k} \Delta (\mathcal{C}_p) } \]

where \(\delta( \mathcal{C}_{i}, \mathcal{C}_{j})\) indicates the inter-cluster distance between \(\mathcal{C}_{i}\) and \(\mathcal{C}_{j}\); \(\Delta (\mathcal{C}_k)\) indicates the intra-cluster distance of cluster \(\mathcal{C}_k\).

The DI over all clusters (周志华, 2016, p. 199; jqmviegas, 2023):

\[\mathrm{DI} = \frac{ \min \limits_{k, s} d_{\min } ( \mathcal{C}_{k}, \mathcal{C}_{s}) }{ \max \limits_{k} \ \mathrm{diam} ( \mathcal{C}_k ) } \]

  • Larger DI values are better
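The DI is not included in scikit-learn, so here is a small from-scratch sketch (the data are made up):

```python
import numpy as np
from itertools import combinations

# Two compact, well-separated made-up clusters.
X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
labels = np.array([0, 0, 1, 1])

clusters = [X[labels == k] for k in np.unique(labels)]

# delta: minimum inter-cluster distance over all cluster pairs.
delta = min(min(np.linalg.norm(a - b) for a in Cp for b in Cq)
            for Cp, Cq in combinations(clusters, 2))
# Delta: maximum cluster diameter.
Delta = max(max(np.linalg.norm(a - b) for a, b in combinations(C, 2))
            for C in clusters)

DI = delta / Delta
```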

3.2.4 Silhouette Index

For each data point \(\boldsymbol{x}_i\), the silhouette value \(s(i)\) is computed as:

\[\begin{aligned} &s(i) = {\frac {b(i)-a(i)}{ \max\{a(i),b(i)\}}}, && \text{If} \quad |\mathcal{C}_{I}|>1 \\ &s(i) = 0, && \text{If} \quad |\mathcal{C}_{I}|=1, \text{ i.e. } \mathcal{C}_I = \{ \boldsymbol{x}_i \} \end{aligned} \]

  • where \(a(i)\) is the mean distance between the sample \(\boldsymbol{x}_i\) and all other points in the same class (denoted as \(\mathcal{C}_I\));

  • \(b(i)\) is the mean distance between a sample \(\boldsymbol{x}_i\) and all other points in the next nearest cluster

\[\begin{aligned} a(i) &= {\frac {1}{ |\mathcal{C}_I|-1}}\sum _{j \in \mathcal{C}_{I}, i\neq j} d(\boldsymbol{x}_i, \boldsymbol{x}_j) \qquad \text{where} \quad \boldsymbol{x}_i \in \mathcal{C}_{I} \\ b(i) &= \min _{J \neq I}{\frac {1}{|\mathcal{C}_{J}|}}\sum _{j\in \mathcal{C}_{J}} d(\boldsymbol{x}_i, \boldsymbol{x}_j) \end{aligned} \]

The overall Silhouette index for the whole dataset:

  • Simple mean over all samples, implemented by sklearn.metrics.silhouette_score

    \[\mathrm{SI} = \frac{1}{N} \sum_{i} s(i) \]

  • Silhouette coefficient (Kaufman et al.): the maximum of the mean \(s(i)\) over all data of the entire dataset, taken over the number of clusters \(k\):

    \[\mathrm{SC} = \max_{k} \bar{s}(k), \qquad \bar{s}(k) = \frac{1}{N} \sum_{i=1}^{N} {s}(i) \]

    • where \(\bar{s}(k)\) represents the mean \(s(i)\) over all data of the entire dataset for a specific number of clusters \(k\).

Properties

  • A higher SI score relates to a model with better-defined clusters.

    • \(s(i) \in [-1, 1]\)

    • \(-1\) indicates incorrect clustering and \(+1\) indicates highly dense clustering.

    • Scores around \(0\) indicate overlapping clusters

    • The score is higher when clusters are dense and well separated, which relates to a standard concept of a cluster.
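A from-scratch sketch of \(s(i)\) and the simple-mean SI for made-up data:

```python
import numpy as np

# Two made-up clusters.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])

def silhouette_values(X, labels):
    """Per-sample silhouette values s(i) per the definition above."""
    s = np.zeros(len(X))
    for i in range(len(X)):
        own = labels[i]
        in_own = labels == own
        if in_own.sum() == 1:
            continue                # s(i) = 0 for singleton clusters
        dists = np.linalg.norm(X - X[i], axis=1)
        a = dists[in_own & (np.arange(len(X)) != i)].mean()
        b = min(dists[labels == k].mean()
                for k in np.unique(labels) if k != own)
        s[i] = (b - a) / max(a, b)
    return s

SI = silhouette_values(X, labels).mean()   # simple mean over all samples
```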

4. Implementation in Python

4.1 Distance Measures

4.1.1 sklearn

The sklearn.metrics.pairwise module of the sklearn library, site

  • cosine_similarity(X, Y=None, dense_output=True)

  • cosine_distances(X, Y=None, *) and paired_cosine_distances(X, Y)

    • equal 1 minus the cosine similarity

  • euclidean_distances(X, Y=None, *) and paired_euclidean_distances(X, Y)

  • manhattan_distances(X, Y=None, *) and paired_manhattan_distances(X, Y)

  • haversine_distances(X, Y=None): Haversine (or great circle) distance

    • Parameters: X, Y array-like of shape (n_samples_X, 2); the first coordinate of each point is the latitude, the second is the longitude, given in radians

  • nan_euclidean_distances(X, Y=None, *)

    • Calculate the euclidean distances in the presence of missing values.

    • dist(x, y) = sqrt(weight * sq. distance from present coordinates), where weight = total # of coordinates / # of present coordinates

    • Example: \(x_1=[3, \mathrm{NA}, \mathrm{NA}, 6]\), \(x_2=[1, \mathrm{NA}, 4, 5]\), \(\mathrm{dist}(x_1, x_2)=\sqrt{\dfrac{4}{2}[(3-1)^2 + (6-5)^2]}\)

General-purpose interfaces:

  • metrics.pairwise_distances(X, Y=None, metric='euclidean')

    • Computes the distance matrix between X and Y

  • metrics.pairwise.paired_distances(X, Y, *, metric='euclidean')

    • Computes the distances between corresponding (paired) samples of X and Y; X and Y must have the same shape

  • metric parameter:

    • From scikit-learn: ['cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan']. These metrics support sparse matrix inputs. ['nan_euclidean'] but it does not yet support sparse matrices.

    • From scipy.spatial.distance: ['braycurtis', 'canberra', 'chebyshev', 'correlation', 'dice', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule']

Others: kernel distances

4.1.2 scipy

The cdist() function in the spatial.distance module of the scipy library:

dist = scipy.spatial.distance.cdist(XA, XB, metric='euclidean', *)
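A minimal usage sketch (assuming scipy is installed; the arrays are made up):

```python
import numpy as np
from scipy.spatial.distance import cdist

XA = np.array([[0.0, 0.0], [1.0, 1.0]])
XB = np.array([[3.0, 4.0]])

D_euc = cdist(XA, XB, metric='euclidean')   # (n_A, n_B) distance matrix
D_man = cdist(XA, XB, metric='cityblock')   # Manhattan distance
```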

References

Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). "Section 14.3 Cluster Analysis" in The elements of statistical learning: Data mining, inference, and prediction (2nd ed). Springer.

周志华, "第 9 章 聚类" [Chapter 9: Clustering], in 机器学习 [Machine Learning] (2nd ed.), 2016.

李航, "14.1 聚类的基本概念" [Section 14.1: Basic Concepts of Clustering], in 统计学习方法 [Statistical Learning Methods] (2nd ed.).

Wilson, D. R., & Martinez, T. R. (1997). Improved heterogeneous distance functions. Journal of artificial intelligence research, 6, 1-34.

Payne, T. R., & Edwards, P. (1998). Implicit feature selection with the value difference metric. In Proceedings of the European Conference on Artificial Intelligence-ECAI-98. Wiley.

scikit-learn documentation

  • Clustering performance evaluation, site

wikipedia:

  • Davies–Bouldin index, site

  • Rand index, site

  • Silhouette index, site

  • Dunn index, site

posted @ 2022-06-24 11:39 veager