Cluster Analysis - Similarities and Measurements
1. Similarity Measurement
1.1 Definitions
1.1.1 Similarity and distance
- Similarity: measures how alike two samples are.
- Distance (or dissimilarity): the smaller the distance, the greater the similarity.
1.1.2 Notation
- Samples \(\boldsymbol{x}_i, \boldsymbol{x}_j \in \mathcal{X} \subseteq \mathbb{R}^K\), where \(\boldsymbol{x}_i = \left[x_{i,1}, \cdots, x_{i,k}, \cdots, x_{i,K} \right]^{\top}\)
- Sample attributes (features): \(k = 1, 2, \cdots, K\)
1.2 Distance Measures
1.2.1 Basic properties of a distance measure
A well-defined distance measure should satisfy the following basic properties:
- Non-negativity: \(d(\boldsymbol{x}_i, \boldsymbol{x}_j) \geq 0\)
- Identity: \(d(\boldsymbol{x}_i, \boldsymbol{x}_j) = 0\) if and only if \(\boldsymbol{x}_i = \boldsymbol{x}_j\)
- Symmetry: \(d(\boldsymbol{x}_i, \boldsymbol{x}_j) = d(\boldsymbol{x}_j, \boldsymbol{x}_i)\)
- Triangle inequality: \(d(\boldsymbol{x}_i, \boldsymbol{x}_j) \leq d(\boldsymbol{x}_i, \boldsymbol{x}_k) + d(\boldsymbol{x}_k, \boldsymbol{x}_j)\)
1.2.2 Minkowski Distance
The Minkowski distance between samples \(\boldsymbol{x}_i\) and \(\boldsymbol{x}_j\) is defined as:
\[d_{ij} = \left( \sum_{k=1}^{K} \left| x_{i,k} - x_{j,k} \right|^{p} \right)^{1/p}\]
where \(p \geq 1\).
- When \(p = 1\), it is called the Manhattan distance.
- When \(p = 2\), it is called the Euclidean distance.
- When \(p = +\infty\), it is called the Chebyshev distance: \(d_{ij} = \max \limits_k |x_{i,k} - x_{j,k}|\)
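As a quick illustration (a minimal NumPy sketch with made-up vectors, not part of the original notes), the three special cases can be computed directly:

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance between two vectors, for p >= 1."""
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 2.0, 1.0])

d_manhattan = minkowski(x, y, 1)            # |1-4| + |2-2| + |3-1| = 5
d_euclidean = minkowski(x, y, 2)            # sqrt(9 + 0 + 4) = sqrt(13)
d_chebyshev = float(np.max(np.abs(x - y)))  # max(3, 0, 2) = 3
```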
1.2.3 Mahalanobis Distance
The Mahalanobis distance is a scale-invariant distance measure. Given a sample set \(\boldsymbol{X} = [x_{ik}]\) with covariance matrix \(\boldsymbol{S}\), the distance \(d_{ij}\) between two samples \(\boldsymbol{x}_i\) and \(\boldsymbol{x}_j\) is defined as:
\[d_{ij} = \left[ (\boldsymbol{x}_i - \boldsymbol{x}_j)^{\top} \boldsymbol{S}^{-1} (\boldsymbol{x}_i - \boldsymbol{x}_j) \right]^{1/2}\]
When \(\boldsymbol{S}\) is the identity matrix, i.e., when the features (attributes) of the dataset are mutually independent and each has variance 1, the Mahalanobis distance reduces to the Euclidean distance.
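A minimal NumPy sketch of this definition (the sample vectors are illustrative); with the identity matrix in place of \(\boldsymbol{S}\), it reduces to the Euclidean distance:

```python
import numpy as np

def mahalanobis(xi, xj, S):
    """Mahalanobis distance between two samples, given a covariance matrix S."""
    diff = xi - xj
    return float(np.sqrt(diff @ np.linalg.inv(S) @ diff))

xi = np.array([1.0, 2.0])
xj = np.array([4.0, 6.0])

# With S = I, the Mahalanobis distance equals the Euclidean distance (= 5 here).
d = mahalanobis(xi, xj, np.eye(2))
```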
1.2.4 Haversine Distance
The Haversine distance, or great-circle distance, between two points on a sphere is:
\[d = 2 r \arcsin \left( \sqrt{ \sin^{2} \left( \frac{x_1 - y_1}{2} \right) + \cos(x_1) \cos(y_1) \sin^{2} \left( \frac{x_2 - y_2}{2} \right) } \right)\]
- where \(x_1\) and \(y_1\) denote the latitudes and \(x_2\) and \(y_2\) denote the longitudes of the two points, in radians
- \(r\) is the radius of the sphere (the Earth)
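A pure-NumPy sketch of the same computation (sklearn's haversine_distances implements this formula; the Earth radius constant and the sample points below are assumptions for illustration):

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius in km

def haversine(lat1, lon1, lat2, lon2, r=EARTH_RADIUS_KM):
    """Great-circle distance between two (latitude, longitude) points in radians."""
    h = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return float(2 * r * np.arcsin(np.sqrt(h)))

# Two antipodal points on the equator are half a great circle apart: pi * r.
d = haversine(0.0, 0.0, 0.0, np.pi)
```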
1.3 Similarity Measures
1.3.1 Correlation Coefficient
The correlation coefficient \(r_{ij}\) between two samples \(\boldsymbol{x}_i\) and \(\boldsymbol{x}_j\) is defined as:
\[r_{ij} = \frac{\sum_{k=1}^{K} (x_{i,k} - \bar{x}_i)(x_{j,k} - \bar{x}_j)} {\left[ \sum_{k=1}^{K} (x_{i,k} - \bar{x}_i)^2 \sum_{k=1}^{K} (x_{j,k} - \bar{x}_j)^2 \right]^{1/2}}\]
where \(\bar{x}_i = \frac{1}{K} \sum_{k=1}^{K} x_{i,k}\) and \(\bar{x}_j = \frac{1}{K} \sum_{k=1}^{K} x_{j,k}\).
- The correlation coefficient ranges over \(-1 \leq r \leq 1\).
- The closer \(|r|\) is to 1, the more similar the two samples; the closer it is to 0, the less similar they are.
- For more on correlation coefficients, see the blog post: Correlation Analysis
1.3.2 Cosine Similarity
The cosine similarity \(s_{ij}\) between two samples \(\boldsymbol{x}_i\) and \(\boldsymbol{x}_j\) is defined as:
\[s_{ij} = \frac{\langle \boldsymbol{x}_i, \boldsymbol{x}_j \rangle}{\| \boldsymbol{x}_i \| \, \| \boldsymbol{x}_j \|} = \frac{\sum_{k=1}^{K} x_{i,k} \, x_{j,k}} {\left[ \sum_{k=1}^{K} x_{i,k}^2 \sum_{k=1}^{K} x_{j,k}^2 \right]^{1/2}}\]
where \(\langle \boldsymbol{x}_i, \boldsymbol{x}_j \rangle\) denotes the inner product of \(\boldsymbol{x}_i\) and \(\boldsymbol{x}_j\), i.e., \(\langle \boldsymbol{x}_i, \boldsymbol{x}_j \rangle = \boldsymbol{x}_i^{\top} \boldsymbol{x}_j\)
- The cosine similarity ranges over \(-1 \leq s \leq 1\) in general (for non-negative data, \(0 \leq s \leq 1\)).
- The closer \(s\) is to 1, the more similar the two samples; the closer it is to 0, the less similar they are.
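Both measures fit in a few lines of NumPy (the vectors below are illustrative); note that the correlation coefficient is just the cosine similarity of the mean-centred vectors:

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def correlation(x, y):
    """Pearson correlation: cosine similarity after centring each vector."""
    return cosine_similarity(x - x.mean(), y - y.mean())

x = np.array([1.0, 2.0, 3.0])

r = correlation(x, 2 * x + 1)    # a positive linear transform keeps r = 1
s = cosine_similarity(x, 3 * x)  # parallel vectors have s = 1
```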
1.4 Distance Measures for Categorical Variables
- Continuous variable
- Categorical (discrete) variable
- Ordinal variable
- Non-ordinal (or nominal) variable
1.4.1 Ordinal categorical variables
Examples of ordinal categorical variables:
- Satisfaction ratings: very dissatisfied, somewhat dissatisfied, neutral, somewhat satisfied, very satisfied
- Grades: A, B, C, D, E, F
Suppose the \(k\)-th attribute of dataset \(\boldsymbol{X}\) is an ordinal categorical variable. One can use the rank of each attribute value as the variable value and then treat the attribute as a continuous variable, i.e.:
\[x_{i,k} \rightarrow R(x_{i,k})\]
where \(R(x_{i,k})\) denotes the rank of the value \(x_{i,k}\) of the \(k\)-th attribute of sample \(\boldsymbol{x}_i\) within the dataset.
Alternatively, compute (Hastie et al., 2009):
\[x_{i,k} \rightarrow \frac{R(x_{i,k}) - 1/2}{L}\]
where \(L\) is the total number of levels of attribute \(k\) in dataset \(\boldsymbol{X}\).
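A small sketch of the \((r - 1/2)/L\) encoding, where the level names and sample values are made up for illustration:

```python
def encode_ordinal(values, levels):
    """Map ordered category values to (rank - 1/2) / L, with ranks starting at 1."""
    L = len(levels)
    rank = {level: r + 1 for r, level in enumerate(levels)}
    return [(rank[v] - 0.5) / L for v in values]

levels = ["very dissatisfied", "neutral", "very satisfied"]
encoded = encode_ordinal(["very dissatisfied", "very satisfied"], levels)
# the encoded values fall evenly inside (0, 1): [0.5/3, 2.5/3]
```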
1.4.2 Nominal variables
VDM (Value Difference Metric):
Suppose the \(k\)-th attribute of dataset \(\mathbf{X}\) is a nominal variable, and two samples \(\boldsymbol{x}_i\) and \(\boldsymbol{x}_j\) take values \(a\) and \(b\) on attribute \(k\), i.e., \(x_{i,k} = a\), \(x_{j,k} = b\). Define:
- \(m_{k,a}\): the number of samples taking value \(a\) on attribute \(k\)
- \(m_{k,a,l}\): the number of samples in the \(l\)-th cluster taking value \(a\) on attribute \(k\)
- \(L\): the number of clusters
The VDM distance between the two values \(a\) and \(b\) on attribute \(k\) is then:
\[\mathrm{VDM}_p (a, b) = \sum_{l=1}^{L} \left| \frac{m_{k,a,l}}{m_{k,a}} - \frac{m_{k,b,l}}{m_{k,b}} \right|^{p}\]
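Using these counts, \(\mathrm{VDM}_p(a,b) = \sum_l | m_{k,a,l}/m_{k,a} - m_{k,b,l}/m_{k,b} |^p\). A plain-Python sketch over a single attribute (the toy data are illustrative):

```python
def vdm(values, clusters, a, b, p=2):
    """Value Difference Metric between attribute values a and b.

    values[i]   -- the attribute value of sample i
    clusters[i] -- the cluster label of sample i
    """
    m_a = sum(1 for v in values if v == a)
    m_b = sum(1 for v in values if v == b)
    total = 0.0
    for l in set(clusters):
        m_al = sum(1 for v, c in zip(values, clusters) if v == a and c == l)
        m_bl = sum(1 for v, c in zip(values, clusters) if v == b and c == l)
        total += abs(m_al / m_a - m_bl / m_b) ** p
    return total

# 'a' appears only in cluster 0 and 'b' only in cluster 1: maximal difference
d_far = vdm(["a", "a", "b", "b"], [0, 0, 1, 1], "a", "b")
# identical per-cluster distributions give distance 0
d_same = vdm(["a", "b", "a", "b"], [0, 0, 1, 1], "a", "b")
```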
1.5 Mixed Variables
The Minkowski distance and the VDM can be combined. Suppose the dataset has \(K_c\) continuous variables (ordered first) and \(K - K_c\) discrete variables; then:
\[\mathrm{MinkovDM}_p (\boldsymbol{x}_i, \boldsymbol{x}_j) = \left( \sum_{k=1}^{K_c} \left| x_{i,k} - x_{j,k} \right|^{p} + \sum_{k=K_c+1}^{K} \mathrm{VDM}_p (x_{i,k}, x_{j,k}) \right)^{1/p}\]
Other methods: k-modes and k-prototypes: github
2. Classes and Clusters
2.1 Definition and Characteristics of a Cluster
Notation:
- \(G_i\): the \(i\)-th class or cluster
- \(n_{G_i} = |G_i|\): the number of samples in cluster \(G_i\)
2.1.1 Definition of a class or cluster
Class or cluster: let \(T\) be a given positive threshold. If any two samples \(\boldsymbol{x}_i\) and \(\boldsymbol{x}_j\) in a set \(G\) satisfy:
\[d_{ij} \leq T\]
then \(G\) is called a class or cluster.
2.1.2 Common characteristics of a cluster
Mean, or centroid, of cluster \(G\), \(\bar{\boldsymbol{x}}_G\):
\[\bar{\boldsymbol{x}}_G = \frac{1}{n_G} \sum_{\boldsymbol{x}_i \in G} \boldsymbol{x}_i\]
where \(n_G\) is the number of samples in cluster \(G\).
Diameter \(\mathrm{diam}(G)\): the maximum distance between any two samples in \(G\):
\[\mathrm{diam}(G) = \max_{\boldsymbol{x}_i, \boldsymbol{x}_j \in G} d_{ij}\]
Average within-cluster distance \(\mathrm{ave}(G)\):
\[\mathrm{ave}(G) = \frac{2}{n_G (n_G - 1)} \sum_{\boldsymbol{x}_i, \boldsymbol{x}_j \in G, \, i < j} d_{ij}\]
Scatter matrix \(A_G\):
\[A_G = \sum_{\boldsymbol{x}_i \in G} (\boldsymbol{x}_i - \bar{\boldsymbol{x}}_G)(\boldsymbol{x}_i - \bar{\boldsymbol{x}}_G)^{\top}\]
Covariance matrix \(S_G\):
\[S_G = \frac{1}{K - 1} A_G\]
where \(K\) denotes the sample dimension (i.e., the number of attributes).
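These characteristics can be computed directly; a NumPy sketch with a toy three-point cluster (following the convention above, the covariance divides the scatter matrix by \(K - 1\), where \(K\) is the number of attributes):

```python
import numpy as np
from itertools import combinations

G = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])  # a toy cluster

centroid = G.mean(axis=0)

# diameter: the largest pairwise distance inside the cluster
diameter = max(float(np.linalg.norm(xi - xj)) for xi, xj in combinations(G, 2))

# scatter matrix A_G and covariance S_G = A_G / (K - 1), K = number of attributes
A = (G - centroid).T @ (G - centroid)
S = A / (G.shape[1] - 1)
```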
2.2 Distance Measures Between Clusters
Definitions:
- The distance \(D(p, q)\) between two clusters \(G_p\) and \(G_q\) is also called a linkage.
- \(G_p\) contains \(n_p\) samples with centroid (mean) \(\bar{\boldsymbol{x}}_p\)
- \(G_q\) contains \(n_q\) samples with centroid (mean) \(\bar{\boldsymbol{x}}_q\)
Single (minimum) linkage:
\[D_{\min}(p, q) = \min \left\{ d_{ij} \mid \boldsymbol{x}_i \in G_p, \boldsymbol{x}_j \in G_q \right\}\]
Complete (maximum) linkage:
\[D_{\max}(p, q) = \max \left\{ d_{ij} \mid \boldsymbol{x}_i \in G_p, \boldsymbol{x}_j \in G_q \right\}\]
Centroid linkage:
\[D_{\mathrm{cen}}(p, q) = d(\bar{\boldsymbol{x}}_p, \bar{\boldsymbol{x}}_q)\]
Average (mean) linkage:
\[D_{\mathrm{ave}}(p, q) = \frac{1}{n_p n_q} \sum_{\boldsymbol{x}_i \in G_p} \sum_{\boldsymbol{x}_j \in G_q} d_{ij}\]
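The four linkages can be sketched with plain NumPy (the two toy clusters are illustrative):

```python
import numpy as np

def pairwise(Gp, Gq):
    """All Euclidean distances between samples of two clusters."""
    return np.linalg.norm(Gp[:, None, :] - Gq[None, :, :], axis=-1)

def single_linkage(Gp, Gq):
    return float(pairwise(Gp, Gq).min())

def complete_linkage(Gp, Gq):
    return float(pairwise(Gp, Gq).max())

def average_linkage(Gp, Gq):
    return float(pairwise(Gp, Gq).mean())

def centroid_linkage(Gp, Gq):
    return float(np.linalg.norm(Gp.mean(axis=0) - Gq.mean(axis=0)))

Gp = np.array([[0.0, 0.0], [0.0, 1.0]])
Gq = np.array([[3.0, 0.0], [4.0, 0.0]])
```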
3. Performance Evaluation
A good clustering result has high intra-cluster similarity and low inter-cluster similarity.
- External index: compare the clustering result against a reference model.
- Internal index: evaluate the clustering result directly, without any reference model.
Notation:
- \(\mathcal{D} = \{\boldsymbol{x}_1, \boldsymbol{x}_2, \cdots, \boldsymbol{x}_N\}\): the dataset; \(N\) is the total number of samples
- \(\mathcal{C} = \{C_1, C_2, \cdots, C_K \}\): the cluster partition produced by the clustering; \(K\) is the number of clusters
- \(\mathcal{C}^* = \{C_1^*, C_2^*, \cdots, C_S^* \}\): the partition given by the reference model; \(S\) is its number of clusters
- Let \(\lambda\) and \(\lambda^*\) denote the cluster label vectors corresponding to \(\mathcal{C}\) and \(\mathcal{C}^*\), respectively.
- Considering all pairs of samples, define:
\[\begin{aligned} a=|S S|, \quad & S S=\left\{\left(\boldsymbol{x}_{i}, \boldsymbol{x}_{j} \right) \mid \lambda_{i}=\lambda_{j}, \lambda_{i}^{*}=\lambda_{j}^{*}, i<j \right\} \\ b=|S D|, \quad & S D=\left\{\left(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}\right) \mid \lambda_{i}=\lambda_{j}, \lambda_{i}^{*} \neq \lambda_{j}^{*}, i<j \right\} \\ c=|D S|, \quad & D S=\left\{\left(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}\right) \mid \lambda_{i} \neq \lambda_{j}, \lambda_{i}^{*}=\lambda_{j}^{*}, i<j \right\} \\ d=|D D|, \quad & D D=\left\{\left(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}\right) \mid \lambda_{i} \neq \lambda_{j}, \lambda_{i}^{*} \neq \lambda_{j}^{*}, i<j \right\} \end{aligned} \]
- \(SS\) contains the sample pairs that belong to the same cluster in \(\mathcal{C}\) and to the same cluster in \(\mathcal{C}^*\)
- \(SD\) contains the sample pairs that belong to the same cluster in \(\mathcal{C}\) but to different clusters in \(\mathcal{C}^*\)
- \(DS\) contains the sample pairs that belong to different clusters in \(\mathcal{C}\) but to the same cluster in \(\mathcal{C}^*\)
- \(DD\) contains the sample pairs that belong to different clusters in \(\mathcal{C}\) and to different clusters in \(\mathcal{C}^*\)
- Since each sample pair \((\boldsymbol{x}_{i}, \boldsymbol{x}_{j})\) can appear in exactly one of these sets:
\[a + b + c + d = \frac{N(N-1)}{2}\]
- Alternatively, the counts can be obtained from a contingency table \([n_{ij}]\) with \(n_{ij} = |C_i \cap C_j^*|\), row sums \(a_i\), and column sums \(b_j\) (Rand index, wikipedia, 2023), where \(N = \sum_{i} a_i = \sum_{j} b_j\)
3.0 Summary
- External indices: Jaccard coefficient, Fowlkes-Mallows index, Rand index, adjusted Rand index
- Internal indices: Calinski-Harabasz index, Davies-Bouldin index, Dunn index, Silhouette index
3.1 External Indices
(1) Jaccard Coefficient (JC)
\[\mathrm{JC} = \frac{a}{a + b + c}\]
- \(\mathrm{JC} \in [0,1]\); larger values are better.
(2) Fowlkes-Mallows Index (FMI)
\[\mathrm{FMI} = \sqrt{\frac{a}{a+b} \cdot \frac{a}{a+c}}\]
- Random (uniform) label assignments have an FMI score close to 0.
- Values close to zero indicate two label assignments that are largely independent, while values close to one indicate significant agreement.
- \(\mathrm{FMI} \in [0, 1]\)
(3) Rand Index (RI)
\[\mathrm{RI} = \frac{2 (a + d)}{N (N - 1)}\]
Adjusted Rand Index (ARI) (wikipedia, 2023):
\[\mathrm{ARI} = \frac{\mathrm{RI} - E[\mathrm{RI}]}{\max(\mathrm{RI}) - E[\mathrm{RI}]}\]
- Random (uniform) label assignments have an ARI score close to 0.
- RI does not guarantee that random label assignments will get a value close to zero (especially if the number of clusters is of the same order of magnitude as the number of samples).
- Lower values indicate different labelings; similar clusterings have a high RI or ARI, and 1 is the perfect match score.
- \(\mathrm{RI} \in [0, 1]\) and \(\mathrm{ARI} \in [-1, 1]\)
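A plain-Python sketch of the pair counting and the closed-form indices above (for production use, sklearn provides adjusted_rand_score and fowlkes_mallows_score; the label vectors below are illustrative):

```python
from itertools import combinations

def pair_counts(pred, true):
    """Count a=|SS|, b=|SD|, c=|DS|, d=|DD| over all pairs i < j."""
    a = b = c = d = 0
    for i, j in combinations(range(len(pred)), 2):
        same_pred, same_true = pred[i] == pred[j], true[i] == true[j]
        if same_pred and same_true:
            a += 1
        elif same_pred:
            b += 1
        elif same_true:
            c += 1
        else:
            d += 1
    return a, b, c, d

def jaccard(a, b, c, d):
    return a / (a + b + c)

def fmi(a, b, c, d):
    return a / ((a + b) * (a + c)) ** 0.5

def rand_index(a, b, c, d):
    return (a + d) / (a + b + c + d)

# A clustering matching the reference partition gives JC = FMI = RI = 1,
# even though the cluster labels themselves are permuted.
counts = pair_counts([0, 0, 1, 1], [1, 1, 0, 0])
```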
3.2 Internal Indices
3.2.1 CH Index
Calinski-Harabasz Index (CHI): defined as the ratio of the mean between-cluster dispersion to the mean within-cluster dispersion:
\[\mathrm{CHI} = \frac{\mathrm{tr}(\mathbf{B})}{\mathrm{tr}(\mathbf{W})} \times \frac{N - K}{K - 1}\]
- where \(\mathrm{tr}(\mathbf{B})\) is the trace of the between-cluster dispersion matrix:
\[\mathbf{B} = \sum_{k=1}^{K} N_k (\boldsymbol{c}_k - \boldsymbol{c}_0)(\boldsymbol{c}_k - \boldsymbol{c}_0)^{\top}\]
- and \(\mathrm{tr}(\mathbf{W})\) is the trace of the within-cluster dispersion matrix:
\[\mathbf{W} = \sum_{k=1}^{K} \sum_{\boldsymbol{x} \in \mathcal{C}_k} (\boldsymbol{x} - \boldsymbol{c}_k)(\boldsymbol{x} - \boldsymbol{c}_k)^{\top}\]
- where \(\mathcal{C}_k\) is the set of samples in cluster \(k\)
- \(\boldsymbol{c}_k\) is the center (centroid) of cluster \(k\)
- \(\boldsymbol{c}_0\) is the center of the whole dataset
- \(N_k\) is the number of samples in cluster \(k\)
Properties:
- The score is higher when clusters are dense and well separated, which relates to a standard concept of a cluster.
- \(\mathrm{CHI} \in [0, +\infty)\)
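A NumPy sketch intended to match this definition (and sklearn.metrics.calinski_harabasz_score); the toy data are illustrative:

```python
import numpy as np

def calinski_harabasz(X, labels):
    """CHI = [tr(B) / (K - 1)] / [tr(W) / (N - K)]."""
    N, ks = len(X), np.unique(labels)
    K, c0 = len(ks), X.mean(axis=0)
    # trace of B: weighted squared distances of cluster centroids to the center
    tr_B = sum(np.sum(labels == k) * np.sum((X[labels == k].mean(axis=0) - c0) ** 2)
               for k in ks)
    # trace of W: squared distances of samples to their own cluster centroid
    tr_W = sum(np.sum((X[labels == k] - X[labels == k].mean(axis=0)) ** 2)
               for k in ks)
    return float((tr_B / (K - 1)) / (tr_W / (N - K)))

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])
chi = calinski_harabasz(X, labels)  # dense, well-separated clusters -> large CHI
```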
3.2.2 DB Index
Davies-Bouldin Index (DBI) (周志华, 2016):
\[\mathrm{DBI} = \frac{1}{K} \sum_{k=1}^{K} \max_{s \neq k} \left( \frac{\mathrm{avg}(\mathcal{C}_k) + \mathrm{avg}(\mathcal{C}_s)}{d_{\mathrm{cen}}(\mathcal{C}_k, \mathcal{C}_s)} \right)\]
An alternative formulation (Wikipedia, 2022; scikit-learn, 2022):
\[\mathrm{DBI} = \frac{1}{K} \sum_{k=1}^{K} \max_{s \neq k} R(\mathcal{C}_k, \mathcal{C}_s)\]
- where \(R(\mathcal{C}_k, \mathcal{C}_s)\) measures the relation between the two clusters \(\mathcal{C}_k\) and \(\mathcal{C}_s\); it is non-negative and symmetric, i.e., \(R(\mathcal{C}_k, \mathcal{C}_s) = R(\mathcal{C}_s, \mathcal{C}_k) \geq 0\):
\[R(\mathcal{C}_k, \mathcal{C}_s) = \frac{S(\mathcal{C}_k) + S(\mathcal{C}_s)} {d_{\text{cen}}(\mathcal{C}_k, \mathcal{C}_s)} \]
- where \(S(\mathcal{C}_k)\) is the average distance of all samples in cluster \(\mathcal{C}_k\) to its centroid \(\mathrm{cen}(\mathcal{C}_k)\):
\[S(\mathcal{C}_k) = \frac{1}{|\mathcal{C}_k|} \sum_{\boldsymbol{x}_i \in \mathcal{C}_k} \text{dist} \big( \boldsymbol{x}_i, \mathrm{cen} (\mathcal{C}_k) \big) \]
- \(d_{\text{cen}}(\mathcal{C}_k, \mathcal{C}_s)\) is the distance between the two cluster centroids:
\[d_{\mathrm{cen}} (\mathcal{C}_k, \mathcal{C}_s) = d_{\mathrm{cen}} (\bar{\boldsymbol{x}}_k, \bar{\boldsymbol{x}}_s) \]
Properties:
- Lower DBI values indicate better-separated clusters; zero is the lowest possible score.
- The DBI is generally higher for convex clusters than for other concepts of clusters, such as density-based clusters like those obtained from DBSCAN.
- The use of centroid distance limits the distance metric to Euclidean space.
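A sketch of the centroid-based formulation (intended to match sklearn.metrics.davies_bouldin_score; the toy data are illustrative):

```python
import numpy as np

def davies_bouldin(X, labels):
    """DBI = (1/K) * sum_k max_{s != k} (S_k + S_s) / d(cen_k, cen_s)."""
    ks = np.unique(labels)
    cen = {k: X[labels == k].mean(axis=0) for k in ks}
    # S_k: mean distance of the samples in cluster k to its centroid
    S = {k: float(np.mean(np.linalg.norm(X[labels == k] - cen[k], axis=1)))
         for k in ks}
    total = 0.0
    for k in ks:
        total += max((S[k] + S[s]) / float(np.linalg.norm(cen[k] - cen[s]))
                     for s in ks if s != k)
    return total / len(ks)

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])
dbi = davies_bouldin(X, labels)  # compact, distant clusters -> small DBI
```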
3.2.3 Dunn Index
Dunn Index (DI) for cluster \(\mathcal{C}_k\) (wikipedia, Dunn index, 2023):
\[\mathrm{DI}_k = \min_{j \neq k} \left( \frac{\delta(\mathcal{C}_k, \mathcal{C}_j)}{\max_{1 \leq l \leq K} \Delta(\mathcal{C}_l)} \right)\]
where \(\delta( \mathcal{C}_{i}, \mathcal{C}_{j})\) indicates the inter-cluster distance between \(\mathcal{C}_{i}\) and \(\mathcal{C}_{j}\), and \(\Delta (\mathcal{C}_k)\) indicates the intra-cluster distance of cluster \(\mathcal{C}_k\).
The DI over all clusters (周志华, 2016, pp. 199 & jqmviegas, 2023):
\[\mathrm{DI} = \min_{1 \leq k \leq K} \mathrm{DI}_k = \min_{1 \leq k \leq K} \left\{ \min_{j \neq k} \left( \frac{\delta(\mathcal{C}_k, \mathcal{C}_j)}{\max_{1 \leq l \leq K} \Delta(\mathcal{C}_l)} \right) \right\}\]
- Larger DI values are better.
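A NumPy sketch using the minimum point-to-point distance for \(\delta\) and the cluster diameter for \(\Delta\) (one common choice among several; the toy data are illustrative):

```python
import numpy as np

def dunn_index(X, labels):
    """Minimum inter-cluster distance divided by the maximum cluster diameter."""
    ks = np.unique(labels)

    def delta(k, s):
        # inter-cluster distance: closest pair of points across the two clusters
        Dk, Ds = X[labels == k], X[labels == s]
        return np.linalg.norm(Dk[:, None] - Ds[None, :], axis=-1).min()

    def diam(k):
        # intra-cluster distance: the cluster diameter
        Dk = X[labels == k]
        return np.linalg.norm(Dk[:, None] - Dk[None, :], axis=-1).max()

    max_diam = max(diam(k) for k in ks)
    return float(min(delta(k, s) for k in ks for s in ks if s < k) / max_diam)

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])
di = dunn_index(X, labels)  # well-separated, compact clusters -> large DI
```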
3.2.4 Silhouette Index
For each sample \(\boldsymbol{x}_i\), the silhouette value \(s(i)\) is computed as:
\[s(i) = \frac{b(i) - a(i)}{\max \{ a(i), b(i) \}}\]
- where \(a(i)\) is the mean distance between the sample \(\boldsymbol{x}_i\) and all other points in the same cluster (denoted as \(\mathcal{C}_I\));
- \(b(i)\) is the mean distance between the sample \(\boldsymbol{x}_i\) and all points in the next nearest cluster.
The overall Silhouette index for the whole dataset:
- Simple mean over all samples, implemented by sklearn.metrics.silhouette_score:
\[\mathrm{SI} = \frac{1}{N} \sum_{i} s(i) \]
- Silhouette coefficient (Kaufman et al.): the maximum of the mean \(s(i)\) over the entire dataset, taken across candidate numbers of clusters:
\[\mathrm{SC} = \max_{k} \bar{s}(k), \qquad \bar{s}(k) = \frac{1}{N} \sum_{i=1}^{N} s(i)\]
- where \(\bar{s}(k)\) represents the mean \(s(i)\) over all data of the entire dataset for a specific number of clusters \(k\).
Properties:
- A higher SI score relates to a model with better-defined clusters.
- \(s(i) \in [-1, 1]\)
- \(-1\) indicates incorrect clustering and \(+1\) indicates highly dense clustering.
- Scores around \(0\) indicate overlapping clusters.
- The score is higher when clusters are dense and well separated, which relates to a standard concept of a cluster.
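A per-sample sketch of \(s(i)\), assuming every cluster has at least two points (sklearn.metrics.silhouette_score returns the mean of these values; the toy data are illustrative):

```python
import numpy as np

def silhouette_values(X, labels):
    """s(i) = (b(i) - a(i)) / max(a(i), b(i)) for every sample."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # pairwise distances
    s = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        own[i] = False  # exclude the point itself from a(i)
        a = D[i, own].mean()
        b = min(D[i, labels == k].mean()
                for k in np.unique(labels) if k != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])
s = silhouette_values(X, labels)  # values near 1: dense, well-separated clusters
```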
4. Implementation in Python
4.1 Distance measures
4.1.2 The sklearn library
Functions in the sklearn.metrics.pairwise module, site:
- `cosine_similarity(X, Y=None, dense_output=True)`
- `cosine_distances(X, Y=None, *)` and `paired_cosine_distances(X, Y)`: equal to 1 minus the cosine similarity
- `euclidean_distances(X, Y=None, *)` and `paired_euclidean_distances(X, Y)`
- `manhattan_distances(X, Y=None, *)` and `paired_manhattan_distances(X, Y)`
- `haversine_distances(X, Y=None)`: Haversine (or great-circle) distance
  - Parameters: X, Y array-like of shape (n_samples_X, 2); the first coordinate of each point is the latitude, the second is the longitude, given in radians
- `nan_euclidean_distances(X, Y=None, *)`
  - Calculates the Euclidean distance in the presence of missing values:
  - `dist(x,y) = sqrt(weight * sq. distance from present coordinates)`, where `weight = Total # of coordinates / # of present coordinates`
  - Example: \(x_1=[3, \mathrm{NA}, \mathrm{NA}, 6]\), \(x_2=[1, \mathrm{NA}, 4, 5]\), \(\mathrm{dist}(x_1, x_2)=\sqrt{\dfrac{4}{2}[(3-1)^2 + (6-5)^2]}\)
- General forms:
  - `metrics.pairwise_distances(X, Y=None, metric='euclidean')`: computes the distance matrix between X and Y
  - `metrics.pairwise.paired_distances(X, Y, *, metric='euclidean')`: computes the distances between corresponding pairs of samples in X and Y; X and Y must have the same shape
- The `metric` parameter:
  - From scikit-learn: `['cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan']`; these metrics support sparse matrix inputs. Also `['nan_euclidean']`, which does not yet support sparse matrices.
  - From scipy.spatial.distance: `['braycurtis', 'canberra', 'chebyshev', 'correlation', 'dice', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule']`
- Others: kernel distance
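The weighting rule used by `nan_euclidean_distances` can be sketched in a few lines, reproducing the worked example above:

```python
import numpy as np

def nan_euclidean(x, y):
    """Euclidean distance that skips coordinates where either value is NaN,
    rescaled by (total coordinates / present coordinates)."""
    present = ~(np.isnan(x) | np.isnan(y))
    weight = len(x) / present.sum()
    return float(np.sqrt(weight * np.sum((x[present] - y[present]) ** 2)))

x1 = np.array([3.0, np.nan, np.nan, 6.0])
x2 = np.array([1.0, np.nan, 4.0, 5.0])
d = nan_euclidean(x1, x2)  # sqrt((4/2) * ((3-1)^2 + (6-5)^2)) = sqrt(10)
```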
4.1.3 The scipy library
The `cdist()` function in the scipy `spatial.distance` module:
`dist = scipy.spatial.distance.cdist(XA, XB, metric='euclidean', *)`
References
Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). "Section 14.3 Cluster Analysis" in The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer.
周志华, "第 9 章 聚类", in 机器学习, 2016.
李航, "14.1 聚类的基本概念", in 统计学习方法 (第二版).
Wilson, D. R., & Martinez, T. R. (1997). Improved heterogeneous distance functions. Journal of Artificial Intelligence Research, 6, 1-34.
Payne, T. R., & Edwards, P. (1998). Implicit feature selection with the value difference metric. In Proceedings of the European Conference on Artificial Intelligence (ECAI-98). Wiley.
scikit-learn documentation
- Clustering performance evaluation, site
Wikipedia: Rand index; Dunn index; Davies-Bouldin index; Silhouette (clustering)
