# Reading Notes on *Practical Data Analysis* by Tomasz Drabas, Chapter 4: Clustering Techniques (k-means, BIRCH, DBSCAN)

Personal study notes on Python data analysis - index of contents

·Assessing the performance of clustering methods
·Clustering data with the k-means algorithm
·Finding the optimal number of clusters for k-means
·Discovering clusters with the mean shift clustering model
·Building fuzzy clustering models with c-means
·Clustering data with hierarchical models
·Discovering potential subscribers with DBSCAN and BIRCH

4.1 Introduction

4.2 Assessing the performance of clustering methods

import numpy as np
import sklearn.metrics as mt

def printClustersSummary(data, labels, centroids):
    '''
        Helper method to automate model assessment
    '''
    print('Pseudo_F: ', pseudo_F(data, labels, centroids))
    print('Davis-Bouldin: ',
        davis_bouldin(data, labels, centroids))
    print('Silhouette score: ',
        mt.silhouette_score(data, np.array(labels),
            metric='euclidean'))

def pseudo_F(X, labels, centroids):
    '''
        The pseudo F statistic:
        pseudo F = [( [(T - PG)/(G - 1)])/( [(PG)/(n - G)])]
        The pseudo F statistic was suggested by
        Calinski and Harabasz (1974) in
        Calinski, T. and J. Harabasz. 1974.
            A dendrite method for cluster analysis.
            Commun. Stat. 3: 1-27.
            http://dx.doi.org/10.1080/03610927408827101

        We borrowed this code from
        https://github.com/scampion/scikit-learn/blob/master/
        scikits/learn/cluster/__init__.py

        However, it had an error so we altered how B is
        calculated.
    '''
    center = np.mean(X, axis=0)
    u, count = np.unique(labels, return_counts=True)

    B = np.sum([count[i] * ((cluster - center)**2)
        for i, cluster in enumerate(centroids)])

    # .as_matrix() was removed from pandas; use .values instead
    X = X.values
    W = np.sum([(x - centroids[labels[i]])**2
                 for i, x in enumerate(X)])

    k = len(centroids)
    n = len(X)

    return (B / (k-1)) / (W / (n-k))

def davis_bouldin(X, labels, centroids):
    '''
        The Davies-Bouldin statistic is an internal evaluation
        scheme for evaluating clustering algorithms. It
        encompasses the inter-cluster heterogeneity and
        intra-cluster homogeneity in one metric.

        The measure was introduced by
        Davies, D.L. and Bouldin, D.W. in 1979.
            A Cluster Separation Measure
            IEEE Transactions on Pattern Analysis and
            Machine Intelligence, PAMI-1: 2, 224--227

            http://dx.doi.org/10.1109/TPAMI.1979.4766909
    '''
    # .as_matrix() was removed from pandas; use .values instead
    distance = np.array([
        np.sqrt(np.sum((x - centroids[labels[i]])**2))
        for i, x in enumerate(X.values)])

    u, count = np.unique(labels, return_counts=True)

    Si = []

    for i, group in enumerate(u):
        Si.append(distance[labels == group].sum() / count[i])

    Mij = []

    for centroid in centroids:
        Mij.append([
            np.sqrt(np.sum((centroid - x)**2))
            for x in centroids])

    Rij = []
    for i in range(len(centroids)):
        Rij.append([
            0 if i == j
            else (Si[i] + Si[j]) / Mij[i][j]
            for j in range(len(centroids))])

    Di = [np.max(elem) for elem in Rij]

    return np.array(Di).sum() / len(centroids)

Si measures the homogeneity within a cluster, that is, the average distance of each observation from the cluster's center. Mij computes the geometric distance between cluster centers, quantifying the heterogeneity between clusters. Rij measures how well two clusters are separated, and Di picks the worst such separation. The Davies-Bouldin index is the average of the Di.
The Silhouette score (index) is another internal measure of clustering quality. We do not implement its calculation ourselves (Scikit already provides one), but we briefly describe how it is computed and what it measures. For each observation, the Silhouette score compares its average distance to the observations in the nearest other cluster against its average distance to the other points within its own cluster. In theory, the index ranges from -1 to 1; -1 means all observations sit in unsuitable clusters, and a more suitable cluster would likely be a neighboring one. Although theoretically possible, in practice you are unlikely to deal with negative Silhouette scores at all. At the other end of the spectrum, 1 represents a perfect partition, with every observation in the right cluster. A value of 0 indicates a perfect overlap between clusters: every cluster contains observations that should belong to other clusters, in equal numbers.
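As a quick illustration of those boundary values, here is a small sketch using scikit-learn's implementation on made-up toy data (not the book's dataset):

```python
import numpy as np
from sklearn.metrics import silhouette_score

# two tight, well-separated toy clusters
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
labels = np.array([0, 0, 0, 1, 1, 1])

# a clean partition scores close to 1
print(silhouette_score(X, labels, metric='euclidean'))

# deliberately scrambled labels score far lower
bad_labels = np.array([0, 1, 0, 1, 0, 1])
print(silhouette_score(X, bad_labels, metric='euclidean'))
```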

4.3 Clustering data with the k-means algorithm
The k-means clustering algorithm is probably the best-known data-mining technique for clustering vectorized data. It assigns observations to clusters based on their similarity; the deciding factor is the Euclidean distance between an observation and the nearest cluster center.
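The assignment rule can be sketched in a few lines of NumPy (hypothetical centroids and points, purely for illustration):

```python
import numpy as np

# made-up cluster centers and observations
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
points = np.array([[1.0, 2.0], [9.0, 8.0], [4.0, 4.0]])

# Euclidean distance from every point to every centroid
distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)

# each observation joins the cluster with the nearest center
labels = distances.argmin(axis=1)
print(labels)  # → [0 1 0]
```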

Scikit offers a number of clustering models in its cluster submodule. Here, we use .KMeans(...) to estimate the clustering model (file: clustering_kmeans.py):

def findClusters_kmeans(data):
    '''
        Cluster data using k-means
    '''
    # create the classifier object
    kmeans = cl.KMeans(
        n_clusters=4,
        n_jobs=-1,
        verbose=0,
        n_init=30
    )

    # fit the data
    return kmeans.fit(data)

# the file name of the dataset
r_filename = '../../Data/Chapter04/bank_contacts.csv'

# read the data
csv_read = pd.read_csv(r_filename)

# select variables
selected = csv_read[['n_duration','n_nr_employed',
    'prev_ctc_outcome_success','n_euribor3m',
    'n_cons_conf_idx','n_age','month_oct',
    'n_cons_price_idx','edu_university_degree','n_pdays',
    'dow_mon','job_student','job_technician',
    'job_housemaid','edu_basic_6y']]

Tips:

/* The method findClusters_kmeans took 2.32 sec to run.
D:\Java2018\practicalDataAnalysis\helper.py:142: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
X = X.as_matrix()
Pseudo_F:  11515.72135543927
D:\Java2018\practicalDataAnalysis\helper.py:168: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
for i, x in enumerate(X.as_matrix())])
Davis-Bouldin:  1.2627990546726073
Silhouette score:  0.3198322783627837
*/

/*
The method findClusters_kmeans took 2.32 sec to run.
Pseudo_F:  11349.586839983096
Davis-Bouldin:  1.2608899499519972
*/

Scikit's .KMeans(...) method takes a number of parameters. The n_clusters parameter defines how many clusters to expect in the data and how many to return. In Section 4.4 we will write a loop that finds the optimal number of clusters for the k-means algorithm.
The n_jobs parameter is the number of jobs to run in parallel on your machine; passing -1 sets the number of parallel jobs to the number of processor cores. You can also pass an explicit integer, such as 8, to fix the number of jobs.
The verbose parameter controls how much you see during the estimation phase; setting it to 1 prints all the details of the estimation.
The n_init parameter controls how many models to estimate. Each run of the k-means algorithm randomly selects initial centers for the clusters and iteratively adjusts them until the separation between clusters and the homogeneity within clusters are as good as possible. The .KMeans(...) method builds as many models as n_init specifies, each with different initial conditions (randomized seeds for the centers), and then selects the one that performs best on the criterion.
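The effect of n_init can be seen on toy data (a sketch with made-up blobs; note that the best of 30 random starts can never have a worse inertia than a single start with the same seed):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# three tight toy blobs centered at (0,0), (5,5), (10,10)
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2))
               for loc in (0.0, 5.0, 10.0)])

# a single random start can land in a poor local optimum;
# n_init=30 keeps the best (lowest-inertia) of 30 starts
best = KMeans(n_clusters=3, n_init=30, random_state=0).fit(X)
single = KMeans(n_clusters=3, n_init=1, random_state=0).fit(X)

print(best.inertia_ <= single.inertia_)
```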

Scikit is of course not the only way to estimate a k-means model; we can also use SciPy (file: clustering_kmeans_alternative.py):

def findClusters_kmeans(data):
    '''
        Cluster data using k-means
    '''
    # whiten the observations
    data_w = vq.whiten(data)

    # create the classifier object
    kmeans, labels = vq.kmeans2(
        data_w,
        k=4,
        iter=30
    )

    # fit the data
    return kmeans, labels

# the file name of the dataset
r_filename = '../../Data/Chapter04/bank_contacts.csv'

# read the data
csv_read = pd.read_csv(r_filename)

# select variables
selected = csv_read[['n_duration','n_nr_employed',
        'prev_ctc_outcome_success','n_euribor3m',
        'n_cons_conf_idx','n_age','month_oct',
        'n_cons_price_idx','edu_university_degree','n_pdays',
        'dow_mon','job_student','job_technician',
        'job_housemaid','edu_basic_6y']]

# cluster the data
# (.as_matrix() was removed from pandas; use .values instead)
centroids, labels = findClusters_kmeans(selected.values)

hlp.printClustersSummary(selected, labels, centroids)

4.4 Finding the optimal number of clusters for k-means

# find the optimal number of clusters; that is, the number of
# clusters that minimizes the Davis-Bouldin criterion
optimal_n_clusters = findOptimalClusterNumber(selected)

The findOptimalClusterNumber(...) method is defined as follows:

def findOptimalClusterNumber(
        data,
        keep_going = 1,
        max_iter = 30
    ):
    '''
        A method that iteratively searches for the
        number of clusters that minimizes the Davies-Bouldin
        criterion
    '''
    # the object to hold measures
    measures = [666]

    # starting point
    n_clusters = 2

    # counter for the number of iterations past the local
    # minimum
    keep_going_cnt = 0
    stop = False   # flag for the algorithm stop

    def checkMinimum(keep_going):
        '''
            A method to check if minimum found
        '''
        # keep_going_cnt lives in the enclosing function's
        # scope, so it must be declared nonlocal (not global)
        nonlocal keep_going_cnt

        # if the new measure is greater than for one of the
        # previous runs
        if measures[-1] > np.min(measures[:-1]):
            # increase the counter
            keep_going_cnt += 1

            # if the counter is bigger than allowed
            if keep_going_cnt > keep_going:
                # the minimum is found
                return True
        # else, reset the counter and return False
        else:
            keep_going_cnt = 0

        return False

    # main loop
    # loop until minimum found or maximum iterations reached
    while not stop and n_clusters < (max_iter + 2):
        # cluster the data
        cluster = findClusters_kmeans(data, n_clusters)

        # assess the clusters effectiveness
        labels = cluster.labels_
        centroids = cluster.cluster_centers_

        # store the measures
        measures.append(
            hlp.davis_bouldin(data, labels, centroids)
        )

        # check if minimum found
        stop = checkMinimum(keep_going)

        # increase the iteration
        n_clusters += 1

    # once found -- return the index of the minimum
    return measures.index(np.min(measures)) + 1

The max_iter parameter specifies the maximum number of models to build; our method starts by building a k-means model with n_clusters=2 and then loops until it finds a minimum or reaches the maximum number of iterations.

The measures list stores the Davies-Bouldin index of each estimated model; since our goal is to find its minimum, we seed the list with a large number so that it never ends up being our minimum.
n_clusters defines the starting point: the first k-means model splits the data into two clusters.
keep_going_cnt specifies how many more iterations to run once the Davies-Bouldin index exceeds the previous minimum. Using the figure shown earlier as an example, our method does not terminate at a local minimum: even though the Davies-Bouldin values at iterations 10 and 13 rose above those at iterations 9 and 12 (respectively), the models with 11 and 14 clusters reached lower values still. In this example, we stop at 15 and 16 clusters, where the Davies-Bouldin index is larger.
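The stopping logic can be illustrated in isolation on a made-up sequence of criterion values (pure Python, independent of the clustering code; the function name is our own):

```python
def find_minimum_with_patience(values, keep_going=1):
    '''Scan values left to right and stop once the criterion has
    exceeded the running minimum for more than keep_going
    consecutive steps; return the index of the smallest value seen.'''
    seen = []
    cnt = 0
    for v in values:
        seen.append(v)
        if len(seen) > 1 and seen[-1] > min(seen[:-1]):
            cnt += 1
            if cnt > keep_going:
                break          # patience exhausted
        else:
            cnt = 0            # new minimum resets the counter
    return seen.index(min(seen))

# made-up Davies-Bouldin values for n_clusters = 2, 3, 4, ...
db = [1.9, 1.4, 1.6, 1.2, 1.5, 1.7, 1.1]
idx = find_minimum_with_patience(db, keep_going=1)
print(2 + idx)  # → 5: the dip at 1.6 is survived, but two rises in a row stop the scan
```

Note that the scan never reaches the 1.1 at the end: with keep_going=1, two consecutive values above the running minimum terminate the search, exactly like the checkMinimum(...) helper above.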

# cluster the data
cluster = findClusters_kmeans(data, n_clusters)

def checkMinimum(keep_going):
    '''
        A method to check if minimum found
    '''
    # keep_going_cnt lives in the enclosing function's
    # scope, so it must be declared nonlocal (not global)
    nonlocal keep_going_cnt

    # if the new measure is greater than for one of the
    # previous runs
    if measures[-1] > np.min(measures[:-1]):
        # increase the counter
        keep_going_cnt += 1

        # if the counter is bigger than allowed
        if keep_going_cnt > keep_going:
            # the minimum is found
            return True
    # else, reset the counter and return False
    else:
        keep_going_cnt = 0

    return False

# cluster the data
cluster = findClusters_kmeans(selected, optimal_n_clusters)

# assess the clusters effectiveness
labels = cluster.labels_
centroids = cluster.cluster_centers_

hlp.printClustersSummary(selected, labels, centroids)
/*

The method findClusters_kmeans took 1.97 sec to run.
The method findClusters_kmeans took 0.56 sec to run.
The method findClusters_kmeans took 0.76 sec to run.
The method findClusters_kmeans took 0.90 sec to run.
The method findClusters_kmeans took 1.01 sec to run.
The method findClusters_kmeans took 1.16 sec to run.
The method findClusters_kmeans took 1.31 sec to run.
The method findClusters_kmeans took 1.34 sec to run.
The method findClusters_kmeans took 1.55 sec to run.
The method findClusters_kmeans took 1.69 sec to run.
The method findClusters_kmeans took 1.82 sec to run.
The method findClusters_kmeans took 2.08 sec to run.
The method findClusters_kmeans took 2.11 sec to run.
The method findClusters_kmeans took 2.40 sec to run.
The method findClusters_kmeans took 2.49 sec to run.
The method findClusters_kmeans took 3.03 sec to run.
The method findClusters_kmeans took 2.98 sec to run.
The method findClusters_kmeans took 3.05 sec to run.
Optimal number of clusters: 17
The method findClusters_kmeans took 2.96 sec to run.
Pseudo_F:  11493.774263351304
Davis-Bouldin:  0.9516597767023216
*/

def plotInteractions(data, n_clusters):
    '''
        Plot the interactions between variables
    '''
    # cluster the data
    cluster = findClusters_kmeans(data, n_clusters)

    # append the labels to the dataset for ease of plotting
    data['clus'] = cluster.labels_

    # prepare the plot (pairplot of the passed-in data)
    ax = sns.pairplot(data, hue='clus')

    # and save the figure
    ax.savefig(
        '../../Data/Chapter04/k_means_{0}_clusters.png' \
        .format(n_clusters)
    )

The result for 14 clusters (limited to 3 features for readability) looks similar to this:

Tips:

/*
The method findClusters_kmeans took 3.43 sec to run.
D:\Java2018\practicalDataAnalysis\Codes\Chapter04\clustering_kmeans_search_alternative.py:37: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
data['clus'] = cluster.labels_
D:\tools\Python37\lib\site-packages\statsmodels\nonparametric\kde.py:487: RuntimeWarning: invalid value encountered in true_divide
binned = fast_linbin(X, a, b, gridsize) / (delta * nobs)
D:\tools\Python37\lib\site-packages\statsmodels\nonparametric\kdetools.py:34: RuntimeWarning: invalid value encountered in double_scalars
FAC1 = 2*(np.pi*bw/RANGE)**2
*/


/*
Compare these two access methods:
In [340]: dfmi['one']['second']
Out[340]:
0    b
1    f
2    j
3    n
Name: second, dtype: object
In [341]: dfmi.loc[:, ('one', 'second')]
Out[341]:
0    b
1    f
2    j
3    n
Name: (one, second), dtype: object
*/

4.5 Discovering clusters with the mean shift clustering model

The mean shift model is a method that looks for centers (or maxima) of density. Unlike k-means, it does not require specifying the number of clusters up front: the model returns as many clusters as there are density centers found in the data.

def findClusters_meanShift(data):
    '''
        Cluster data using the Mean Shift method
    '''
    bandwidth = cl.estimate_bandwidth(data,
        quantile=0.25, n_samples=500)

    # create the classifier object
    meanShift = cl.MeanShift(
        bandwidth=bandwidth,
        bin_seeding=True
    )

    # fit the data
    return meanShift.fit(data)
/*
The method findClusters_meanShift took 20.53 sec to run.
Pseudo_F:  1512.7080324697126
Davis-Bouldin:  1.736359716302394
Silhouette score:  0.2172934930829158
*/

The estimate_bandwidth(...) method requires at least the data as a parameter.

The quantile parameter determines where to cut the sample of kernels passed to the .MeanShift(...) method; a quantile of 0.5 means the median.
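A small sketch on toy blobs (made-up data) shows how quantile drives the estimated bandwidth, and with it the kernel width that MeanShift uses; larger quantiles yield larger bandwidths:

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

rng = np.random.RandomState(1)
# two toy blobs, well separated
X = np.vstack([rng.normal(0, 0.5, size=(100, 2)),
               rng.normal(6, 0.5, size=(100, 2))])

# larger quantile -> wider kernel -> fewer, coarser clusters
for q in (0.1, 0.3, 0.5):
    bw = estimate_bandwidth(X, quantile=q, n_samples=100)
    ms = MeanShift(bandwidth=bw, bin_seeding=True).fit(X)
    print(q, round(bw, 3), len(ms.cluster_centers_))
```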

4.6 Building fuzzy clustering models with c-means

The k-means and mean shift clustering algorithms put observations into distinct, disjoint clusters: an observation can belong to one, and only one, cluster of similar samples. For discrete, separable datasets this may be right, but if the data overlap, placing them into a single bucket becomes difficult. After all, our world is not just black and white; we can see countless colors.
The c-means clustering model allows each observation to belong to more than one cluster, each membership carrying a weight: for every observation, the weights of all the clusters it belongs to must sum to 1.
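A single membership-update step of fuzzy c-means can be sketched with NumPy to show why one observation's weights always sum to 1 (toy distances, m = 2; this is the standard update formula, not the book's code):

```python
import numpy as np

# distances from one observation to each of three cluster centers
d = np.array([1.0, 2.0, 4.0])
m = 2  # exponentiation factor

# fuzzy c-means membership update:
# u_j = 1 / sum_k (d_j / d_k)^(2/(m-1))
u = 1.0 / np.sum((d[:, None] / d[None, :]) ** (2.0 / (m - 1)), axis=1)

print(u)        # the closest center gets the largest weight
print(u.sum())  # the weights sum to 1 (up to float rounding)
```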

>pip install Scikit-Fuzzy

/* Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
.......
Successfully built Scikit-Fuzzy
Installing collected packages: decorator, networkx, Scikit-Fuzzy
Successfully installed Scikit-Fuzzy-0.4.2 decorator-4.4.1 networkx-2.4
FINISHED
*/

Scikit-Fuzzy provides an algorithm for estimating c-means. The invocation differs slightly from what we have seen before (file: clustering_cmeans.py):

import skfuzzy.cluster as cl
import numpy as np

@hlp.timeit
def findClusters_cmeans(data):
    '''
        Cluster data using fuzzy c-means clustering
        algorithm
    '''
    # create the classifier object
    return cl.cmeans(
        data,
        c = 5,          # number of clusters
        m = 2,          # exponentiation factor

        # stopping criteria
        error = 0.01,
        maxiter = 300
    )

The c parameter specifies how many clusters to apply, and the m parameter specifies the exponentiation factor applied to the membership function at each iteration.

/*
The method findClusters_cmeans took 2.25 sec to run.
[[0.20158559 0.09751069 0.07842359 ... 0.16641284 0.16997486 0.18780077]
[0.14041632 0.05272408 0.04176255 ... 0.25766411 0.13334598 0.13312636]
[0.15019893 0.05824498 0.04623189 ... 0.14150986 0.2692789  0.14128872]
[0.13702899 0.05074051 0.04020164 ... 0.28432326 0.27960719 0.38820815]
[0.37077018 0.74077974 0.79338032 ... 0.15008993 0.14779308 0.14957599]]
Pseudo_F:  8340.883112129994
Davis-Bouldin:  1.3062853046748828
Silhouette score:  0.35108693925427953
*/

# assess the clusters effectiveness
labels = [
    np.argmax(elem) for elem in u.transpose()
]

4.7 Clustering data with hierarchical models

>pip install pylab

/* Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
......
Successfully built pyusb PyVISA
Installing collected packages: pyusb, PyVISA, PyVISA-py, pylab
Successfully installed PyVISA-1.10.1 PyVISA-py-0.3.1 pylab-0.0.2 pyusb-1.0.2
FINISHED */

import pylab as pl

def findClusters_link(data):
    '''
        Cluster data using single linkage hierarchical
        clustering
    '''
    # return the linkage object
    return cl.linkage(data, method='single')
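A minimal end-to-end sketch of single-linkage clustering with SciPy on toy data (fcluster, used here, turns the linkage matrix into flat labels; this is not the book's code):

```python
import numpy as np
import scipy.cluster.hierarchy as sch

# three close toy points and two far-away ones
X = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0],
              [8.0, 8.0], [8.1, 8.2]])

# build the merge tree with single linkage (nearest-neighbor distance)
Z = sch.linkage(X, method='single')

# cut the tree into flat clusters at distance 1.0
labels = sch.fcluster(Z, t=1.0, criterion='distance')
print(labels)  # two groups: the first three points vs. the last two
```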

####import pylab as pl
import matplotlib.pylab as pl

# cluster the data
cluster = findClusters_ward(selected)

# plot the clusters
fig  = pl.figure(figsize=(16,9))
ax   = fig.add_axes([0.1, 0.1, 0.8, 0.8])
dend = cl.dendrogram(cluster, truncate_mode='level', p=20)
ax.set_xticks([])
ax.set_yticks([])

# save the figure
fig.savefig(
    '../../Data/Chapter04/hierarchical_dendrogram.png',
    dpi=300
)

def findClusters_ward(data):
    '''
        Cluster data using Ward's hierarchical clustering
    '''
    # create the classifier object
    ward = ml.MFastHCluster(
        method='ward'
    )

    # fit the data
    ward.linkage(data.values)

    return ward

# the file name of the dataset
r_filename = '../../Data/Chapter04/bank_contacts.csv'

# read the data
csv_read = pd.read_csv(r_filename)

# select variables
selected = csv_read[['n_duration','n_nr_employed',
        'prev_ctc_outcome_success','n_euribor3m',
        'n_cons_conf_idx','n_age','month_oct',
        'n_cons_price_idx','edu_university_degree','n_pdays',
        'dow_mon','job_student','job_technician',
        'job_housemaid','edu_basic_6y']]

# cluster the data
cluster = findClusters_ward(selected)

# assess the clusters effectiveness
labels = cluster.cut(20)
centroids = hlp.getCentroids(selected, labels)

def getCentroids(data, labels):
    '''
        Method to get the centroids of clusters in clustering
        models that do not return the centroids explicitly
    '''
    # create a copy of the data
    data = data.copy()

    # apply labels
    data['predicted'] = labels

    # and return the centroids
    return np.array(data.groupby('predicted').agg('mean'))
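The helper simply averages the features per predicted label, which we can check on a tiny made-up frame:

```python
import numpy as np
import pandas as pd

def getCentroids(data, labels):
    '''Get centroids for models that do not return them explicitly.'''
    data = data.copy()
    data['predicted'] = labels
    # mean of every feature within each label group
    return np.array(data.groupby('predicted').agg('mean'))

df = pd.DataFrame({'x': [0.0, 2.0, 10.0, 12.0],
                   'y': [0.0, 2.0, 10.0, 12.0]})
centroids = getCentroids(df, [0, 0, 1, 1])
print(centroids)  # → [[ 1.  1.] [11. 11.]]
```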

4.8 Discovering potential subscribers with DBSCAN and BIRCH

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) were among the first algorithms able to handle noisy data effectively. Noise here means points that seem out of place compared with the rest of the dataset; DBSCAN puts such observations into an unclassified bucket, while BIRCH treats them as outliers and removes them.
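In scikit-learn, DBSCAN's unclassified bucket is marked with the label -1, which is why the recipe below shifts the labels by +1; a toy illustration (made-up points, not the book's data):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# a tight toy cluster plus one far-away point (the "noise")
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
              [10.0, 10.0]])

db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)  # → [ 0  0  0  0 -1]: the outlier gets the label -1
```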

import sklearn.cluster as cl
import numpy as np

@hlp.timeit
def findClusters_DBSCAN(data):
    '''
        Cluster data using the DBSCAN algorithm
    '''
    # create the classifier object
    dbscan = cl.DBSCAN(eps=1.2, min_samples=200)

    # fit the data
    return dbscan.fit(data)

# cluster the data
cluster = findClusters_DBSCAN(selected)

# assess the clusters effectiveness
# (shift by +1 so the noise bucket, labeled -1, becomes 0)
labels = cluster.labels_ + 1
centroids = hlp.getCentroids(selected, labels)

import sklearn.cluster as cl
import numpy as np

@hlp.timeit
def findClusters_Birch(data):
    '''
        Cluster data using the BIRCH algorithm
    '''
    # create the classifier object
    birch = cl.Birch(
        branching_factor=100,
        n_clusters=4,
        compute_labels=True,
        copy=True
    )

    # fit the data
    return birch.fit(data)

# cluster the data
cluster = findClusters_Birch(selected)

# assess the clusters effectiveness
labels = cluster.labels_
centroids = hlp.getCentroids(selected, labels)
/*
The method findClusters_Birch took 0.99 sec to run.
Pseudo_F:  1733.8811847551842
Davis-Bouldin:  1.2605820279777722
*/

The BIRCH algorithm discards outliers outright. If you do not want the method to modify the original data, set the copy parameter to True: this makes it work on a copy of the data.

The DBSCAN paper: https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf
The BIRCH paper: http://www.cs.sfu.ca/CourseCentral/459/han/papers/zhang96.pdf

End of Chapter 4.


posted @ 2020-03-16 15:06 by 邀月