Data Mining Algorithm Study Notes (1) Clustering: The K-means Algorithm

------ Not original; reposted for study purposes from https://zhuanlan.zhihu.com/p/77040610 ------

1. Definition

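In the K-means algorithm, K is the number of clusters, and each cluster is represented by the mean (center) of its data points.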

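2. Algorithm idea

　　Procedure:

　　　　Initialize K cluster centers by picking K data points at random;

　　　　Compute the distance from every point to each center and assign each point to its nearest center;

　　　　From the resulting clusters, compute the new cluster centers, and at the same time compute the new sum of distances J;

　　　　Repeat the previous two steps until the change in J is no larger than the tolerance, or the maximum number of iterations is reached.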
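To make the quantity J concrete, here is one way to write the procedure as formulas. This is a sketch that follows the NumPy code later in this post, which sums plain Euclidean distances. The symbols $x_i$ (data point), $\mu_k$ (center of cluster $k$), $c_i$ (label of point $i$), $C_k$ (index set of cluster $k$) and the iteration index $t$ are notation chosen here for illustration; tol and N are the tolerance and iteration limit used by that code.

```latex
\begin{aligned}
c_i   &= \arg\min_{k \in \{1,\dots,K\}} \lVert x_i - \mu_k \rVert_2
        && \text{(assign each point to its nearest center)} \\
\mu_k &= \frac{1}{|C_k|} \sum_{i \in C_k} x_i, \qquad C_k = \{\, i : c_i = k \,\}
        && \text{(recompute each center as the cluster mean)} \\
J     &= \sum_{i=1}^{n} \lVert x_i - \mu_{c_i} \rVert_2
        && \text{(sum of distances to the assigned centers)} \\
\text{stop when } & \lvert J^{(t)} - J^{(t-1)} \rvert \le \mathrm{tol} \ \text{ or } \ t = N
\end{aligned}
```

Note that the more common textbook objective, and the `inertia_` attribute of scikit-learn's `KMeans`, sums squared distances rather than plain distances.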

3. Algorithm implementation

　　K-means implemented with NumPy

import random
import time

import matplotlib.pyplot as plt
import numpy as np
from scipy.spatial.distance import cdist


# Compute the Euclidean distance matrix between the rows of A and the rows of B
def compute_distances_no_loops(A, B):
    return cdist(A, B, metric='euclidean')


# Plot the clusters. Colors repeat once there are more clusters than entries
# in scatterColors, so extend that list for larger K.
def plotFeature(data, labels_):
    fig = plt.figure()
    scatterColors = ['black', 'blue', 'green', 'yellow', 'red', 'purple', 'orange', 'brown', '#BC8F8F', '#8B4513', '#FFF5EE']
    ax = fig.add_subplot(111)
    # Iterate over the labels that actually occur
    for i in np.unique(labels_):
        colorStyle = scatterColors[i % len(scatterColors)]
        subCluster = data[labels_ == i]
        ax.scatter(subCluster[:, 0], subCluster[:, 1], c=colorStyle, s=20)
    plt.show()


# K-means clustering
# data: the data to cluster, one sample per row
# K:    the number of clusters
# tol:  the clustering tolerance, i.e. the threshold on ΔJ
# N:    the maximum number of iterations
def K_means(data, K, tol, N):
    # Convert to float so that centroid means are not truncated for integer input
    data = np.asarray(data, dtype=float)
    # Total number of samples
    n = data.shape[0]
    # Randomly pick K of the n samples as the initial center vectors;
    # centerId holds the row indices of those samples
    centerId = random.sample(range(n), K)
    # The K initial center vectors (fancy indexing returns a copy)
    centerPoints = data[centerId]
    # Distance matrix from data to centerPoints;
    # dist[i, :] holds the distances from point i to the K centers
    dist = compute_distances_no_loops(data, centerPoints)
    # axis=1 finds the index of the smallest value in each row
    labels = np.argmin(dist, axis=1)
    # Initialize the old J
    oldVar = -0.0001
    # data - centerPoints[labels] is each vector minus its assigned center;
    # the square root of the row-wise sum of squares is the distance to that center
    # Compute the new J
    newVar = np.sum(np.sqrt(np.sum(np.power(data - centerPoints[labels], 2), axis=1)))
    # Iteration counter
    count = 0
    # Keep iterating while ΔJ exceeds the tolerance and the iteration limit
    # has not been reached; otherwise stop clustering
    while count < N and abs(newVar - oldVar) > tol:
        oldVar = newVar
        for i in range(K):
            members = data[labels == i]
            # Recompute the center of each cluster; keep the old center if the
            # cluster is empty, because the mean of no points is NaN
            if len(members) > 0:
                centerPoints[i] = members.mean(axis=0)
        # Recompute the distance matrix
        dist = compute_distances_no_loops(data, centerPoints)
        # Reassign the points
        labels = np.argmin(dist, axis=1)
        # Recompute the new J
        newVar = np.sum(np.sqrt(np.sum(np.power(data - centerPoints[labels], 2), axis=1)))
        # Increment the iteration counter
        count += 1
    # Return the cluster labels and the center coordinates
    return labels, centerPoints


# time.clock() was removed in Python 3.8; time.perf_counter() replaces it
starttime = time.perf_counter()
data = np.loadtxt("cluster.csv", delimiter=",")
labels, _ = K_means(data, 3, 0.01, 100)
endtime = time.perf_counter()
print(endtime - starttime)
plotFeature(data, labels)
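The script above loads a file named cluster.csv, which is not included in this post. The sketch below shows one way to produce a file in the format the code appears to expect: comma-separated, one 2-D point per row. The three blob centers, the noise scale, and the temporary output path are all hypothetical values chosen for illustration, not the original data.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(0)

# Three hypothetical 2-D blobs of 50 points each (illustrative values only)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Write to a temporary directory so no existing cluster.csv is overwritten
path = os.path.join(tempfile.mkdtemp(), "cluster.csv")
np.savetxt(path, points, delimiter=",")

# Read it back the same way the script does
data = np.loadtxt(path, delimiter=",")
print(data.shape)  # (150, 2)
```

To try the script on this data, write the file into the working directory instead, or pass `path` to `np.loadtxt`.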

　　K-means implemented with scikit-learn

import numpy as np
from sklearn.cluster import KMeans

# Load the data
data = np.loadtxt("cluster.csv", delimiter=",")
# Build a clusterer with 3 clusters
estimator = KMeans(n_clusters=3, max_iter=100, tol=0.001)
# Run the clustering
estimator.fit(data)
# Get the cluster labels
label_pred = estimator.labels_
# Get the cluster centers
centroids = estimator.cluster_centers_
# Get the total of the clustering criterion (sum of squared distances to the nearest center)
inertia = estimator.inertia_
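Since cluster.csv is not available here, the following sketch runs the same estimator on synthetic data, then shows two things the listing above does not: assigning new points to the learned clusters with `predict`, and checking that `inertia_` matches the sum of squared distances to the assigned centers. The blob centers, the new points, and the `n_init` and `random_state` settings are assumptions made for reproducibility, not part of the original post.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Three well-separated hypothetical blobs of 50 points each
centers = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
data = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# n_init and random_state are set explicitly here only for reproducibility
estimator = KMeans(n_clusters=3, n_init=10, max_iter=100, tol=0.001, random_state=0)
estimator.fit(data)

# Assign previously unseen points to the nearest learned center
new_points = np.array([[0.2, -0.1], [9.8, 10.3]])
new_labels = estimator.predict(new_points)

# Recompute the criterion by hand: sum of squared distances to assigned centers
manual_inertia = np.sum((data - estimator.cluster_centers_[estimator.labels_]) ** 2)

print(new_labels)
print(np.isclose(manual_inertia, estimator.inertia_))
```

The numeric label assigned to each blob is arbitrary, so only compare labels with each other rather than expecting particular values.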

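5. Pros and cons of K-means

　　Pros:

　　　　The algorithm is simple and easy to implement;

　　Cons:

　　　　The user must specify the number of clusters K in advance;

　　　　**Sensitive to outliers: a single extremely large or extremely small value shifts the mean;**

　　　　The clustering result is sensitive to the choice of initial cluster centers;

　　　　Prone to getting stuck in a local optimum;

　　　　Can only find spherical clusters.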
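The sensitivity to outliers can be seen with a few numbers. In this minimal sketch, with made-up points, one extreme value drags the cluster mean far away from the other three points; the median is printed only for contrast.

```python
import numpy as np

# Three nearby points forming a tight hypothetical cluster
cluster = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
# The same cluster with a single extreme point added
with_outlier = np.vstack([cluster, [[100.0, 100.0]]])

mean_clean = cluster.mean(axis=0)
mean_outlier = with_outlier.mean(axis=0)
median_outlier = np.median(with_outlier, axis=0)

print(mean_clean)      # [2. 2.]
print(mean_outlier)    # [26.5 26.5]
print(median_outlier)  # [2.5 2.5]
```

Because K-means recomputes every center as a mean, a center can end up in a region that contains no typical points at all.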

posted @ 2023-03-22 19:43 SophiaShen1114
