Fork me on GitHub

K-Means 聚类算法

K-Means 概念定义:

K-Means 是一种基于距离的排他的聚类划分方法。

上面的 K-Means 描述中包含了几个概念:

  • 聚类(Clustering):K-Means 是一种聚类分析(Cluster Analysis)方法。聚类就是将数据对象分组成为多个类或者簇 (Cluster),使得在同一个簇中的对象之间具有较高的相似度,而不同簇中的对象差别较大。
  • 划分(Partitioning):聚类可以基于划分,也可以基于分层。划分即将对象划分成不同的簇,而分层是将对象分等级。
  • 排他(Exclusive):对于一个数据对象,只能被划分到一个簇中。如果一个数据对象可以被划分到多个簇中,则称为可重叠的(Overlapping)。
  • 距离(Distance):基于距离的聚类是将距离近的相似的对象聚在一起。基于概率分布模型的聚类是在一组对象中,找到能符合特定分布模型的对象的集合,他们不一定是距离最近的或者最相似的,而是能完美的呈现出概率分布模型所描述的模型。

K-Means 问题描述:

给定一个 n 个对象的数据集,它可以构建数据的 k 个划分,每个划分就是一个簇,并且 k ≤ n。同时还需满足:

  1. 每个组至少包含一个对象。
  2. 每个对象必须属于且仅属于一个簇。

Simply speaking, K-Means clustering is an algorithm to classify or to group your objects based on attributes/features, into K number of groups. K is a positive integer number. The grouping is done by minimizing the sum of squares of distances between data and the corresponding cluster centroid. Thus, the purpose of K-means clustering is to classify the data.

例如,有如下包含 10 条数据的集合。集合中每项描述了一个人的身高(Height: inches)和体重(Weight: kilograms)。

Height Weight
-------------
(73.0, 72.6) 
(61.0, 54.4) 
(67.0, 99.9) 
(68.0, 97.3) 
(62.0, 59.0) 
(75.0, 81.6) 
(74.0, 77.1) 
(66.0, 97.3) 
(68.0, 93.3) 
(61.0, 59.0)

通过按照身高和体重的聚类,可以将上述 10 条数据分组成 3 类。

Height Weight
-------------
(67.0, 99.9) 
(68.0, 97.3) 
(66.0, 97.3) 
(68.0, 93.3)

(73.0, 72.6) 
(75.0, 81.6) 
(74.0, 77.1)

(61.0, 54.4) 
(62.0, 59.0) 
(61.0, 59.0)

分类结果可以描述为:中等身高并且很重、很高并且中等体重、矮并且轻。如果用图形来观察分组状况则结果一目了然。

K-Means 算法实现:

由于 K-Means 算法值针对给定的完整数据集进行操作,不需要任何特殊的训练数据,所以 K-Means 是一种无监督的机器学习方法(Unsupervised Machine Learning Technique)。

K-Means 算法最常见的实现方式是使用迭代式精化启发法的 Lloyd's algorithm

  • 给定划分数量 k。创建一个初始划分,从数据集中随机地选择 k 个对象,每个对象初始地代表了一个簇中心(Cluster Centroid)。对于其他对象,计算其与各个簇中心的距离,将它们划入距离最近的簇。
  • 采用迭代的重定位技术,尝试通过对象在划分间移动来改进划分。所谓重定位技术,就是当有新的对象加入簇或者已有对象离开簇的时候,重新计算簇的平均值,然后对对象进行重新分配。这个过程不断重复,直到各簇中对象不再变化为止。
randomly assign all data items to a cluster 
loop until no change in cluster assignments 
  compute centroids for each cluster 
  reassign each data item to cluster of closest centroid 
end

简洁点儿的表述即为:

initialize clustering 
loop 
  update centroids 
  update clustering 
end loop

应用 K-Means 算法到上述身高与体重的示例,聚类过程如下图所示。

K-Means 优缺点:

当结果簇是密集的,而且簇和簇之间的区别比较明显时,K-Means 的效果较好。对于大数据集,K-Means 是相对可伸缩的和高效的,它的复杂度是 O(nkt),n 是对象的个数,k 是簇的数目,t 是迭代的次数,通常 k << n,且 t << n,所以算法经常以局部最优结束。

K-Means 的最大问题是要求先给出 k 的个数。k 的选择一般基于经验值和多次实验结果,对于不同的数据集,k 的取值没有可借鉴性。另外,K-Means 对孤立点数据是敏感的,少量噪声数据就能对平均值造成极大的影响。

Basic K-Means - Lloyd's algorithm C# 代码实现:

Code below referenced from Machine Learning Using C# Succinctly by James McCaffrey, and article K-Means Data Clustering Using C#.

  1 using System;
  2 
  3 namespace ClusterNumeric
  4 {
  5   class ClusterNumProgram
  6   {
  7     static void Main(string[] args)
  8     {
  9       Console.WriteLine("\nBegin k-means clustering demo\n");
 10 
 11       double[][] rawData = new double[10][];
 12       rawData[0] = new double[] { 73, 72.6 };
 13       rawData[1] = new double[] { 61, 54.4 };
 14       rawData[2] = new double[] { 67, 99.9 };
 15       rawData[3] = new double[] { 68, 97.3 };
 16       rawData[4] = new double[] { 62, 59.0 };
 17       rawData[5] = new double[] { 75, 81.6 };
 18       rawData[6] = new double[] { 74, 77.1 };
 19       rawData[7] = new double[] { 66, 97.3 };
 20       rawData[8] = new double[] { 68, 93.3 };
 21       rawData[9] = new double[] { 61, 59.0 };
 22 
 23       Console.WriteLine("Raw unclustered height (in.) weight (kg.) data:\n");
 24       Console.WriteLine(" ID Height Weight");
 25       Console.WriteLine("---------------------");
 26       ShowData(rawData, 1, true, true);
 27 
 28       int numClusters = 3;
 29       Console.WriteLine("\nSetting numClusters to " + numClusters);
 30 
 31       Console.WriteLine("Starting clustering using k-means algorithm");
 32       Clusterer c = new Clusterer(numClusters);
 33       int[] clustering = c.Cluster(rawData);
 34       Console.WriteLine("Clustering complete\n");
 35 
 36       Console.WriteLine("Final clustering in internal form:\n");
 37       ShowVector(clustering, true);
 38 
 39       Console.WriteLine("Raw data by cluster:\n");
 40       Console.WriteLine(" ID Height Weight");
 41       ShowClustered(rawData, clustering, numClusters, 1);
 42 
 43       Console.WriteLine("\nEnd k-means clustering demo\n");
 44       Console.ReadLine();
 45     }
 46 
 47     static void ShowData(
 48       double[][] data, int decimals,
 49       bool indices, bool newLine)
 50     {
 51       for (int i = 0; i < data.Length; ++i)
 52       {
 53         if (indices == true)
 54           Console.Write(i.ToString().PadLeft(3) + " ");
 55 
 56         for (int j = 0; j < data[i].Length; ++j)
 57         {
 58           double v = data[i][j];
 59           Console.Write(v.ToString("F" + decimals) + "   ");
 60         }
 61 
 62         Console.WriteLine("");
 63       }
 64 
 65       if (newLine == true)
 66         Console.WriteLine("");
 67     }
 68 
 69     static void ShowVector(int[] vector, bool newLine)
 70     {
 71       for (int i = 0; i < vector.Length; ++i)
 72         Console.Write(vector[i] + " ");
 73 
 74       if (newLine == true)
 75         Console.WriteLine("\n");
 76     }
 77 
 78     static void ShowClustered(
 79       double[][] data, int[] clustering,
 80       int numClusters, int decimals)
 81     {
 82       for (int k = 0; k < numClusters; ++k)
 83       {
 84         Console.WriteLine("===================");
 85         for (int i = 0; i < data.Length; ++i)
 86         {
 87           int clusterID = clustering[i];
 88           if (clusterID != k) continue;
 89           Console.Write(i.ToString().PadLeft(3) + " ");
 90           for (int j = 0; j < data[i].Length; ++j)
 91           {
 92             double v = data[i][j];
 93             Console.Write(v.ToString("F" + decimals) + "   ");
 94           }
 95           Console.WriteLine("");
 96         }
 97         Console.WriteLine("===================");
 98       }
 99     }
100   }
101 
102   public class Clusterer
103   {
104     private int numClusters; // number of clusters 
105     private int[] clustering; // index = a tuple, value = cluster ID 
106     private double[][] centroids; // mean (vector) of each cluster 
107     private Random rnd; // for initialization 
108 
109     public Clusterer(int numClusters)
110     {
111       this.numClusters = numClusters;
112       this.centroids = new double[numClusters][];
113       this.rnd = new Random(0); // arbitrary seed 
114     }
115 
116     public int[] Cluster(double[][] data)
117     {
118       int numTuples = data.Length;
119       int numValues = data[0].Length;
120       this.clustering = new int[numTuples];
121 
122       for (int k = 0; k < numClusters; ++k) // allocate each centroid 
123         this.centroids[k] = new double[numValues];
124 
125       InitRandom(data);
126 
127       Console.WriteLine("\nInitial random clustering:");
128       for (int i = 0; i < clustering.Length; ++i)
129         Console.Write(clustering[i] + " ");
130       Console.WriteLine("\n");
131 
132       bool changed = true; // change in clustering? 
133       int maxCount = numTuples * 10; // sanity check 
134       int ct = 0;
135       while (changed == true && ct <= maxCount)
136       {
137         ++ct; // k-means typically converges very quickly 
138         UpdateCentroids(data); // no effect if fail 
139         changed = UpdateClustering(data); // no effect if fail 
140       }
141 
142       int[] result = new int[numTuples];
143       Array.Copy(this.clustering, result, clustering.Length);
144       return result;
145     }
146 
147     private void InitRandom(double[][] data)
148     {
149       int numTuples = data.Length;
150 
151       int clusterID = 0;
152       for (int i = 0; i < numTuples; ++i)
153       {
154         clustering[i] = clusterID++;
155         if (clusterID == numClusters)
156           clusterID = 0;
157       }
158       for (int i = 0; i < numTuples; ++i)
159       {
160         int r = rnd.Next(i, clustering.Length);
161         int tmp = clustering[r];
162         clustering[r] = clustering[i];
163         clustering[i] = tmp;
164       }
165     }
166 
167     private void UpdateCentroids(double[][] data)
168     {
169       int[] clusterCounts = new int[numClusters];
170       for (int i = 0; i < data.Length; ++i)
171       {
172         int clusterID = clustering[i];
173         ++clusterCounts[clusterID];
174       }
175 
176       // zero-out this.centroids so it can be used as scratch 
177       for (int k = 0; k < centroids.Length; ++k)
178         for (int j = 0; j < centroids[k].Length; ++j)
179           centroids[k][j] = 0.0;
180 
181       for (int i = 0; i < data.Length; ++i)
182       {
183         int clusterID = clustering[i];
184         for (int j = 0; j < data[i].Length; ++j)
185           centroids[clusterID][j] += data[i][j]; // accumulate sum 
186       }
187 
188       for (int k = 0; k < centroids.Length; ++k)
189         for (int j = 0; j < centroids[k].Length; ++j)
190           centroids[k][j] /= clusterCounts[k]; // danger? 
191     }
192 
193     private bool UpdateClustering(double[][] data)
194     {
195       // (re)assign each tuple to a cluster (closest centroid) 
196       // returns false if no tuple assignments change OR 
197       // if the reassignment would result in a clustering where 
198       // one or more clusters have no tuples. 
199 
200       bool changed = false; // did any tuple change cluster? 
201 
202       int[] newClustering = new int[clustering.Length]; // proposed result 
203       Array.Copy(clustering, newClustering, clustering.Length);
204 
205       double[] distances = new double[numClusters]; // from tuple to centroids
206 
207       for (int i = 0; i < data.Length; ++i) // walk through each tuple 
208       {
209         for (int k = 0; k < numClusters; ++k)
210           distances[k] = Distance(data[i], centroids[k]);
211 
212         int newClusterID = MinIndex(distances); // find closest centroid 
213         if (newClusterID != newClustering[i])
214         {
215           changed = true; // note a new clustering 
216           newClustering[i] = newClusterID; // accept update 
217         }
218       }
219 
220       if (changed == false)
221         return false; // no change so bail 
222 
223       // check proposed clustering cluster counts 
224       int[] clusterCounts = new int[numClusters];
225       for (int i = 0; i < data.Length; ++i)
226       {
227         int clusterID = newClustering[i];
228         ++clusterCounts[clusterID];
229       }
230 
231       for (int k = 0; k < numClusters; ++k)
232         if (clusterCounts[k] == 0)
233           return false; // bad clustering 
234 
235       Array.Copy(newClustering, clustering, newClustering.Length); // update 
236       return true; // good clustering and at least one change 
237     }
238 
239     // Euclidean distance between two vectors for UpdateClustering() 
240     private static double Distance(double[] tuple, double[] centroid)
241     {
242       double sumSquaredDiffs = 0.0;
243       for (int j = 0; j < tuple.Length; ++j)
244         sumSquaredDiffs += (tuple[j] - centroid[j]) * (tuple[j] - centroid[j]);
245       return Math.Sqrt(sumSquaredDiffs);
246     }
247 
248     // helper for UpdateClustering() to find closest centroid 
249     private static int MinIndex(double[] distances)
250     {
251       int indexOfMin = 0;
252       double smallDist = distances[0];
253       for (int k = 1; k < distances.Length; ++k)
254       {
255         if (distances[k] < smallDist)
256         {
257           smallDist = distances[k];
258           indexOfMin = k;
259         }
260       }
261       return indexOfMin;
262     }
263   }
264 }

运行结果如下:

参考资料

本篇文章《K-Means 聚类算法》由 Dennis Gao 发表自博客园个人博客,未经作者本人同意禁止以任何的形式转载,任何自动的或人为的爬虫转载行为均为耍流氓。

posted @ 2015-02-09 23:06 匠心十年 阅读(...) 评论(...) 编辑 收藏