K-Means Clustering Algorithm

(Original article; please credit the source when reposting!)

I. The K-Means Clustering Algorithm

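The training data for K-means clustering are vectors. Suppose the sample points are three-dimensional vectors with no class labels. The algorithm then proceeds as follows:

Step 1: Choose the number of cluster centers, for example 3, and initialize the centers, for example μ₁, μ₂, μ₃.

Step 2: For every sample point x in the training data, compute its distance to each of the three cluster centers: ||x-μ₁||₂, ||x-μ₂||₂, ||x-μ₃||₂. Compare the three distances and assign the sample point to the cluster whose center is nearest.

Step 3: Update the cluster centers. For μ₁, for example, add up the sample vectors that Step 2 assigned to μ₁ and divide by the number of those samples to obtain the new μ₁.

Step 4: Using the new cluster centers, compute the distance from every sample point to the center it belongs to, and sum these distances.

If the newly computed sum equals the sum from the previous iteration (or the difference falls below a threshold), stop training; otherwise go back to Step 2 and continue iterating.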
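The four steps above can be sketched compactly in R on toy data. This is only an illustrative sketch: the random 20-point data set, the variable names (x, mu, label, J), the tolerance 1e-9, and the choice to leave an empty cluster's center unchanged are all assumptions made for the example.

```r
set.seed(1)                                   # reproducible toy data
x  <- matrix(runif(60), ncol = 3)             # 20 sample points in 3-D
k  <- 3
mu <- x[sample.int(nrow(x), k), , drop = FALSE]   # Step 1: use k samples as centers

lastJ <- Inf
repeat {
    # Step 2: squared distance from every point to every center,
    # then the index of the nearest center for each point
    d2 <- sapply(seq_len(k), function(m) rowSums(sweep(x, 2, mu[m, ])^2))
    label <- max.col(-d2, ties.method = "first")

    # Step 3: move each center to the mean of its points
    # (assumption: an empty cluster keeps its old center)
    for (m in seq_len(k)) {
        if (any(label == m)) {
            mu[m, ] <- colMeans(x[label == m, , drop = FALSE])
        }
    }

    # Step 4: sum of distances from the points to their own centers
    J <- sum(sqrt(rowSums((x - mu[label, , drop = FALSE])^2)))
    if (abs(lastJ - J) < 1e-9) break          # sum unchanged: stop
    lastJ <- J
}
print(mu)
```

Once the assignments stop changing, the centers and the distance sum repeat exactly, so the loop ends on the following pass.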

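II. Training Data

The training data is a single TIFF image. The program reads the RGB value of every pixel in the image, so each training sample is a three-dimensional vector (R, G, B).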
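To make the "one pixel, one sample" idea concrete, here is a small sketch of turning an image array into a matrix of samples. The random 4 x 5 array is a hypothetical stand-in for the height x width x 3 array that an image reader would return, so the example runs without an image file.

```r
# Hypothetical stand-in for a decoded image: 4 x 5 pixels, 3 channels in [0, 1]
img <- array(runif(4 * 5 * 3), dim = c(4, 5, 3))

# One row per pixel, one column per channel (R, G, B)
pixels <- matrix(img, ncol = 3)
print(dim(pixels))

# Because R stores arrays column by column,
# pixel (i, j) lands in row (j - 1) * nrow(img) + i
i <- 3; j <- 2
print(all(pixels[(j - 1) * nrow(img) + i, ] == img[i, j, ]))
```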

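III. Implementation

The implementation in R is as follows. Note that the loop stops once the cluster centers no longer change between iterations; the distortion value is computed and printed only for monitoring.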

## Parameters:
##     numCentroid  -  number of cluster centroids
##     src          -  file path of the source TIFF image
##     to           -  file path where the processed TIFF image is saved
kmeansTiffFun <- function(numCentroid, src, to=NULL) {
    ## assumes an RGB image with exactly three channels
    theImage <- tiff::readTIFF(src)
    nRow <- dim(theImage)[1]
    nCol <- dim(theImage)[2]

    ## initialize the centroids with randomly chosen pixels
    centroids <- matrix(0, nrow=numCentroid, ncol=3)
    centroidsLast <- matrix(-1, nrow=numCentroid, ncol=3)
    for (k in 1:numCentroid) {
        centroids[k,] <- theImage[sample.int(nRow, 1), sample.int(nCol, 1),]
    }

    ## train until the centroids stop changing
    trainingNum <- 0
    distortionJ <- numeric(0)
    pixelCentroid <- matrix(0, nrow=nRow, ncol=nCol)
    while (sum(abs(centroidsLast - centroids)) != 0) {
        trainingNum <- trainingNum + 1
        cat("-----------------------loop begin-----------------------", trainingNum, "\n")
        centroidsLast <- centroids

        ## assign each pixel to its nearest centroid
        for (i in 1:nRow) {
            for (j in 1:nCol) {
                distanceMin <- Inf
                for (k in 1:numCentroid) {
                    pixelDistance <- sqrt(sum((theImage[i,j,] - centroids[k,])^2))
                    if (distanceMin > pixelDistance) {
                        pixelCentroid[i,j] <- k
                        distanceMin <- pixelDistance
                    }
                }
            }
        }

        ## update each centroid to the mean of its pixels
        for (k in 1:numCentroid) {
            sumCentroid <- numeric(3)
            countPixel <- 0
            for (i in 1:nRow) {
                for (j in 1:nCol) {
                    if (pixelCentroid[i,j] == k) {
                        sumCentroid <- sumCentroid + theImage[i,j,]
                        countPixel <- countPixel + 1
                    }
                }
            }
            ## an empty cluster keeps its old centroid (avoids dividing by zero)
            if (countPixel > 0) {
                centroids[k,] <- sumCentroid / countPixel
            }
        }

        ## distortion: square root of the total squared distance
        ## from every pixel to its centroid (for monitoring only)
        distortionJ[trainingNum] <- 0
        for (i in 1:nRow) {
            for (j in 1:nCol) {
                distortionJ[trainingNum] <- distortionJ[trainingNum] +
                    sum((theImage[i,j,] - centroids[pixelCentroid[i,j],])^2)
            }
        }
        distortionJ[trainingNum] <- sqrt(distortionJ[trainingNum])

        print(centroids)
        print(sum(abs(centroidsLast - centroids)))
        print(distortionJ)
        cat("-----------------------loop end-----------------------", trainingNum, "\n")
    }

    ## replace every pixel with the color of its centroid
    for (i in 1:nRow) {
        for (j in 1:nCol) {
            theImage[i,j,] <- centroids[pixelCentroid[i,j],]
        }
    }

    ## write the updated image to file
    if (is.null(to)) {
        ## default name: source path plus a timestamp
        ## (no colons, which are not allowed in Windows file names)
        tiffDestination <- paste(src, ".",
                                 format(Sys.time(), "%Y-%m-%d_%H-%M-%OS3"),
                                 ".tiff",
                                 sep="")
    } else {
        tiffDestination <- to
    }
    tiff::writeTIFF(theImage, tiffDestination)
}
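As a cross-check for a hand-written loop like the one above, the same color quantization can be sketched with the kmeans() function that ships with R's stats package. This swaps in the built-in routine and is not the author's implementation; the random 8 x 8 array is a hypothetical stand-in for a decoded image, and 4 clusters is an arbitrary choice for the example.

```r
set.seed(42)
# Hypothetical stand-in for a decoded image: 8 x 8 pixels, 3 channels in [0, 1]
img <- array(runif(8 * 8 * 3), dim = c(8, 8, 3))
pixels <- matrix(img, ncol = 3)               # one row per pixel

# Built-in k-means; algorithm = "Lloyd" matches the assign/update steps
# described in this article
km <- kmeans(pixels, centers = 4, iter.max = 50, algorithm = "Lloyd")

# Replace every pixel with its cluster center and restore the image shape
quantized <- array(km$centers[km$cluster, ], dim = dim(img))

# Number of distinct colors left in the quantized image
print(nrow(unique(matrix(quantized, ncol = 3))))
```

Working on the whole pixel matrix at once also avoids the nested per-pixel loops, which tends to matter for larger images.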


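IV. Results

With 16 cluster centers, a 183 × 126 TIFF image was processed, which gives 23,058 training sample points.

The cluster centers obtained from one training run: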

# centroids of another training
           [,1]      [,2]       [,3]
 [1,] 0.7015740 0.7293473 0.61836154
 [2,] 0.8640015 0.9052832 0.95715333
 [3,] 0.6330738 0.8002675 0.95331941
 [4,] 0.4238570 0.4009543 0.09462565
 [5,] 0.9388128 0.7597764 0.02820818
 [6,] 0.6530845 0.4052968 0.02406638
 [7,] 0.7657655 0.5094657 0.02542812
 [8,] 0.3560828 0.1639515 0.01604344
 [9,] 0.1433094 0.1297006 0.03679972
[10,] 0.8578468 0.8109360 0.28439512
[11,] 0.4068034 0.7147497 0.94935582
[12,] 0.8547880 0.6246993 0.03221821
[13,] 0.9818130 0.9053125 0.02432069
[14,] 0.5279613 0.2750294 0.01976112
[15,] 0.5883928 0.5643909 0.18766832
[16,] 0.2741457 0.2700776 0.06653274
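Each row of this output is one color, with the three columns read as R, G and B intensities between 0 and 1. As a small illustration (the variable names are made up for the example), a row can be turned into a familiar hex color string with base R's rgb():

```r
# First centroid row copied from the output above (values in [0, 1])
centroid <- c(0.7015740, 0.7293473, 0.61836154)

# rgb() takes intensities in [0, 1] by default and returns a "#RRGGBB" string
hexColor <- rgb(centroid[1], centroid[2], centroid[3])
print(hexColor)
```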


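The image before and after processing:

The colors of the processed image are visibly less sharp and vivid than those of the original. In exchange, an image that uses only 16 colors takes far less storage space.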
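A rough back-of-the-envelope estimate of that saving can be written out as follows. It rests on two assumptions that are not stated in the article: the original stores 8 bits per channel, and the quantized image is stored as a palette (one small index per pixel plus a 16-entry color table). The function in section III writes full RGB values, so this is an estimate of what a palette encoding could achieve, not a measurement of its output files.

```r
nPixel   <- 183 * 126          # pixels in the image from the results section
nCluster <- 16                 # number of cluster centers (colors)

# Assumption: plain RGB at 8 bits per channel, i.e. 3 bytes per pixel
rawBytes <- nPixel * 3

# Assumption: palette storage, one index per pixel plus a 16-entry color table
bitsPerIndex <- ceiling(log2(nCluster))
indexedBytes <- ceiling(nPixel * bitsPerIndex / 8) + nCluster * 3

cat(rawBytes, indexedBytes, rawBytes / indexedBytes, "\n")
```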

Posted 2014-08-07 21:33 by activeshj
