K-Means Clustering Algorithm
(Original article; please credit the source when reposting!)
1. The K-means clustering algorithm
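K-means clustering is trained on vectors. Suppose the sample points are three-dimensional vectors that carry no class labels. The algorithm then proceeds as follows:
Step 1: Choose the number of cluster centers, for example 3, and initialize them, for example as μ1, μ2, μ3.
Step 2: For every sample point x in the training data, compute its distance to each of the three cluster centers, e.g. ||x-μ1||². Compare the three values and assign the sample point to the cluster whose center is nearest.
Step 3: Update the cluster centers. For μ1, for example, add up the sample vectors that Step 2 assigned to μ1 and divide by the number of those samples; the result is the new μ1.
Step 4: Using the new cluster centers, compute the distance from every sample point to the center it belongs to, and sum these distances.
If the new sum equals the sum from the previous iteration (or the difference falls below a threshold), stop training; otherwise go back to Step 2 and continue iterating.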
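The four steps above can be sketched on a small matrix with one sample per row. This is only an illustration under stated assumptions: the function names (`assignStep`, `updateStep`, `distortion`, `kmeansSketch`), the tolerance `tol` and the iteration cap `maxIter` are hypothetical, and the sum in Step 4 is taken over squared distances.

```r
# Step 2: index of the nearest center for every sample (rows of x).
assignStep <- function(x, centers) {
  d2 <- matrix(0, nrow(x), nrow(centers))
  for (k in seq_len(nrow(centers))) {
    # squared Euclidean distance of every sample to center k
    d2[, k] <- rowSums(sweep(x, 2, centers[k, ])^2)
  }
  apply(d2, 1, which.min)
}

# Step 3: each center becomes the mean of the samples assigned to it.
updateStep <- function(x, labels, centers) {
  for (k in seq_len(nrow(centers))) {
    members <- labels == k
    if (any(members)) {  # an empty cluster keeps its old center
      centers[k, ] <- colMeans(x[members, , drop = FALSE])
    }
  }
  centers
}

# Step 4: sum of squared distances from each sample to its own center.
distortion <- function(x, labels, centers) {
  sum((x - centers[labels, , drop = FALSE])^2)
}

# Iterate Steps 2-4 until the distortion stops changing (within tol).
kmeansSketch <- function(x, centers, tol = 1e-9, maxIter = 100) {
  lastJ <- Inf
  for (iter in seq_len(maxIter)) {
    labels  <- assignStep(x, centers)
    centers <- updateStep(x, labels, centers)
    J <- distortion(x, labels, centers)
    if (abs(lastJ - J) <= tol) break
    lastJ <- J
  }
  list(centers = centers, labels = labels, distortion = J, iterations = iter)
}

# Four 3-D samples forming two obvious groups; Step 1 picks samples 1 and 3.
x <- rbind(c(0, 0, 0), c(0.1, 0, 0), c(1, 1, 1), c(0.9, 1, 1))
fit <- kmeansSketch(x, centers = x[c(1, 3), , drop = FALSE])
print(fit$centers)
print(fit$labels)
```

Keeping the old center for an empty cluster is one possible choice; re-initializing it from a random sample would be another.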
2. Training data
The training data is a TIFF image. The program reads the RGB value of every pixel, so each training sample is a three-dimensional vector (R, G, B).
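As a small illustration of how an image becomes a set of training samples, the sketch below reshapes a height x width x 3 array into a matrix with one (R, G, B) row per pixel. The array here is synthetic (random values in [0, 1]) and only stands in for the array an image reader would return; the names `img` and `pixels` are assumptions made for this example.

```r
# Synthetic stand-in for an RGB image: height 4, width 5, 3 channels.
set.seed(1)
img <- array(runif(4 * 5 * 3), dim = c(4, 5, 3))

# R stores arrays column-major, so each channel occupies one contiguous
# block; reshaping to 3 columns gives one row per pixel.
pixels <- matrix(img, ncol = 3)
colnames(pixels) <- c("R", "G", "B")

# The pixel at row i, column j ends up in row i + (j - 1) * height.
i <- 2; j <- 3
print(pixels[i + (j - 1) * dim(img)[1], ])
print(img[i, j, ])

# The reshape is lossless: going back restores the original array.
restored <- array(pixels, dim = dim(img))
```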
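3. Implementation
The R implementation is shown below. Note that its loop stops when the centroids no longer change between two iterations; the distortion is computed and printed in every iteration for inspection only.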
## Parameters:
##   numCentroid - number of cluster centroids
##   src         - path of the source TIFF image
##   to          - path where the processed TIFF image is saved
##                 (if omitted, a name is derived from src and the current time)
## Requires the tiff package; assumes a three-channel (RGB) image.
kmeansTiffFun <- function(numCentroid, src, to=NULL) {
    theImage <- tiff::readTIFF(src)
    numRow <- dim(theImage)[1]
    numCol <- dim(theImage)[2]

    ## initialize the centroids with randomly chosen pixels
    centroids <- matrix(0, nrow=numCentroid, ncol=3)
    centroidsLast <- matrix(-1, nrow=numCentroid, ncol=3)
    for (i in 1:numCentroid) {
        idx1 <- sample(numRow, 1)
        idx2 <- sample(numCol, 1)
        centroids[i,] <- theImage[idx1, idx2,]
    }

    ## train until the centroids stop changing
    trainingNum <- 0
    distortionJ <- numeric(1)
    pixelCentroid <- matrix(0, nrow=numRow, ncol=numCol)
    while (sum(abs(centroidsLast - centroids)) != 0) {
        trainingNum <- trainingNum + 1
        cat("-----------------------loop begin-----------------------", trainingNum, "\n")
        centroidsLast <- centroids

        ## assign each pixel to its nearest centroid
        for (i in 1:numRow) {
            for (j in 1:numCol) {
                distanceMin <- Inf
                for (k in 1:numCentroid) {
                    pixelDistance <- sqrt(sum((theImage[i,j,] - centroids[k,])^2))
                    if (distanceMin > pixelDistance) {
                        pixelCentroid[i,j] <- k
                        distanceMin <- pixelDistance
                    }
                }
            }
        }

        ## update each centroid to the mean of its pixels
        for (k in 1:numCentroid) {
            sumCentroid <- numeric(3)
            countPixel <- 0
            for (i in 1:numRow) {
                for (j in 1:numCol) {
                    if (pixelCentroid[i,j] == k) {
                        sumCentroid <- sumCentroid + theImage[i,j,]
                        countPixel <- countPixel + 1
                    }
                }
            }
            ## an empty cluster keeps its old centroid (avoids 0/0 = NaN)
            if (countPixel > 0) {
                centroids[k,] <- sumCentroid / countPixel
            }
        }

        ## calculate the distortion function
        distortionJ[trainingNum] <- 0
        for (i in 1:numRow) {
            for (j in 1:numCol) {
                k <- pixelCentroid[i,j]
                distortionJ[trainingNum] <- distortionJ[trainingNum] +
                    sum((theImage[i,j,] - centroids[k,])^2)
            }
        }
        distortionJ[trainingNum] <- sqrt(distortionJ[trainingNum])

        print(centroids)
        print(sum(abs(centroidsLast - centroids)))
        print(distortionJ)
        cat("-----------------------loop end-----------------------", trainingNum, "\n")
    }

    ## replace every pixel by the centroid of its cluster
    for (i in 1:numRow) {
        for (j in 1:numCol) {
            theImage[i,j,] <- centroids[pixelCentroid[i,j],]
        }
    }

    ## write the updated image to file
    if (length(to) == 0) {
        ## "-" instead of ":" in the time stamp keeps the file name portable
        tiffDestination <- paste(src, ".",
                                 format(Sys.time(), "%Y-%m-%d_%H-%M-%OS3"),
                                 ".tiff",
                                 sep="")
    } else {
        tiffDestination <- to
    }
    tiff::writeTIFF(theImage, tiffDestination)
}
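The triple loops above are easy to follow but slow in R. As a possible alternative, not the author's method, the same color quantization can be sketched with the built-in `kmeans` function from the stats package on a pixel matrix. The image below is synthetic so that the example runs without any image file; the variable names and the choice of 4 clusters are assumptions made for illustration.

```r
# Synthetic RGB image (height 6, width 8), values in [0, 1].
set.seed(42)
img <- array(runif(6 * 8 * 3), dim = c(6, 8, 3))

# One row per pixel, one column per channel.
pixels <- matrix(img, ncol = 3)

# Cluster the pixel colors; `centers = 4` asks for 4 clusters with
# randomly chosen initial centers.
fit <- kmeans(pixels, centers = 4, iter.max = 50)

# Replace every pixel by the center of its cluster and restore the shape.
quantized <- array(fit$centers[fit$cluster, ], dim = dim(img))

# Number of distinct colors left in the image.
print(nrow(unique(matrix(quantized, ncol = 3))))

# Hypothetical call of the function from the listing above
# (file names are placeholders):
# kmeansTiffFun(16, "photo.tiff", "photo_16colors.tiff")
```

Note that `kmeans` uses the Hartigan-Wong algorithm by default, so its centers can differ from those of the loop-based implementation even for the same starting points.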
4. Results
With 16 cluster centers, a 183 x 126 TIFF image was processed, which gives 23058 training samples.
The cluster centers obtained in one training run:
# centroids of another training
           [,1]      [,2]       [,3]
 [1,] 0.7015740 0.7293473 0.61836154
 [2,] 0.8640015 0.9052832 0.95715333
 [3,] 0.6330738 0.8002675 0.95331941
 [4,] 0.4238570 0.4009543 0.09462565
 [5,] 0.9388128 0.7597764 0.02820818
 [6,] 0.6530845 0.4052968 0.02406638
 [7,] 0.7657655 0.5094657 0.02542812
 [8,] 0.3560828 0.1639515 0.01604344
 [9,] 0.1433094 0.1297006 0.03679972
[10,] 0.8578468 0.8109360 0.28439512
[11,] 0.4068034 0.7147497 0.94935582
[12,] 0.8547880 0.6246993 0.03221821
[13,] 0.9818130 0.9053125 0.02432069
[14,] 0.5279613 0.2750294 0.01976112
[15,] 0.5883928 0.5643909 0.18766832
[16,] 0.2741457 0.2700776 0.06653274
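The image before and after processing:
The colors of the processed image are visibly less sharp and vivid than those of the original. In exchange, the image needs far less storage: only 16 distinct colors remain, so each pixel can in principle be stored as a 4-bit index into a 16-entry color table instead of a full RGB triple.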
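A rough back-of-the-envelope calculation shows where the saving can come from. It assumes 8 bits per color channel, no compression, and an indexed (palette) encoding of the quantized image; none of these details are stated in the article, and the function above writes the quantized pixels back as a regular RGB image, so the actual file sizes depend on the file format and compression used.

```r
nPixels <- 183 * 126        # image size given in the text
nColors <- 16               # number of cluster centers

# Uncompressed RGB: 3 bytes per pixel (assumed 8 bits per channel).
rawBytes <- nPixels * 3

# Indexed image: a color table plus one small index per pixel.
bitsPerIndex <- ceiling(log2(nColors))            # 4 bits for 16 colors
paletteBytes <- nColors * 3
indexedBytes <- ceiling(nPixels * bitsPerIndex / 8) + paletteBytes

print(c(raw = rawBytes, indexed = indexedBytes))
print(rawBytes / indexedBytes)                    # approximate saving factor
```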