Clustering Algorithms
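Clustering is a family of machine learning methods for grouping data. Given a dataset, a clustering algorithm partitions it into a number of groups. Ideally, points in the same group share similar attributes or features, while points in different groups differ markedly. Clustering is an unsupervised learning technique and a common data-analysis tool used in many fields. It is also widely used in radar signal processing, where clustering radar signals can help identify and classify different types of targets, detect anomalies or faults, and extract useful information. This article summarizes some commonly used clustering algorithms.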
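K-means Clustering
K-means implementation steps
K-means is a common clustering algorithm. It works in the following steps:
- Initialization: choose the number of clusters k and randomly select k points as the initial cluster centers.
- Assignment: for each data point, compute its distance to every cluster center and assign the point to the nearest cluster.
- Update: for each cluster, compute the mean of all its data points and move the cluster center to that mean.
- Iteration: repeat the assignment and update steps until a stopping condition is met (for example, the cluster centers no longer change or the maximum number of iterations is reached).
- Output: return the cluster assignments and the final cluster centers.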
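The goal of K-means is to minimize the total squared distance between each data point and the center of its cluster, also known as the sum of squared errors (SSE). Each assignment step and each update step either lowers the SSE or leaves it unchanged, so the iteration converges. It converges only to a local minimum that depends on the initial centers, though, not necessarily to the global optimum.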
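To make the objective concrete, here is a minimal sketch of how the SSE could be computed for a given assignment. The `Point` type, the `kmeans_sse` function, and the `labels` array are hypothetical names introduced only for this illustration; they are not part of the article's implementation.

```c
#include <stddef.h>

/* Hypothetical 2-D point type used only in this sketch. */
typedef struct {
    double x;
    double y;
} Point;

/* Sum of squared Euclidean distances between each point and the center
 * of the cluster it is assigned to. labels[i] is the cluster index of
 * point i and must be a valid index into centers. */
double kmeans_sse(const Point *points, const int *labels, size_t n,
                  const Point *centers)
{
    double sse = 0.0;
    for (size_t i = 0; i < n; i++) {
        const double dx = points[i].x - centers[labels[i]].x;
        const double dy = points[i].y - centers[labels[i]].y;
        sse += dx * dx + dy * dy;
    }
    return sse;
}
```

For example, the four points (0,0), (2,0), (10,0), (12,0) with centers (1,0) and (11,0) each lie at distance 1 from their center, so the SSE is 4.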
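The key parameter of K-means is the number of clusters k. Choosing a suitable k matters for both the accuracy and the interpretability of the result. Heuristics such as the elbow method or the silhouette coefficient are commonly used to help choose k.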
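The elbow method mentioned above can be sketched as follows: run K-means for several values of k, record the final SSE for each, and look for the k after which the SSE stops dropping sharply. This is only an illustration under simplifying assumptions: the data are one-dimensional and sorted in ascending order, the initial centers are taken at evenly spaced positions in the data rather than at random, and the names `kmeans_1d_sse`, `elbow_curve`, and `SKETCH_MAX_K` are made up for this example.

```c
#include <stddef.h>

#define SKETCH_MAX_K 16 /* largest k supported by this sketch */

/* Run K-means on 1-D data and return the final SSE, or -1.0 on bad input.
 * Assumption: x is sorted ascending; the initial centers are taken at
 * evenly spaced positions in x. */
double kmeans_1d_sse(const double *x, size_t n, int k)
{
    double c[SKETCH_MAX_K];
    double sse = 0.0;

    if (k < 1 || k > SKETCH_MAX_K || n < (size_t)k)
        return -1.0;
    for (int j = 0; j < k; j++)
        c[j] = (k == 1) ? x[0] : x[(size_t)j * (n - 1) / (size_t)(k - 1)];

    for (int iter = 0; iter < 100; iter++) {
        double sum[SKETCH_MAX_K] = {0.0};
        size_t cnt[SKETCH_MAX_K] = {0};
        int changed = 0;

        /* Assignment step: nearest center, accumulating the SSE as we go. */
        sse = 0.0;
        for (size_t i = 0; i < n; i++) {
            int best = 0;
            double best_d = (x[i] - c[0]) * (x[i] - c[0]);
            for (int j = 1; j < k; j++) {
                const double d = (x[i] - c[j]) * (x[i] - c[j]);
                if (d < best_d) {
                    best_d = d;
                    best = j;
                }
            }
            sum[best] += x[i];
            cnt[best]++;
            sse += best_d;
        }
        /* Update step: an empty cluster keeps its previous center. */
        for (int j = 0; j < k; j++) {
            if (cnt[j] == 0)
                continue;
            const double mean = sum[j] / (double)cnt[j];
            if (mean != c[j]) {
                c[j] = mean;
                changed = 1;
            }
        }
        if (!changed)
            break; /* sse now matches the final centers */
    }
    return sse;
}

/* Fill sse_out[k - 1] with the final SSE for k = 1 .. k_max. */
void elbow_curve(const double *x, size_t n, int k_max, double *sse_out)
{
    for (int k = 1; k <= k_max; k++)
        sse_out[k - 1] = kmeans_1d_sse(x, n, k);
}
```

With the nine values 1.0, 1.1, 1.2, 5.0, 5.1, 5.2, 9.0, 9.1, 9.2, the SSE falls from roughly 96 at k = 1 to roughly 24 at k = 2 and roughly 0.06 at k = 3, then barely improves at k = 4, which places the elbow at k = 3.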
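Note that K-means is sensitive to the choice of initial cluster centers: different initial centers can lead to different clustering results. For more stable and reliable results, run the algorithm several times from different random initial centers and keep the run with the smallest SSE.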
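The restart strategy can be sketched roughly like this. To keep the example short it again works on one-dimensional data, and it uses a small hand-written pseudo-random generator so that each restart is reproducible from its seed. All names (`next_rand`, `kmeans_1d_run`, `kmeans_1d_best_of`, `SKETCH_MAX_K`) are assumptions made for this illustration.

```c
#include <stddef.h>
#include <float.h>

#define SKETCH_MAX_K 8 /* largest k supported by this sketch */

/* Small linear congruential generator, used so the sketch does not
 * depend on the global state of rand(). */
static unsigned next_rand(unsigned *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state >> 16;
}

/* One K-means run on 1-D data, started from k randomly chosen data
 * points with distinct indices. Returns the final SSE, or -1.0 on bad
 * input. */
double kmeans_1d_run(const double *x, size_t n, int k, unsigned seed)
{
    double c[SKETCH_MAX_K];
    size_t picked[SKETCH_MAX_K];
    double sse = 0.0;

    if (k < 1 || k > SKETCH_MAX_K || n < (size_t)k)
        return -1.0;
    for (int j = 0; j < k; j++) {
        int duplicate;
        do { /* redraw until the index has not been used yet */
            picked[j] = next_rand(&seed) % n;
            duplicate = 0;
            for (int m = 0; m < j; m++)
                if (picked[m] == picked[j])
                    duplicate = 1;
        } while (duplicate);
        c[j] = x[picked[j]];
    }

    for (int iter = 0; iter < 100; iter++) {
        double sum[SKETCH_MAX_K] = {0.0};
        size_t cnt[SKETCH_MAX_K] = {0};
        int changed = 0;

        sse = 0.0;
        for (size_t i = 0; i < n; i++) { /* assignment step */
            int best = 0;
            double best_d = (x[i] - c[0]) * (x[i] - c[0]);
            for (int j = 1; j < k; j++) {
                const double d = (x[i] - c[j]) * (x[i] - c[j]);
                if (d < best_d) {
                    best_d = d;
                    best = j;
                }
            }
            sum[best] += x[i];
            cnt[best]++;
            sse += best_d;
        }
        for (int j = 0; j < k; j++) { /* update step */
            if (cnt[j] == 0)
                continue; /* empty cluster keeps its center */
            const double mean = sum[j] / (double)cnt[j];
            if (mean != c[j]) {
                c[j] = mean;
                changed = 1;
            }
        }
        if (!changed)
            break;
    }
    return sse;
}

/* Run K-means `restarts` times with seeds 1, 2, ... and return the
 * smallest final SSE seen (DBL_MAX if no run succeeded). */
double kmeans_1d_best_of(const double *x, size_t n, int k, int restarts)
{
    double best = DBL_MAX;
    for (int r = 0; r < restarts; r++) {
        const double sse = kmeans_1d_run(x, n, k, (unsigned)r + 1u);
        if (sse >= 0.0 && sse < best)
            best = sse;
    }
    return best;
}
```

A real implementation would also keep the centers and assignments of the best run, not just its SSE; they are omitted here for brevity.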
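Advantages and disadvantages of K-means
From the principle described above, the advantages and disadvantages of K-means follow readily:
Advantages:
- Simple and effective: K-means is simple and intuitive, easy to implement and easy to understand.
- Scalable: the work per iteration grows linearly with the number of data points, so K-means copes well with large datasets.
- Efficient: one iteration costs O(nkd) for n points, k clusters, and d dimensions, so the cost also grows only linearly with the dimension. This usually makes K-means faster than many other clustering algorithms; standard agglomerative hierarchical clustering, for example, needs at least quadratic time in n.
- Interpretable results: the result is intuitive. Every data point belongs to its nearest cluster center, and each cluster can be explained and understood through its center.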
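Disadvantages:
- Sensitive to the initial centers: different initial centers can lead to different clustering results, so several runs are needed to obtain a stable result.
- The number of clusters must be fixed in advance: K-means requires k as an input, which is not always easy to determine, and an unsuitable k can give a poor clustering.
- Unsuitable for non-convex clusters: K-means assumes that clusters are convex, that is, that each cluster can be represented by a single center point. For non-convex clusters it may therefore produce inaccurate results.
- Sensitive to noise and outliers: because each center is a mean, outlying points can noticeably shift the centers and thereby change the cluster assignments.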
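The sensitivity to outliers comes directly from the update step, which uses the arithmetic mean. The small sketch below illustrates this on one-dimensional data; `mean_1d` is a hypothetical helper written for this example only.

```c
#include <stddef.h>

/* Arithmetic mean of n values; returns 0.0 for an empty input. */
double mean_1d(const double *x, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return (n > 0) ? sum / (double)n : 0.0;
}
```

For a cluster containing 1, 2, 3 the center is 2. If a single outlier with value 100 is assigned to the same cluster, the center moves to 26.5, far away from every regular member of the cluster.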
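In summary, K-means is a simple and effective clustering algorithm that suits large datasets and scenarios where interpretability matters. When applying it, pay attention to the choice of initial centers, to picking a suitable number of clusters, and to the handling of noise and outliers. For non-convex clusters, another, better-suited clustering algorithm may be needed.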
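Code implementation
MATLAB implementation
% Generate random data
rng(1);                    % fix the random seed so the results are reproducible
data = randn(100, 2);      % 100 two-dimensional random points
% Parameters
k = 3;                     % number of clusters
maxIterations = 100;       % maximum number of iterations
% Initialize the cluster centers with k randomly chosen data points
centerIndices = randperm(size(data, 1), k);
centers = data(centerIndices, :);
% Alternate between assigning points and updating centers
for iter = 1:maxIterations
    % Distance from every data point to every cluster center
    % (pdist2 requires the Statistics and Machine Learning Toolbox)
    distances = pdist2(data, centers);
    % Assign each data point to its nearest cluster
    [~, clusterIndices] = min(distances, [], 2);
    % Recompute each center as the mean of its members
    newCenters = centers;
    for i = 1:k
        members = data(clusterIndices == i, :);
        if ~isempty(members)   % an empty cluster keeps its previous center
            % Average along dimension 1 so a single-member cluster is handled correctly
            newCenters(i, :) = mean(members, 1);
        end
    end
    % Stop once the centers no longer change
    if isequal(newCenters, centers)
        break;
    end
    centers = newCenters;
end
% Plot the clustering result
colors = {'r', 'g', 'b'};  % one color per cluster
figure;
hold on;
for i = 1:k
    clusterData = data(clusterIndices == i, :);
    scatter(clusterData(:, 1), clusterData(:, 2), colors{i});
end
scatter(centers(:, 1), centers(:, 2), 'kx', 'LineWidth', 2);
title('K-means Clustering');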
Figure: K-means clustering result
C implementation
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <stdbool.h>
#include <float.h>

#define MAX_ITERATIONS (100)
#define CONVERGENCE_EPS (0.001f)

/* Return the Euclidean distance between two points. */
float point_distance(float x1, float y1, float x2, float y2)
{
    const float dx = x2 - x1;
    const float dy = y2 - y1;
    return sqrtf(dx * dx + dy * dy);
}

/* Cluster 2-D data into clusters_num clusters. On entry, centroids holds the
 * initial cluster centers; on return it holds the final ones. */
void kmeans(float data[][2], const int data_size, const int clusters_num, float centroids[][2])
{
    int *cluster_counts = malloc(clusters_num * sizeof(*cluster_counts));
    float *cluster_sums = malloc(2 * clusters_num * sizeof(*cluster_sums));
    float *new_centroids = malloc(2 * clusters_num * sizeof(*new_centroids));

    if (cluster_counts == NULL || cluster_sums == NULL || new_centroids == NULL)
    {
        free(cluster_counts);
        free(cluster_sums);
        free(new_centroids);
        return;
    }

    /* Iterate until the centers converge or the maximum number of iterations is reached. */
    for (int iterations = 0; iterations < MAX_ITERATIONS; iterations++)
    {
        /* 1 Clear the per-cluster counts and coordinate sums. */
        for (int i = 0; i < clusters_num; i++)
        {
            cluster_counts[i] = 0;
            cluster_sums[2 * i] = 0.0f;
            cluster_sums[2 * i + 1] = 0.0f;
        }

        /* 2 Assign each point to the nearest cluster. */
        for (int i = 0; i < data_size; i++)
        {
            float min_dist = FLT_MAX;
            int nearest_cluster = 0;
            for (int j = 0; j < clusters_num; j++)
            {
                const float dist = point_distance(data[i][0], data[i][1], centroids[j][0], centroids[j][1]);
                if (dist < min_dist)
                {
                    min_dist = dist;
                    nearest_cluster = j;
                }
            }
            /* 3 Update the count and sum of the cluster to which this point belongs. */
            cluster_counts[nearest_cluster]++;
            cluster_sums[2 * nearest_cluster] += data[i][0];
            cluster_sums[2 * nearest_cluster + 1] += data[i][1];
        }

        /* 4 Calculate the new cluster centers. An empty cluster keeps its
         *   previous center, which avoids a division by zero. */
        for (int k = 0; k < clusters_num; k++)
        {
            if (cluster_counts[k] > 0)
            {
                new_centroids[2 * k] = cluster_sums[2 * k] / (float)cluster_counts[k];
                new_centroids[2 * k + 1] = cluster_sums[2 * k + 1] / (float)cluster_counts[k];
            }
            else
            {
                new_centroids[2 * k] = centroids[k][0];
                new_centroids[2 * k + 1] = centroids[k][1];
            }
        }

        /* 5 Check whether any cluster center has moved; if not, stop iterating. */
        bool centroids_changed = false;
        for (int i = 0; i < clusters_num; i++)
        {
            if (point_distance(centroids[i][0], centroids[i][1], new_centroids[2 * i], new_centroids[2 * i + 1]) > CONVERGENCE_EPS)
            {
                centroids_changed = true;
                break;
            }
        }
        if (!centroids_changed)
        {
            break;
        }

        /* 6 Update the cluster centers. */
        for (int i = 0; i < clusters_num; i++)
        {
            centroids[i][0] = new_centroids[2 * i];
            centroids[i][1] = new_centroids[2 * i + 1];
        }
    }

    for (int i = 0; i < clusters_num; i++)
    {
        printf("i %d,centroids[i][0]:%f, centroids[i][1]:%f\r\n", i, centroids[i][0], centroids[i][1]);
    }

    free(cluster_counts);
    free(cluster_sums);
    free(new_centroids);
}

float data[][2] =
{
-0.649013765191241,0.365489884844483,
1.18116604196553,-1.09706118485406,
-0.758453297283692,1.93021303957797,
-1.10961303850152,0.622936146753982,
-0.845551240007797,0.657284334157919,
-0.572664866457950,-1.46338342995195,
-0.558680764473972,0.853935439880633,
0.178380225849766,0.580489017953016,
-0.196861446475943,-0.918600977803510,
0.586442621667069,0.794865099567556,
-0.851886969622469,0.517535108228584,
0.800320709801823,0.494614479368152,
-1.50940472473439,0.663929749577619,
0.875874147834533,-0.710171982852718,
-0.242789536333340,-1.30683763858013,
0.166813439453503,-0.741588798931374,
-1.96541870928278,-1.46765924999935,
-1.27007139263854,-0.391674702884971,
1.17517126546302,0.841658856202464,
2.02916018474976,0.0827838447500025,
-0.275157240675694,0.314671414095507,
0.603658445825815,0.789805048123364,
1.78125189324250,-0.801223905939208,
1.77365832632615,-0.325654259400935,
-1.86512257453063,0.284676319250753,
-1.05110705924059,1.30961815303393,
-0.417382047996795,0.160373337872621,
1.40216228633781,-2.11818831683966,
-1.36774699097611,0.707080597764518,
-0.292534999151874,-1.04341356999711,
1.27084843418894,1.06820660928836,
0.0660093412882059,-0.317233553000261,
0.451290213630776,1.47967667627648,
-0.322209718011896,0.699087988758093,
0.788409216227425,0.159099278102292,
0.928736046813314,-0.945480508446840,
-0.490790376269763,-0.793006508081058,
1.79720058425494,-2.04923899287211,
0.590696551205452,-2.35883505694189,
-0.635785737847226,-1.65926925495582,
0.603346612845761,-0.958123762757588,
-0.535247967775900,0.225729866287135,
-0.155080385492789,0.217665351270991,
0.612122370772160,-0.823239284502371,
-1.04434349451734,-1.01276803434617,
-0.345631908307050,1.21525791551081,
-1.17140482049761,0.156275375341672,
-0.685586780437283,-0.400257312164849,
0.926216394168962,-0.441779076742656,
-1.48167521167231,0.448102274787604,
-0.558057808685045,-1.66459080006458,
-0.0284531115706568,0.214890241522152,
-1.47629235201010,0.549563133368974,
0.258899957160403,1.39233785356506,
-2.01869095243834,-0.619227629415020,
0.199740262298379,-0.0126012833273179,
0.425864319131210,0.773612390473314,
-1.27004345059705,1.62921206030486,
-0.485218835743043,-1.40997503421309,
0.594307616829848,-1.74728260275247,
-0.276464906639256,-0.472246143478651,
-1.85758288592737,-0.0600882414964135,
0.0407308117494288,0.438878570591521,
0.282970177161990,0.201222206924178,
0.0635612193024994,-0.583298376637002,
0.433430065111595,0.764796690674390,
0.422860364487685,0.140769815626260,
1.29952829655200,-0.372936894896390,
-1.04979323447507,0.105466931719111,
-1.78641172211092,1.27062430143441,
0.816043081031918,0.499129927163972,
-0.328208543142512,-0.397024986016533,
-1.21456561358767,-1.78996819380666,
1.11183287253465,-0.266894331854867,
-0.507496954829846,0.178431074753012,
0.898730486034072,-0.434192174480561,
0.377215659958544,0.464513248320177,
1.45239164558790,-1.12144527588722,
0.446945073178942,-0.359075138759723,
0.645824788453030,0.532266690356129,
-0.623677409296163,-1.64350780294275,
-0.595236431548712,0.466899224814047,
1.61132368718055,0.112107258838170,
-0.348998045314167,1.49654440857240,
0.164167484938754,-0.586502255515034,
-1.63657708517891,-1.71893202713396,
0.581365555343623,0.741040148219291,
-0.128905996910632,1.08769551523333,
0.432858634222399,0.756670133381815,
-0.245109040039237,1.62969395971859,
-1.08543038934632,-1.37499337757164,
1.68080151955536,-1.05201055264159,
0.176411940863882,0.477517893688704,
-2.07143962693628,1.22217693430612,
0.211089334851037,2.37073224069742,
-0.582847822547194,0.114585880363316,
0.0181688430923922,0.279069293453606,
1.49477799287395,0.752079823603291,
-0.424796733441211,-0.260256948894717,
1.68624315536028,-0.0259932042401371
};
/* Initial cluster centers: three points taken from the data set above. */
float centroids[3][2] =
{
    {-0.535247967775900, 0.225729866287135},
    {-0.328208543142512, -0.397024986016533},
    {-0.649013765191241, 0.365489884844483},
};

int main(void)
{
    kmeans(data, (int)(sizeof(data) / sizeof(data[0])), 3, centroids);
    return 0;
}
Comparing the output of the C implementation with the MATLAB results shows that the two agree, which completes the simulation.