Introduction:
Wikipedia: http://en.wikipedia.org/wiki/Cluster_analysis
Review:
1. Data clustering: a review http://eprints.iisc.ernet.in/273/01/p264-jain.pdf
2. Subspace clustering for high dimensional data: a review
3. Algorithms for clustering data http://www.cse.msu.edu/~jain/Clustering_Jain_Dubes.pdf
Tutorial: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/
Blog: http://blog.pluskid.org/?page_id=78
Algorithms:
I. Partitioning methods
1. k-means / k-modes
[Intro]: http://en.wikipedia.org/wiki/K-means_clustering
[paper] Top 10 Algorithms in Data Mining, Section 2: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.144.5575
(1) An overview of the k-means algorithm and the current state of research
[paper] Using the Triangle Inequality to Accelerate K-Means: http://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf
(1) Uses the triangle inequality to accelerate the k-means assignment step
(2) Scales to clustering vectors with up to 1000 dimensions
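The core pruning rule of Elkan's paper follows from the triangle inequality: if d(b(x), c) >= 2·d(x, b(x)) for the current best center b(x), then d(x, c) >= d(x, b(x)), so the distance from x to c never needs to be computed. A minimal pure-Python illustration of that one rule (function names and toy data are mine, not from the paper; the full algorithm also maintains upper/lower distance bounds):

```python
import math

def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_center_pruned(point, centers, assigned):
    """Find the nearest center to `point`, starting from the currently
    assigned one, skipping any candidate c ruled out by the triangle
    inequality: if d(best, c) >= 2 * d(point, best), then
    d(point, c) >= d(point, best), so c cannot win."""
    best = assigned
    best_d = dist(point, centers[assigned])
    skipped = 0
    for j, c in enumerate(centers):
        if j == best:
            continue
        if dist(centers[best], c) >= 2 * best_d:
            skipped += 1  # pruned: dist(point, c) was never computed
            continue
        d = dist(point, c)
        if d < best_d:
            best, best_d = j, d
    return best, skipped
```

For a point very close to its assigned center, most candidate centers are pruned without any point-to-center distance computation, which is where the speedup in high dimensions comes from.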
[paper] Convergence properties of the k means algorithm: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.50.3258
(1) Analyzes k-means from the gradient-descent perspective
(2) Analyzes k-means from the EM perspective
(3) Uses Newton's method to analyze the convergence of k-means
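The alternation these analyses study is batch k-means (Lloyd's algorithm): an assignment step (the E-step in the EM view) followed by a mean-update step (the M-step), repeated until assignments stop changing. A minimal pure-Python sketch, assuming initial centers are supplied (no k-means++ seeding):

```python
def kmeans(points, centers, iters=100):
    """Lloyd's algorithm: alternate the assignment step (E-step-like)
    and the mean-update step (M-step-like) until assignments are stable."""
    centers = [list(c) for c in centers]
    k, dim = len(centers), len(points[0])
    labels = [-1] * len(points)
    for _ in range(iters):
        # assignment step: each point goes to its nearest center
        new_labels = [
            min(range(k),
                key=lambda j: sum((p[d] - centers[j][d]) ** 2 for d in range(dim)))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: assignments unchanged
        labels = new_labels
        # update step: each center moves to the mean of its members
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # leave an empty cluster's center in place
                centers[j] = [sum(m[d] for m in members) / len(members)
                              for d in range(dim)]
    return centers, labels
```

Each iteration can only decrease the within-cluster sum of squares, which is the monotonicity argument behind the convergence results above.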
[paper] Web-Scale K-means Clustering, D. Sculley, Google
Mini-Batch K-Means:
a. Mini-Batch K-Means vs. SGD: less stochastic noise, converges to a better solution
b. Mini-Batch K-Means vs. standard batch K-Means: far less computation time
(1) article: http://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf
(2) code: http://sofia-ml.googlecode.com/svn/trunk/sofia-ml/
(3) wiki: http://code.google.com/p/sofia-ml/
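The algorithm in Sculley's paper can be sketched as: each iteration samples a small batch, assigns each sampled point to its nearest center, then nudges that center toward the point with a per-center learning rate 1/count. A pure-Python sketch (toy defaults and names are illustrative, not from the sofia-ml code):

```python
import random

def minibatch_kmeans(points, centers, batch_size=10, iters=100, seed=0):
    """Mini-batch k-means in the style of Sculley (2010): per-center
    counts give a decaying learning rate, so each center tracks the
    running mean of the samples assigned to it."""
    rng = random.Random(seed)
    centers = [list(c) for c in centers]
    counts = [0] * len(centers)
    dim = len(points[0])
    for _ in range(iters):
        batch = [rng.choice(points) for _ in range(batch_size)]
        assigned = [
            min(range(len(centers)),
                key=lambda j: sum((p[d] - centers[j][d]) ** 2 for d in range(dim)))
            for p in batch
        ]
        for p, j in zip(batch, assigned):
            counts[j] += 1
            eta = 1.0 / counts[j]  # per-center learning rate
            for d in range(dim):
                # gradient step: move the center toward the sample
                centers[j][d] = (1 - eta) * centers[j][d] + eta * p[d]
    return centers
```

Because only a batch is touched per iteration, each step is far cheaper than a full batch pass, while the 1/count rate damps the stochastic noise that plain per-example SGD would suffer.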
[paper] Efficient projections onto the L1-Ball for learning in high dimensions: http://machinelearning.org/archive/icml2008/papers/361.pdf
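The projection in Duchi et al. has a simple O(n log n) sort-based form: sort the absolute values in decreasing order, find the threshold theta from the largest prefix where u_j > (sum of the prefix − z)/j, then soft-threshold every entry. A sketch assuming a dense input vector:

```python
def project_l1_ball(v, z=1.0):
    """Euclidean projection of v onto the L1 ball of radius z, using the
    sort-based method described in Duchi et al. (2008)."""
    if sum(abs(x) for x in v) <= z:
        return list(v)  # already inside the ball
    u = sorted((abs(x) for x in v), reverse=True)
    cumsum, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cumsum += uj
        t = (cumsum - z) / j
        if uj - t > 0:
            theta = t  # prefix still feasible; keep extending
        else:
            break
    # soft-threshold each coordinate by theta, keeping signs
    return [(1 if x > 0 else -1) * max(abs(x) - theta, 0.0) for x in v]
```

The result is sparse whenever the input lies well outside the ball, which is what makes this projection useful for L1-regularized learning in high dimensions.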
2. k-medoids
(1) PAM (Partitioning Around Medoids)
(2) CLARA (Clustering LARge Applications)
(3) CLARANS (Clustering Large Applications based upon RANdomized Search)
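PAM's swap idea can be sketched with a naive greedy loop: keep k data points as medoids, and accept any medoid/non-medoid swap that lowers the total distance from points to their nearest medoid (this omits the incremental cost bookkeeping that real PAM uses; CLARA and CLARANS then apply the same idea to samples or randomized swap neighborhoods):

```python
import math

def pam(points, k=2, max_iter=20):
    """Naive PAM-style k-medoids: greedily swap medoids with non-medoids
    while any swap lowers the total point-to-nearest-medoid distance."""
    def d(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    def cost(medoids):
        return sum(min(d(p, points[m]) for m in medoids) for p in points)
    medoids = list(range(k))  # arbitrary initial medoids: first k points
    best = cost(medoids)
    for _ in range(max_iter):
        improved = False
        for i in range(k):
            for j in range(len(points)):
                if j in medoids:
                    continue
                trial = medoids[:i] + [j] + medoids[i + 1:]
                c = cost(trial)
                if c < best:  # swap medoid i for point j
                    medoids, best, improved = trial, c, True
        if not improved:
            break
    labels = [min(medoids, key=lambda m: d(p, points[m])) for p in points]
    return sorted(medoids), labels
```

Because medoids are actual data points, the method is less sensitive to outliers than k-means, at the price of the much more expensive swap search.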
II. Hierarchical methods
Wikipedia: http://en.wikipedia.org/wiki/Hierarchical_clustering
Tutorial:
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/hierarchical.html
http://www.autonlab.org/tutorials/kmeans.html
Blog:
http://blog.pluskid.org/?p=407
http://www.blogjava.net/changedi/archive/2010/03/19/315963.html
Code:
https://github.com/intergret/snippet/blob/master/HAC.py
1. Algorithmic methods: treat the data objects as deterministic and compute clusters from deterministic distances between objects.
(1) Agglomerative methods
a. AGNES(Agglomerative NESting)
(2) Divisive methods
a. DIANA(Divisive ANAlysis)
(3) Multi-phase methods
a. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
Wikipedia: http://en.wikipedia.org/wiki/BIRCH_(data_clustering)
[paper] BIRCH: An Efficient Data Clustering Method for Very Large Databases: http://www.cs.sfu.ca/CourseCentral/459/han/papers/zhang96.pdf
[paper] BIRCH: A New Data Clustering Algorithm and Its Applications
http://www.cs.uvm.edu/~xwu/kdd/Birch-09.ppt
http://www.dei.unipd.it/~capri/DATAMINING/PAPERS/ZhangRL97.pdf.gz
b. Chameleon: multi-phase hierarchical clustering using dynamic modeling
[paper] CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling http://glaros.dtc.umn.edu/gkhome/node/152
[software] http://www.exetersoftware.com/cat/chameleon/cluster_analysis.html
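The agglomerative (AGNES-style) approach above can be sketched with single linkage: start with every point in its own cluster and repeatedly merge the two closest clusters until k remain. A naive O(n^3) pure-Python illustration (real implementations maintain a distance matrix or priority queue instead of rescanning all pairs):

```python
import math

def single_linkage(points, k):
    """Naive agglomerative clustering with single linkage: the distance
    between two clusters is the minimum pairwise point distance."""
    def d(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    clusters = [[p] for p in points]  # start: one cluster per point
    while len(clusters) > k:
        # find the pair of clusters with the smallest single-linkage distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dij = min(d(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or dij < best[0]:
                    best = (dij, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge the closest pair
    return clusters
```

Swapping `min` for `max` or for the distance between cluster means gives complete-linkage and centroid-linkage variants of the same loop; divisive methods such as DIANA run the hierarchy in the opposite direction, splitting one cluster at a time.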
2. Probabilistic methods: capture clusters with probabilistic models and measure cluster quality by how well the model fits the data.
(1) Probabilistic hierarchical clustering
3. Bayesian methods: compute a distribution over possible clusterings, i.e. return a set of clustering structures over the given data together with their probabilities, rather than a single deterministic clustering.
III. Density-based methods
IV. Grid-based methods
References:
1. Data Mining: Concepts and Techniques, Chapter 10, Sections 1-2