
Clustering Algorithms

Posted on 2012-10-18 10:51 by shirley_cst

Introduction:

  Wikipedia: http://en.wikipedia.org/wiki/Cluster_analysis

  Review:

    1. Data clustering: a review http://eprints.iisc.ernet.in/273/01/p264-jain.pdf

    2. Subspace clustering for high dimensional data: a review

        http://scholar.google.com/citations?view_op=view_citation&hl=en&user=PKiPYEwAAAAJ&citation_for_view=PKiPYEwAAAAJ:u5HHmVD_uO8C

    3. Algorithms for clustering data http://www.cse.msu.edu/~jain/Clustering_Jain_Dubes.pdf

  Tutorial: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/

  Blog: http://blog.pluskid.org/?page_id=78

Algorithms:

I. Partitioning methods

  1. k-means / k-modes

    [Intro]: http://en.wikipedia.org/wiki/K-means_clustering

    [paper] Top 10 Algorithms in Data Mining, Section 2: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.144.5575

      (1) Overview of the k-means algorithm and the current state of research
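As a concrete reference point for the survey material above, here is a minimal pure-Python sketch of the standard Lloyd iteration (alternating assignment and centroid update); it is illustrative only, not code from any of the papers listed.

```python
import math
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm on a list of coordinate tuples."""
    rnd = random.Random(seed)
    centers = rnd.sample(points, k)          # initialize with k distinct data points
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Update step: each center moves to the mean of its assigned points.
        new_centers = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                new_centers.append(tuple(sum(axis) / len(members)
                                         for axis in zip(*members)))
            else:
                new_centers.append(centers[j])   # keep an empty cluster's center
        if new_centers == centers:               # converged: assignments are stable
            break
        centers = new_centers
    return centers, labels
```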

    [paper] Using the Triangle Inequality to Accelerate K-Means: http://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf

      (1) Optimizes the k-means procedure using the triangle inequality

      (2) Can handle clustering of vectors with 1,000 dimensions
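The pruning rule at the heart of the paper: by the triangle inequality, d(x, c2) >= d(c1, c2) - d(x, c1), so if d(c1, c2) >= 2*d(x, c1) the candidate center c2 can never be closer than c1, and the distance d(x, c2) need not be computed at all. A tiny check with made-up points:

```python
import math

def can_skip(x, c1, c2):
    """Elkan-style pruning: if d(c1, c2) >= 2 * d(x, c1), the triangle
    inequality guarantees d(x, c2) >= d(x, c1), so c2 cannot be closer."""
    return math.dist(c1, c2) >= 2 * math.dist(x, c1)

x, c1, c2 = (1.0, 0.0), (0.0, 0.0), (5.0, 0.0)
# d(c1, c2) = 5 >= 2 * d(x, c1) = 2, so the distance to c2 can be skipped.
assert can_skip(x, c1, c2)
# And the guarantee indeed holds: d(x, c2) = 4 >= d(x, c1) = 1.
assert math.dist(x, c2) >= math.dist(x, c1)
```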

    [paper] Convergence properties of the k means algorithm: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.50.3258

      (1) Analyzes k-means from the gradient-descent perspective

      (2) Analyzes k-means from the EM perspective

      (3) Uses Newton's method to analyze the convergence of k-means

    [paper] Web-Scale K-means Clustering, D. Sculley, Google

      Mini-Batch K-Means:

        a. Mini-Batch K-Means vs. SGD: less stochastic noise and a better solution

        b. Mini-Batch K-Means vs. Standard Batch K-Means: far less time-consuming

      (1) article: http://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf  

      (2) code: http://sofia-ml.googlecode.com/svn/trunk/sofia-ml/

      (3) wiki: http://code.google.com/p/sofia-ml/
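A sketch of the mini-batch update rule described in the paper: each iteration draws a small random batch, caches nearest-center assignments, then moves each center toward its batch points with a per-center learning rate 1/count. Batch size and iteration count below are arbitrary illustrative choices.

```python
import math
import random

def mini_batch_kmeans(points, k, batch_size=32, n_iter=200, seed=0):
    """Sketch of mini-batch k-means: update centers from small random
    samples, with a per-center learning rate of 1 / (assignment count)."""
    rnd = random.Random(seed)
    centers = [list(p) for p in rnd.sample(points, k)]
    counts = [0] * k
    for _ in range(n_iter):
        batch = [rnd.choice(points) for _ in range(batch_size)]
        # Cache the nearest-center assignment for the whole batch first.
        nearest = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                   for p in batch]
        for p, j in zip(batch, nearest):
            counts[j] += 1
            eta = 1.0 / counts[j]                # per-center learning rate
            # Gradient step: move center j toward p by a shrinking step eta.
            centers[j] = [(1 - eta) * c + eta * x
                          for c, x in zip(centers[j], p)]
    return [tuple(c) for c in centers]
```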

    [paper] Efficient projections onto the L1-Ball for learning in high dimensions: http://machinelearning.org/archive/icml2008/papers/361.pdf

  2. k-medoids

    (1) PAM (Partitioning Around Medoids)

    (2) CLARA (Clustering LARge Applications)

    (3) CLARANS (Clustering Large Applications based upon RANdomized Search)
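PAM can be sketched as a greedy swap search: starting from random medoids, repeatedly replace a medoid with a non-medoid whenever that lowers the total point-to-nearest-medoid cost. The full swap scan is expensive (every medoid against every non-medoid per pass), which is exactly why CLARA and CLARANS resort to sampling. A minimal illustrative version:

```python
import math
import random

def pam(points, k, seed=0):
    """PAM sketch: greedy medoid swaps until no swap lowers the total cost."""
    rnd = random.Random(seed)
    best = rnd.sample(range(len(points)), k)     # medoid indices

    def cost(meds):
        # Total distance of every point to its nearest medoid.
        return sum(min(math.dist(p, points[m]) for m in meds) for p in points)

    improved = True
    while improved:
        improved = False
        for i in range(k):
            for h in range(len(points)):
                if h in best:
                    continue
                trial = best[:i] + [h] + best[i + 1:]
                if cost(trial) < cost(best):     # accept any improving swap
                    best, improved = trial, True
    return [points[m] for m in best]
```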

II. Hierarchical methods

  Wikipedia: http://en.wikipedia.org/wiki/Hierarchical_clustering

  Tutorial:

    http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/hierarchical.html

    http://www.autonlab.org/tutorials/kmeans.html

  Blog:

    http://blog.pluskid.org/?p=407

    http://www.blogjava.net/changedi/archive/2010/03/19/315963.html

  Code:

    https://github.com/intergret/snippet/blob/master/HAC.py

  1. Algorithmic methods: treat data objects as deterministic and compute clusters from the deterministic distances between objects.

    (1) Agglomerative methods

      a. AGNES (Agglomerative NESting)

    (2) Divisive methods

      a. DIANA (Divisive ANAlysis)
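The agglomerative (AGNES-style) procedure is easy to sketch: start with every point as its own cluster and repeatedly merge the closest pair under some linkage until the desired number of clusters remains. The version below uses single linkage by default; it is an illustration, not the AGNES reference implementation.

```python
import math

def agnes(points, n_clusters, linkage=min):
    """Agglomerative clustering sketch: merge the closest pair of clusters
    (single linkage by default) until n_clusters remain."""
    clusters = [[p] for p in points]             # every point starts alone
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(math.dist(p, q)
                            for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)           # merge the closest pair
    return clusters
```

Passing `linkage=max` instead gives complete linkage with no other change.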

    (3) Multi-phase methods

      a. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)

        Wikipedia: http://en.wikipedia.org/wiki/BIRCH_(data_clustering)

        [paper] BIRCH: An Efficient Data Clustering Method for Very Large Databases: http://www.cs.sfu.ca/CourseCentral/459/han/papers/zhang96.pdf

        [paper] BIRCH: A New Data Clustering Algorithm and Its Applications

            http://www.cs.uvm.edu/~xwu/kdd/Birch-09.ppt

            http://www.dei.unipd.it/~capri/DATAMINING/PAPERS/ZhangRL97.pdf.gz
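The core data structure in BIRCH is the clustering feature, the triple CF = (N, LS, SS); CFs of disjoint point sets merge by simple addition, and statistics such as the centroid are recoverable from the summary alone. A minimal sketch:

```python
class CF:
    """BIRCH clustering feature (CF): the triple (N, LS, SS) that compactly
    summarizes a set of points and supports merging by simple addition."""
    def __init__(self, point):
        self.n = 1                                  # number of points, N
        self.ls = list(point)                       # linear sum, LS (per axis)
        self.ss = sum(x * x for x in point)         # sum of squares, SS

    def merge(self, other):
        """CF additivity: CF1 + CF2 summarizes the union of the two sets."""
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss += other.ss

    def centroid(self):
        """The centroid is recoverable from the summary alone: LS / N."""
        return [x / self.n for x in self.ls]
```

This additivity is what lets BIRCH build its CF-tree in a single scan: points are absorbed into leaf CFs without ever storing the raw data.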

      b. Chameleon: multi-phase hierarchical clustering using dynamic modeling

        [paper] CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling http://glaros.dtc.umn.edu/gkhome/node/152

        [software] http://www.exetersoftware.com/cat/chameleon/cluster_analysis.html

  2. Probabilistic methods: capture clusters with probabilistic models and measure cluster quality by the model's goodness of fit.

    (1) Probabilistic hierarchical clustering

  3. Bayesian methods: compute a distribution over possible clusterings, i.e., return a set of clustering structures on the given data together with their probabilities, rather than a single deterministic clustering of the dataset.

III. Density-based methods

IV. Grid-based methods


References:

1. Data Mining: Concepts and Techniques, Chapter 10, Sections 1-2