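Text Mining: Text Clustering (OPTICS)

Liu Yong  Email: lyssym@sina.com

  DBSCAN is sensitive to its input parameters, the neighborhood radius E and the threshold M, which makes parameter tuning tedious. This article therefore studies another density-based clustering algorithm, OPTICS (Ordering Points To Identify the Clustering Structure). OPTICS is an improvement on DBSCAN and, compared with DBSCAN, is far less sensitive to its input parameters. In addition, OPTICS does not explicitly produce a clustering. It only orders the objects in the data set and yields an ordered list that contains enough information to extract clusters. In practice, this ordered list can be used to analyze both the distribution of the data and the relationships within it.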

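Basic Concepts

  Because OPTICS is an improvement on DBSCAN, the two algorithms share many concepts, such as core object, (directly) density-reachable, and density-connected; see DBSCAN for details. Building on those, this article introduces two more core concepts.

  (1) Core distance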

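  In a data set D, for given parameters E and M, the core distance of an object p is the smallest neighborhood radius that makes p a core object. Formally:

  \[
  cd(p) =
  \begin{cases}
  \text{UNDEFINED}, & |N_E(p)| < M \\
  d\bigl(p, N_E^{M}(p)\bigr), & |N_E(p)| \ge M
  \end{cases}
  \]

  where N_E(p) is the E-neighborhood of p, N_E^{M}(p) is the M-th nearest object to p within that neighborhood, and d is the distance measure.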

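  Informally, for given parameters E and M, the core distance of p is the M-th smallest of the distances from p to its neighbors (or the M-th largest when the measure is a similarity, where a larger value means closer). The measure can be Euclidean distance, cosine similarity, a similarity computed over Word2Vec vectors, and so on.

  (2) Reachability distance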
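  The informal rule above can be sketched in a few lines. This is only an illustration, assuming a true distance (smaller means closer) rather than a similarity. The class `CoreDistanceSketch`, the helper `coreDistance`, the sentinel value -1 for UNDEFINED, and the sample distances are all assumptions made for this example and do not come from the source code of this article. The sketch also assumes that p counts as its own neighbor at distance 0.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class CoreDistanceSketch {
    // Sentinel for "p is not a core object" (an assumption for this sketch).
    static final double UNDEFINED = -1.0;

    // distances: distance from p to every object in D, including p itself (0.0).
    static double coreDistance(List<Double> distances, double eps, int minPts) {
        // Keep only the objects inside the eps-neighborhood of p.
        List<Double> inEps = new ArrayList<>();
        for (double d : distances) {
            if (d <= eps) {
                inEps.add(d);
            }
        }
        if (inEps.size() < minPts) {
            return UNDEFINED; // fewer than minPts neighbors: p is not a core object
        }
        Collections.sort(inEps);      // ascending: nearest first
        return inEps.get(minPts - 1); // the minPts-th smallest distance
    }

    public static void main(String[] args) {
        List<Double> d = Arrays.asList(0.0, 0.5, 1.0, 1.5, 4.0);
        System.out.println(coreDistance(d, 2.0, 3)); // prints 1.0
        System.out.println(coreDistance(d, 0.6, 3)); // prints -1.0 (not a core object)
    }
}
```

  With a similarity measure the same idea applies with the comparisons reversed: keep the objects whose similarity is at least E, sort in descending order, and take the M-th largest value.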

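  In a data set D, for given parameters E and M, the reachability distance of an object o with respect to an object p is the larger of the core distance of p and the distance between p and o. Formally:

  \[
  rd(o, p) =
  \begin{cases}
  \text{UNDEFINED}, & |N_E(p)| < M \\
  \max\bigl(cd(p),\, d(p, o)\bigr), & |N_E(p)| \ge M
  \end{cases}
  \]

  where N_E(p) is the E-neighborhood of p, cd(p) is the core distance of p, and d(p, o) is the distance between p and o.

  Pseudocode (adapted from Wikipedia):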
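  A small worked example may make the definition concrete. The sketch below assumes a true distance (smaller means closer) and that the core distance of p has already been computed. The class `ReachabilitySketch`, the helper `reachabilityDistance`, and the sentinel -1 for UNDEFINED are hypothetical names chosen for this illustration.

```java
class ReachabilitySketch {
    // Sentinel for "undefined" (an assumption for this sketch).
    static final double UNDEFINED = -1.0;

    // coreDistOfP: core distance of p, or UNDEFINED if p is not a core object.
    // distPO: distance between p and o.
    static double reachabilityDistance(double coreDistOfP, double distPO) {
        if (coreDistOfP == UNDEFINED) {
            return UNDEFINED; // o cannot be reached from a non-core object
        }
        // o can never be "closer" to p than p's own core distance.
        return Math.max(coreDistOfP, distPO);
    }

    public static void main(String[] args) {
        System.out.println(reachabilityDistance(1.0, 0.3));       // prints 1.0
        System.out.println(reachabilityDistance(1.0, 1.7));       // prints 1.7
        System.out.println(reachabilityDistance(UNDEFINED, 0.3)); // prints -1.0
    }
}
```

  The first call shows the smoothing effect of the definition: every object that lies inside the dense core of p receives the same reachability distance, namely the core distance of p.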

OPTICS(DB, eps, MinPts)
    for each point p of DB
       p.reachability-distance = UNDEFINED
    for each unprocessed point p of DB
       N = getNeighbors(p, eps)
       mark p as processed
       output p to the ordered list
       if (core-distance(p, eps, MinPts) != UNDEFINED)
          Seeds = empty priority queue
          update(N, p, Seeds, eps, MinPts)
          for each next q in Seeds
             N' = getNeighbors(q, eps)
             mark q as processed
             output q to the ordered list
             if (core-distance(q, eps, MinPts) != UNDEFINED)
                update(N', q, Seeds, eps, MinPts)


update(N, p, Seeds, eps, MinPts)
    coredist = core-distance(p, eps, MinPts)
    for each o in N
       if (o is not processed)
          new-reach-dist = max(coredist, dist(p,o))
          if (o.reachability-distance == UNDEFINED) // o is not in Seeds
              o.reachability-distance = new-reach-dist
              Seeds.insert(o, new-reach-dist)
          else               // o in Seeds, check for improvement
              if (new-reach-dist < o.reachability-distance)
                 o.reachability-distance = new-reach-dist
                 Seeds.move-up(o, new-reach-dist)
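
  To see the pseudocode run end to end, here is a minimal sketch on one-dimensional points with the absolute difference as the distance. It is a toy illustration under stated assumptions, not the implementation used in this article: the class `OpticsSketch`, its field and method names, the sample data, and the use of positive infinity as UNDEFINED are all choices made for this example. Because `java.util.PriorityQueue` has no move-up operation, the sketch removes and re-inserts an element when its reachability distance improves.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class OpticsSketch {
    // UNDEFINED is modeled as +infinity, so "new < old" also covers "not yet in Seeds".
    static final double UNDEFINED = Double.POSITIVE_INFINITY;

    final double[] xs;          // the data set: points on a line
    final double eps;
    final int minPts;
    final double[] reach;       // reachability distance per point index
    final boolean[] processed;
    final List<Integer> order = new ArrayList<>(); // the ordered list (point indices)

    OpticsSketch(double[] xs, double eps, int minPts) {
        this.xs = xs;
        this.eps = eps;
        this.minPts = minPts;
        this.reach = new double[xs.length];
        Arrays.fill(reach, UNDEFINED);
        this.processed = new boolean[xs.length];
    }

    double dist(int a, int b) {
        return Math.abs(xs[a] - xs[b]);
    }

    // All points within eps of p, including p itself.
    List<Integer> neighbors(int p) {
        List<Integer> n = new ArrayList<>();
        for (int q = 0; q < xs.length; q++) {
            if (dist(p, q) <= eps) {
                n.add(q);
            }
        }
        return n;
    }

    // minPts-th smallest distance inside the neighborhood, or UNDEFINED.
    double coreDistance(int p, List<Integer> n) {
        if (n.size() < minPts) {
            return UNDEFINED;
        }
        double[] d = new double[n.size()];
        for (int i = 0; i < d.length; i++) {
            d[i] = dist(p, n.get(i));
        }
        Arrays.sort(d);
        return d[minPts - 1];
    }

    void update(List<Integer> n, int p, double coreDist, PriorityQueue<Integer> seeds) {
        for (int o : n) {
            if (processed[o]) {
                continue;
            }
            double newReach = Math.max(coreDist, dist(p, o));
            if (newReach < reach[o]) {
                seeds.remove(Integer.valueOf(o)); // no-op if o is not queued yet
                reach[o] = newReach;              // change the key only while o is outside the queue
                seeds.add(o);
            }
        }
    }

    // Processes one point: output it, then push or improve its unprocessed neighbors.
    void expand(int p, PriorityQueue<Integer> seeds) {
        List<Integer> n = neighbors(p);
        processed[p] = true;
        order.add(p);
        double cd = coreDistance(p, n);
        if (cd != UNDEFINED) {
            update(n, p, cd, seeds);
        }
    }

    List<Integer> run() {
        PriorityQueue<Integer> seeds =
                new PriorityQueue<>(Comparator.comparingDouble((Integer i) -> reach[i]));
        for (int p = 0; p < xs.length; p++) {
            if (!processed[p]) {
                expand(p, seeds);
                while (!seeds.isEmpty()) {
                    expand(seeds.poll(), seeds);
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        double[] xs = {1.0, 5.0, 1.1, 9.0, 5.2, 1.3, 5.3};
        OpticsSketch optics = new OpticsSketch(xs, 0.5, 2);
        System.out.println(optics.run()); // prints [0, 2, 5, 1, 4, 6, 3]
    }
}
```

  In the resulting order the three points near 1, the three points near 5, and the isolated point 9.0 appear as consecutive runs, and each run starts with a point whose reachability distance is still undefined. That is the structure a later step can use to extract clusters.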

  Source code:

// DataPoint.java
import java.util.List;

import com.gta.cosine.ElementDict;

/**
 * One text to be clustered, stored as its list of terms, together with the
 * bookkeeping values OPTICS needs. The value -1 means "undefined".
 */
public class DataPoint {
    private List<ElementDict> terms;
    private double initDistance;       // similarity to the point whose neighborhood was queried last
    private double coreDistance;
    private double reachableDistance;
    private boolean isVisited;

    public DataPoint(List<ElementDict> terms) {
        this.terms = terms;
        this.initDistance = -1;
        this.coreDistance = -1;
        this.reachableDistance = -1;
        this.isVisited = false;
    }

    public double getCoreDistance() {
        return coreDistance;
    }

    public void setCoreDistance(double coreDistance) {
        this.coreDistance = coreDistance;
    }

    public double getReachableDistance() {
        return reachableDistance;
    }

    public void setReachableDistance(double reachableDistance) {
        this.reachableDistance = reachableDistance;
    }

    public boolean getIsVisitLabel() {
        return isVisited;
    }

    public void setIsVisitLabel(boolean isVisited) {
        this.isVisited = isVisited;
    }

    public double getInitDistance() {
        return initDistance;
    }

    public void setInitDistance(double initDistance) {
        this.initDistance = initDistance;
    }

    public List<ElementDict> getAllElements() {
        return terms;
    }

    public ElementDict getElement(int index) {
        return terms.get(index);
    }

    /**
     * Term-by-term comparison of two points. Note that this overloads rather
     * than overrides Object.equals(Object), so collection methods such as
     * List.remove and Queue.remove still compare points by identity. That is
     * what the clustering code relies on: two texts with identical terms
     * remain two distinct points.
     */
    public boolean equals(DataPoint dp) {
        List<ElementDict> ed1 = getAllElements();
        List<ElementDict> ed2 = dp.getAllElements();
        int len = ed1.size();
        if (len != ed2.size()) {
            return false;
        }
        for (int i = 0; i < len; i++) {
            if (!ed1.get(i).equals(ed2.get(i))) {
                return false;
            }
        }
        return true;
    }
}


// OPTICS.java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Queue;

import com.gta.cosine.ElementDict;
import com.gta.cosine.TextCosine;

/**
 * OPTICS over texts, measured with cosine similarity. A larger similarity
 * means "closer", so eps is a minimum similarity, sorting is descending, and
 * max/min are swapped relative to the distance-based pseudocode.
 */
public class OPTICS {
    private static final double UNDEFINED = -1;

    // Highest reachability (similarity) first. Double.compare returns 0 for
    // ties, which the Comparator contract requires.
    private static final Comparator<DataPoint> BY_REACHABILITY_DESC = new Comparator<DataPoint>() {
        @Override
        public int compare(DataPoint dp1, DataPoint dp2) {
            return Double.compare(dp2.getReachableDistance(), dp1.getReachableDistance());
        }
    };

    private final double eps;      // minimum cosine similarity between neighbors
    private final int minPts;
    private final TextCosine cosine;
    private final List<DataPoint> dataPoints;
    private final List<DataPoint> orderList;

    public OPTICS(double eps, int minPts) {
        this.eps = eps;
        this.minPts = minPts;
        this.cosine = new TextCosine();
        this.dataPoints = new ArrayList<DataPoint>();
        this.orderList = new ArrayList<DataPoint>();
    }

    public void addPoint(String s) {
        List<ElementDict> ed = cosine.tokenizer(s);
        dataPoints.add(new DataPoint(ed));
    }

    /**
     * Core distance of the point whose neighborhood this is: the minPts-th
     * largest similarity, or UNDEFINED if there are fewer than minPts
     * neighbors. Relies on the initDistance values set by getNeighbors.
     */
    public double coreDistance(List<DataPoint> neighbors) {
        if (neighbors.size() < minPts) {
            return UNDEFINED;
        }
        Collections.sort(neighbors, new Comparator<DataPoint>() {
            @Override
            public int compare(DataPoint dp1, DataPoint dp2) {
                return Double.compare(dp2.getInitDistance(), dp1.getInitDistance());
            }
        });
        return neighbors.get(minPts - 1).getInitDistance();
    }

    public double cosineDistance(DataPoint p, DataPoint q) {
        return cosine.analysisText(p.getAllElements(), q.getAllElements());
    }

    /** All points whose similarity to p is at least eps (p itself included). */
    public List<DataPoint> getNeighbors(DataPoint p, List<DataPoint> points) {
        List<DataPoint> neighbors = new ArrayList<DataPoint>();
        for (DataPoint q : points) {
            double similarity = cosineDistance(p, q);
            if (similarity >= eps) {
                q.setInitDistance(similarity);
                neighbors.add(q);
            }
        }
        return neighbors;
    }

    public void cluster(List<DataPoint> points) {
        for (DataPoint point : points) {
            if (point.getIsVisitLabel()) {
                continue;
            }
            List<DataPoint> neighbors = getNeighbors(point, points);
            point.setIsVisitLabel(true);
            orderList.add(point);
            double cd = coreDistance(neighbors);
            if (cd == UNDEFINED) {
                continue;
            }
            point.setCoreDistance(cd);
            Queue<DataPoint> seeds = new PriorityQueue<DataPoint>(16, BY_REACHABILITY_DESC);
            update(point, neighbors, seeds, orderList);
            while (!seeds.isEmpty()) {
                DataPoint q = seeds.poll();
                List<DataPoint> newNeighbors = getNeighbors(q, points);
                q.setIsVisitLabel(true);
                orderList.add(q);
                double qCore = coreDistance(newNeighbors);
                if (qCore != UNDEFINED) {
                    q.setCoreDistance(qCore); // the original never stored q's core distance
                    update(q, newNeighbors, seeds, orderList);
                }
            }
        }
    }

    public void update(DataPoint p, List<DataPoint> neighbors, Queue<DataPoint> seeds, List<DataPoint> seqList) {
        double coreDistance = coreDistance(neighbors);
        for (DataPoint point : neighbors) {
            double cosineDistance = cosineDistance(p, point);
            // max(core distance, distance) becomes min(...) for a similarity.
            double reachableDistance = Math.min(coreDistance, cosineDistance);
            if (!point.getIsVisitLabel()) {
                if (point.getReachableDistance() == UNDEFINED) {
                    point.setReachableDistance(reachableDistance);
                    seeds.add(point);
                } else if (reachableDistance > point.getReachableDistance()) {
                    // Improvement means a higher similarity. Remove before
                    // changing the key so the queue reorders correctly.
                    if (seeds.remove(point)) {
                        point.setReachableDistance(reachableDistance);
                        seeds.add(point);
                    }
                }
            } else if (point.getReachableDistance() == UNDEFINED) {
                // Author's modification to the pseudocode: a visited point whose
                // reachability is still undefined gets a value and is re-queued.
                point.setReachableDistance(reachableDistance);
                if (seqList.remove(point)) {
                    seeds.add(point);
                }
            }
        }
    }

    public void showCluster() {
        for (DataPoint point : orderList) {
            for (ElementDict e : point.getAllElements()) {
                System.out.print(e.getTerm() + "  ");
            }
            System.out.println();
            System.out.println("core:  " + point.getCoreDistance());
            System.out.println("reach: " + point.getReachableDistance());
            System.out.println("***************************************");
        }
    }

    public void analysis() {
        cluster(dataPoints);
        showCluster();
    }

    /** Position of the first point with the same terms as o, or -1 if none. */
    public int IndexOfList(DataPoint o, Queue<DataPoint> points) {
        int index = 0;
        for (DataPoint p : points) {
            if (o.equals(p)) {
                return index;
            }
            index++;
        }
        return -1;
    }
}

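  This article uses cosine similarity as the distance measure; for details, see the post 文本挖掘之文本相似度判定 (Text Mining: Text Similarity) in this series. In addition, analysis showed that for some objects that have already been visited, for example a border object, the core distance stays at its initial value. If such objects are handled strictly as the pseudocode prescribes, the result differs somewhat from that of DBSCAN. The author therefore made a small modification to OPTICS so that the reachability distance of such objects can still be updated and the objects are added to the list. As a result, the overall output is consistent with that of DBSCAN. The author believes this modification is effective and to some extent necessary; if you know a better solution, please get in touch.

  Because OPTICS produces an ordered list of objects, which is valuable for subsequent data analysis and mining, the algorithm can serve as a data preprocessing step. However, because it has to maintain a priority queue, its efficiency suffers somewhat.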
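
  For readers without the earlier post at hand, the following is a minimal sketch of cosine similarity over term-frequency vectors and of one common way to turn it into a distance, namely 1 minus the similarity. It does not reproduce the `TextCosine` class used in this article, whose internals are not shown here. The class `CosineSketch`, the whitespace tokenizer, and all method names are assumptions made for this illustration.

```java
import java.util.HashMap;
import java.util.Map;

class CosineSketch {
    // Naive whitespace tokenizer (an assumption; real text needs a proper segmenter).
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : text.trim().split("\\s+")) {
            if (!term.isEmpty()) {
                tf.merge(term, 1, Integer::sum);
            }
        }
        return tf;
    }

    // cos(a, b) = (a . b) / (|a| * |b|), in [0, 1] for non-negative term counts.
    static double cosineSimilarity(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        double normA = 0;
        double normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        if (normA == 0 || normB == 0) {
            return 0; // an empty text is treated as similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Larger similarity means closer, so 1 - similarity behaves like a distance.
    static double cosineDistance(Map<String, Integer> a, Map<String, Integer> b) {
        return 1.0 - cosineSimilarity(a, b);
    }

    public static void main(String[] args) {
        Map<String, Integer> a = termFrequencies("text mining cluster");
        Map<String, Integer> b = termFrequencies("text mining similarity");
        System.out.println(cosineSimilarity(a, b)); // about 0.667: two of three terms shared
        System.out.println(cosineDistance(a, b));   // about 0.333
    }
}
```

  Working with the similarity directly, as the code in this article does, is equally valid as long as every comparison is reversed consistently: the neighborhood test becomes "at least E", the core distance is the M-th largest value, and the priority queue returns the largest value first.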
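
  As one example of such follow-up analysis, a DBSCAN-like clustering can be read off the ordered list with a single pass and a threshold no larger than the E used to build the ordering. The sketch below assumes true distances (smaller means closer) and models UNDEFINED as positive infinity. The class `ExtractClustersSketch`, the method `extract`, the parameter `epsPrime`, and the sample values are hypothetical and serve only as an illustration.

```java
import java.util.Arrays;

class ExtractClustersSketch {
    static final int NOISE = -1;
    static final double UNDEFINED = Double.POSITIVE_INFINITY;

    // reach[i] and core[i] belong to the i-th object of the OPTICS ordered list.
    // Returns a cluster id per object, or NOISE.
    static int[] extract(double[] reach, double[] core, double epsPrime) {
        int[] labels = new int[reach.length];
        int current = NOISE;
        int next = 0;
        for (int i = 0; i < reach.length; i++) {
            if (reach[i] > epsPrime) {
                // Not reachable from the objects before it at this threshold.
                if (core[i] <= epsPrime) {
                    current = next++;   // a core object starts a new cluster
                    labels[i] = current;
                } else {
                    labels[i] = NOISE;
                }
            } else {
                labels[i] = current;    // reachable: stays in the running cluster
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        double[] reach = {UNDEFINED, 0.1, 0.2, UNDEFINED, 0.2, 0.1, UNDEFINED};
        double[] core = {0.1, 0.1, 0.2, 0.2, 0.1, 0.1, UNDEFINED};
        System.out.println(Arrays.toString(extract(reach, core, 0.25)));
        // prints [0, 0, 0, 1, 1, 1, -1]
        System.out.println(Arrays.toString(extract(reach, core, 0.15)));
        // prints [0, 0, -1, -1, 1, 1, -1]
    }
}
```

  The two calls use the same ordering with different thresholds, which is the practical benefit of OPTICS: the expensive ordering is computed once, and clusterings at several density levels can then be extracted cheaply.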

 

 


  Author: 志青云集
  Source: http://www.cnblogs.com/lyssym
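  Copyright of this article is shared by the author and 博客园 (cnblogs). Reposting is welcome, but unless the author agrees otherwise, this notice must be retained and a link to the original article must be placed prominently on the page.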


 

Posted on 2015-11-09 19:42 by 志青云集
