1. Overview of the KNN Algorithm

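  The core idea of kNN is that if most of the k nearest neighbors of a sample in feature space belong to a certain class, then the sample belongs to that class too and shares the characteristics of the samples in that class. When making a classification decision, the method relies only on the classes of the one or few nearest samples to decide the class of the sample being classified.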

2. Introduction to the KNN Algorithm

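   The simplest, most primitive classifier records the class of every training example and classifies a test object only when its attributes exactly match those of some training object. But it is unrealistic to expect every test object to find an exactly matching training object. In addition, a test object may match several training objects at once, so that the same test object is assigned to more than one class. KNN arose to address these problems.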

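     KNN classifies by measuring the distance between feature values. Its idea is: if most of the k most similar samples (that is, the nearest neighbors in feature space) of a sample belong to a certain class, then the sample also belongs to that class. K is usually an integer no greater than 20. In KNN, the neighbors that are selected are all objects that have already been correctly classified. The classification decision depends only on the classes of the one or few nearest samples.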

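     A simple example illustrates this. In the figure below, which class should the green circle be assigned to: red triangle or blue square? If K=3, red triangles make up 2/3 of the nearest neighbors, so the green circle is assigned to the red triangle class. If K=5, blue squares make up 3/5 of the nearest neighbors, so the green circle is assigned to the blue square class.

[Figure: a green circle to be classified, surrounded by red triangles and blue squares]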

This also shows that the result of the KNN algorithm depends heavily on the choice of K.
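
The effect of K in the example above can be reproduced with a few lines of Python. This is only a sketch: the list `neighbors` is a hypothetical stand-in for the figure, ordered from nearest to farthest so that 2 of the 3 nearest are red triangles and 3 of the 5 nearest are blue squares, and `vote` is an illustrative helper name.

```python
from collections import Counter

# Hypothetical neighbor labels, ordered from nearest to farthest.
neighbors = ["red triangle", "red triangle", "blue square",
             "blue square", "blue square"]

def vote(labels, k):
    """Return the most common label among the first k neighbors."""
    return Counter(labels[:k]).most_common(1)[0][0]

print(vote(neighbors, 3))  # red triangle
print(vote(neighbors, 5))  # blue square
```

The same query point gets a different class depending only on K, which is why K is usually tuned rather than fixed in advance.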

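     In KNN, the distance between objects serves as a measure of their dissimilarity, which avoids the exact-matching problem described above. The distance is usually the Euclidean distance or the Manhattan distance. For two samples x and y with n features each:

$$d_{\text{Euclidean}}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

$$d_{\text{Manhattan}}(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$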

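In addition, KNN bases its decision on the dominant class among the k objects rather than on the class of a single object. These two points are the strengths of the KNN algorithm.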
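
As a small illustration of the two distance measures named above, the following sketch computes both for a pair of points. The function names `euclidean` and `manhattan` and the sample points are chosen here for illustration only.

```python
import math

def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0
```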

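   To summarize the idea behind KNN: given a training set whose data and labels are known, take a test sample, compare its features with the corresponding features of the training samples, and find the K training samples most similar to it. The class predicted for the test sample is the one that occurs most often among those K samples. The algorithm is: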

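1) Compute the distance between the test sample and each training sample;

2) Sort the training samples by increasing distance;

3) Select the K points with the smallest distances;

4) Determine how often each class occurs among these K points;

5) Return the most frequent class among these K points as the predicted class of the test sample.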
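
The five steps above can be sketched compactly as follows. This is a minimal illustration, not the script given later in this post: the function name `knn_classify`, the toy points and the labels "A" and "B" are all made up for the example, and Euclidean distance is assumed.

```python
from collections import Counter

def knn_classify(query, train_points, train_labels, k):
    # 1) distance from the query to every training point (Euclidean)
    dists = [sum((a - b) ** 2 for a, b in zip(query, p)) ** 0.5
             for p in train_points]
    # 2) sort the training indices by increasing distance
    order = sorted(range(len(train_points)), key=lambda i: dists[i])
    # 3) keep the k closest
    nearest = order[:k]
    # 4) count how often each class occurs among them
    counts = Counter(train_labels[i] for i in nearest)
    # 5) return the most frequent class
    return counts.most_common(1)[0][0]

# Hypothetical toy data: two points near (1, 1) and two near (0, 0).
points = [(1.0, 1.1), (1.0, 1.0), (0.0, 0.0), (0.0, 0.1)]
labels = ["A", "A", "B", "B"]
print(knn_classify((0.1, 0.2), points, labels, 3))  # B
```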

3. Implementation

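  KNN can be used not only for classification but also for regression. Find the k nearest neighbors of a sample and assign the average of those neighbors' attribute values to the sample to obtain its attribute. A more useful approach is to give neighbors at different distances different weights for their influence on the sample, for example a weight inversely proportional to the distance.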
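
A minimal sketch of distance-weighted kNN regression, as described in the paragraph above, might look like this. The function name `knn_regress`, the small constant `eps` (added to avoid division by zero when a neighbor coincides with the query) and the one-dimensional toy data are assumptions made for the illustration.

```python
def knn_regress(query, train_points, train_values, k, eps=1e-9):
    """Predict a value as the inverse-distance-weighted mean of the k nearest neighbors."""
    # (distance, value) pairs sorted by increasing distance, truncated to k
    nearest = sorted(
        (sum((a - b) ** 2 for a, b in zip(query, p)) ** 0.5, v)
        for p, v in zip(train_points, train_values)
    )[:k]
    # weight inversely proportional to distance
    weights = [1.0 / (d + eps) for d, _ in nearest]
    return sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)

# Hypothetical 1-D data: value = 10 * x
xs = [(1.0,), (2.0,), (4.0,)]
ys = [10.0, 20.0, 40.0]
print(knn_regress((2.5,), xs, ys, 3))  # close to 22, versus a plain mean of about 23.3
```

With these numbers the closest neighbor (x = 2) has three times the weight of the other two, so the prediction is pulled toward 20 rather than sitting at the unweighted average.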

Training data:

40920   8.326976    0.953952    3
14488   7.153469    1.673904    2
26052   1.441871    0.805124    1
75136   13.147394   0.428964    1
38344   1.669788    0.134296    1
72993   10.141740   1.032955    1
35948   6.830792    1.213192    3
42666   13.276369   0.543880    3
67497   8.631577    0.749278    1
35483   12.273169   1.508053    3
50242   3.723498    0.831917    1
63275   8.385879    1.669485    1
5569    4.875435    0.728658    2
51052   4.680098    0.625224    1
77372   15.299570   0.331351    1
43673   1.889461    0.191283    1
61364   7.516754    1.269164    1
69673   14.239195   0.261333    1
15669   0.000000    1.250185    2
28488   10.528555   1.304844    3
6487    3.540265    0.822483    2
37708   2.991551    0.833920    1
22620   5.297865    0.638306    2
28782   6.593803    0.187108    3
19739   2.816760    1.686209    2
36788   12.458258   0.649617    3
5741    0.000000    1.656418    2
28567   9.968648    0.731232    3
6808    1.364838    0.640103    2
41611   0.230453    1.151996    1
36661   11.865402   0.882810    3
43605   0.120460    1.352013    1
15360   8.545204    1.340429    3
63796   5.856649    0.160006    1
10743   9.665618    0.778626    2
70808   9.778763    1.084103    1
72011   4.932976    0.632026    1
5914    2.216246    0.587095    2
14851   14.305636   0.632317    3
33553   12.591889   0.686581    3
44952   3.424649    1.004504    1
17934   0.000000    0.147573    2
27738   8.533823    0.205324    3
29290   9.829528    0.238620    3
42330   11.492186   0.263499    3
36429   3.570968    0.832254    1
39623   1.771228    0.207612    1
32404   3.513921    0.991854    1
27268   4.398172    0.975024    1
5477    4.276823    1.174874    2
14254   5.946014    1.614244    2
68613   13.798970   0.724375    1
41539   10.393591   1.663724    3
7917    3.007577    0.297302    2
21331   1.031938    0.486174    2
8338    4.751212    0.064693    2
5176    3.692269    1.655113    2
18983   10.448091   0.267652    3
68837   10.585786   0.329557    1
13438   1.604501    0.069064    2
48849   3.679497    0.961466    1
12285   3.795146    0.696694    2
7826    2.531885    1.659173    2
5565    9.733340    0.977746    2
10346   6.093067    1.413798    2
1823    7.712960    1.054927    2
9744    11.470364   0.760461    3
16857   2.886529    0.934416    2
39336   10.054373   1.138351    3
65230   9.972470    0.881876    1
2463    2.335785    1.366145    2
27353   11.375155   1.528626    3
16191   0.000000    0.605619    2
12258   4.126787    0.357501    2
42377   6.319522    1.058602    1
25607   8.680527    0.086955    3
77450   14.856391   1.129823    1
58732   2.454285    0.222380    1
46426   7.292202    0.548607    3
32688   8.745137    0.857348    3
64890   8.579001    0.683048    1
8554    2.507302    0.869177    2
28861   11.415476   1.505466    3
42050   4.838540    1.680892    1
32193   10.339507   0.583646    3
64895   6.573742    1.151433    1
2355    6.539397    0.462065    2
0   2.209159    0.723567    2
70406   11.196378   0.836326    1
57399   4.229595    0.128253    1
41732   9.505944    0.005273    3
11429   8.652725    1.348934    3
75270   17.101108   0.490712    1
5459    7.871839    0.717662    2
73520   8.262131    1.361646    1
40279   9.015635    1.658555    3
21540   9.215351    0.806762    3
17694   6.375007    0.033678    2
22329   2.262014    1.022169    1
46570   5.677110    0.709469    1
42403   11.293017   0.207976    3
33654   6.590043    1.353117    1
9171    4.711960    0.194167    2
28122   8.768099    1.108041    3
34095   11.502519   0.545097    3
1774    4.682812    0.578112    2
40131   12.446578   0.300754    3
13994   12.908384   1.657722    3
77064   12.601108   0.974527    1
11210   3.929456    0.025466    2
6122    9.751503    1.182050    3
15341   3.043767    0.888168    2
44373   4.391522    0.807100    1
28454   11.695276   0.679015    3
63771   7.879742    0.154263    1
9217    5.613163    0.933632    2
69076   9.140172    0.851300    1
24489   4.258644    0.206892    1
16871   6.799831    1.221171    2
39776   8.752758    0.484418    3
5901    1.123033    1.180352    2
40987   10.833248   1.585426    3
7479    3.051618    0.026781    2
38768   5.308409    0.030683    3
4933    1.841792    0.028099    2
32311   2.261978    1.605603    1
26501   11.573696   1.061347    3
37433   8.038764    1.083910    3
23503   10.734007   0.103715    3
68607   9.661909    0.350772    1
27742   9.005850    0.548737    3
11303   0.000000    0.539131    2
0   5.757140    1.062373    2
32729   9.164656    1.624565    3
24619   1.318340    1.436243    1
42414   14.075597   0.695934    3
20210   10.107550   1.308398    3
33225   7.960293    1.219760    3
54483   6.317292    0.018209    1
18475   12.664194   0.595653    3
33926   2.906644    0.581657    1
43865   2.388241    0.913938    1
26547   6.024471    0.486215    3
44404   7.226764    1.255329    3
16674   4.183997    1.275290    2
8123    11.850211   1.096981    3
42747   11.661797   1.167935    3
56054   3.574967    0.494666    1
10933   0.000000    0.107475    2
18121   7.937657    0.904799    3
11272   3.365027    1.014085    2
16297   0.000000    0.367491    2
28168   13.860672   1.293270    3
40963   10.306714   1.211594    3
31685   7.228002    0.670670    3
55164   4.508740    1.036192    1
17595   0.366328    0.163652    2
1862    3.299444    0.575152    2
57087   0.573287    0.607915    1
63082   9.183738    0.012280    1
51213   7.842646    1.060636    3
6487    4.750964    0.558240    2
4805    11.438702   1.556334    3
30302   8.243063    1.122768    3
68680   7.949017    0.271865    1
17591   7.875477    0.227085    2
74391   9.569087    0.364856    1
37217   7.750103    0.869094    3
42814   0.000000    1.515293    1
14738   3.396030    0.633977    2
19896   11.916091   0.025294    3
14673   0.460758    0.689586    2
32011   13.087566   0.476002    3
58736   4.589016    1.672600    1
54744   8.397217    1.534103    1
29482   5.562772    1.689388    1
27698   10.905159   0.619091    3
11443   1.311441    1.169887    2
56117   10.647170   0.980141    3
39514   0.000000    0.481918    1
26627   8.503025    0.830861    3
16525   0.436880    1.395314    2
24368   6.127867    1.102179    1
22160   12.112492   0.359680    3
6030    1.264968    1.141582    2
6468    6.067568    1.327047    2
22945   8.010964    1.681648    3
18520   3.791084    0.304072    2
34914   11.773195   1.262621    3
6121    8.339588    1.443357    2
38063   2.563092    1.464013    1
23410   5.954216    0.953782    1
35073   9.288374    0.767318    3
52914   3.976796    1.043109    1
16801   8.585227    1.455708    3
9533    1.271946    0.796506    2
16721   0.000000    0.242778    2
5832    0.000000    0.089749    2
44591   11.521298   0.300860    3
10143   1.139447    0.415373    2
21609   5.699090    1.391892    2
23817   2.449378    1.322560    1
15640   0.000000    1.228380    2
8847    3.168365    0.053993    2
50939   10.428610   1.126257    3
28521   2.943070    1.446816    1
32901   10.441348   0.975283    3
42850   12.478764   1.628726    3
13499   5.856902    0.363883    2
40345   2.476420    0.096075    1
43547   1.826637    0.811457    1
70758   4.324451    0.328235    1
19780   1.376085    1.178359    2
44484   5.342462    0.394527    1
54462   11.835521   0.693301    3
20085   12.423687   1.424264    3
42291   12.161273   0.071131    3
47550   8.148360    1.649194    3
11938   1.531067    1.549756    2
40699   3.200912    0.309679    1
70908   8.862691    0.530506    1
73989   6.370551    0.369350    1
11872   2.468841    0.145060    2
48463   11.054212   0.141508    3
15987   2.037080    0.715243    2
70036   13.364030   0.549972    1
32967   10.249135   0.192735    3
63249   10.464252   1.669767    1
42795   9.424574    0.013725    3
14459   4.458902    0.268444    2
19973   0.000000    0.575976    2
5494    9.686082    1.029808    3
67902   13.649402   1.052618    1
25621   13.181148   0.273014    3
27545   3.877472    0.401600    1
58656   1.413952    0.451380    1
7327    4.248986    1.430249    2
64555   8.779183    0.845947    1
8998    4.156252    0.097109    2
11752   5.580018    0.158401    2
76319   15.040440   1.366898    1
27665   12.793870   1.307323    3
67417   3.254877    0.669546    1
21808   10.725607   0.588588    3
15326   8.256473    0.765891    2
20057   8.033892    1.618562    3
79341   10.702532   0.204792    1
15636   5.062996    1.132555    2
35602   10.772286   0.668721    3
28544   1.892354    0.837028    1
57663   1.019966    0.372320    1
78727   15.546043   0.729742    1
68255   11.638205   0.409125    1
14964   3.427886    0.975616    2
21835   11.246174   1.475586    3
7487    0.000000    0.645045    2
8700    0.000000    1.424017    2
26226   8.242553    0.279069    3
65899   8.700060    0.101807    1
6543    0.812344    0.260334    2
46556   2.448235    1.176829    1
71038   13.230078   0.616147    1
47657   0.236133    0.340840    1
19600   11.155826   0.335131    3
37422   11.029636   0.505769    3
1363    2.901181    1.646633    2
26535   3.924594    1.143120    1
47707   2.524806    1.292848    1
38055   3.527474    1.449158    1
6286    3.384281    0.889268    2
10747   0.000000    1.107592    2
44883   11.898890   0.406441    3
56823   3.529892    1.375844    1
68086   11.442677   0.696919    1
70242   10.308145   0.422722    1
11409   8.540529    0.727373    2
67671   7.156949    1.691682    1
61238   0.720675    0.847574    1
17774   0.229405    1.038603    2
53376   3.399331    0.077501    1
30930   6.157239    0.580133    1
28987   1.239698    0.719989    1
13655   6.036854    0.016548    2
7227    5.258665    0.933722    2
40409   12.393001   1.571281    3
13605   9.627613    0.935842    2
26400   11.130453   0.597610    3
13491   8.842595    0.349768    3
30232   10.690010   1.456595    3
43253   5.714718    1.674780    3
55536   3.052505    1.335804    1
8807    0.000000    0.059025    2
25783   9.945307    1.287952    3
22812   2.719723    1.142148    1
77826   11.154055   1.608486    1
38172   2.687918    0.660836    1
31676   10.037847   0.962245    3
74038   12.404762   1.112080    1
44738   10.237305   0.633422    3
17410   4.745392    0.662520    2
5688    4.639461    1.569431    2
36642   3.149310    0.639669    1
29956   13.406875   1.639194    3
60350   6.068668    0.881241    1
23758   9.477022    0.899002    3
25780   3.897620    0.560201    2
11342   5.463615    1.203677    2
36109   3.369267    1.575043    1
14292   5.234562    0.825954    2
11160   0.000000    0.722170    2
23762   12.979069   0.504068    3
39567   5.376564    0.557476    1
25647   13.527910   1.586732    3
14814   2.196889    0.784587    2
73590   10.691748   0.007509    1
35187   1.659242    0.447066    1
49459   8.369667    0.656697    3
31657   13.157197   0.143248    3
6259    8.199667    0.908508    2
33101   4.441669    0.439381    3
27107   9.846492    0.644523    3
17824   0.019540    0.977949    2
43536   8.253774    0.748700    3
67705   6.038620    1.509646    1
35283   6.091587    1.694641    3
71308   8.986820    1.225165    1
31054   11.508473   1.624296    3
52387   8.807734    0.713922    3
40328   0.000000    0.816676    1
34844   8.889202    1.665414    3
11607   3.178117    0.542752    2
64306   7.013795    0.139909    1
32721   9.605014    0.065254    3
33170   1.230540    1.331674    1
37192   10.412811   0.890803    3
13089   0.000000    0.567161    2
66491   9.699991    0.122011    1
15941   0.000000    0.061191    2
4272    4.455293    0.272135    2
48812   3.020977    1.502803    1
28818   8.099278    0.216317    3
35394   1.157764    1.603217    1
71791   10.105396   0.121067    1
40668   11.230148   0.408603    3
39580   9.070058    0.011379    3
11786   0.566460    0.478837    2
19251   0.000000    0.487300    2
56594   8.956369    1.193484    3
54495   1.523057    0.620528    1
11844   2.749006    0.169855    2
45465   9.235393    0.188350    3
31033   10.555573   0.403927    3
16633   6.956372    1.519308    2
13887   0.636281    1.273984    2
52603   3.574737    0.075163    1
72000   9.032486    1.461809    1
68497   5.958993    0.023012    1
35135   2.435300    1.211744    1
26397   10.539731   1.638248    3
7313    7.646702    0.056513    2
91273   20.919349   0.644571    1
24743   1.424726    0.838447    1
31690   6.748663    0.890223    3
15432   2.289167    0.114881    2
58394   5.548377    0.402238    1
33962   6.057227    0.432666    1
31442   10.828595   0.559955    3
31044   11.318160   0.271094    3
29938   13.265311   0.633903    3
9875    0.000000    1.496715    2
51542   6.517133    0.402519    3
11878   4.934374    1.520028    2
69241   10.151738   0.896433    1
37776   2.425781    1.559467    1
68997   9.778962    1.195498    1
67416   12.219950   0.657677    1
59225   7.394151    0.954434    1
29138   8.518535    0.742546    3
5962    2.798700    0.662632    2
10847   0.637930    0.617373    2
70527   10.750490   0.097415    1
9610    0.625382    0.140969    2
64734   10.027968   0.282787    1
25941   9.817347    0.364197    3
2763    0.646828    1.266069    2
55601   3.347111    0.914294    1
31128   11.816892   0.193798    3
5181    0.000000    1.480198    2
69982   10.945666   0.993219    1
52440   10.244706   0.280539    3
57350   2.579801    1.149172    1
57869   2.630410    0.098869    1
56557   11.746200   1.695517    3
42342   8.104232    1.326277    3
15560   12.409743   0.790295    3
34826   12.167844   1.328086    3
8569    3.198408    0.299287    2
77623   16.055513   0.541052    1
78184   7.138659    0.158481    1
7036    4.831041    0.761419    2
69616   10.082890   1.373611    1
21546   10.066867   0.788470    3
36715   8.129538    0.329913    3
20522   3.012463    1.138108    2
42349   3.720391    0.845974    1
9037    0.773493    1.148256    2
26728   10.962941   1.037324    3
587 0.177621    0.162614    2
48915   3.085853    0.967899    1
9824    8.426781    0.202558    2
4135    1.825927    1.128347    2
9666    2.185155    1.010173    2
59333   7.184595    1.261338    1
36198   0.000000    0.116525    1
34909   8.901752    1.033527    3
47516   2.451497    1.358795    1
55807   3.213631    0.432044    1
14036   3.974739    0.723929    2
42856   9.601306    0.619232    3
64007   8.363897    0.445341    1
59428   6.381484    1.365019    1
13730   0.000000    1.403914    2
41740   9.609836    1.438105    3
63546   9.904741    0.985862    1
30417   7.185807    1.489102    3
69636   5.466703    1.216571    1
64660   0.000000    0.915898    1
14883   4.575443    0.535671    2
7965    3.277076    1.010868    2
68620   10.246623   1.239634    1
8738    2.341735    1.060235    2
7544    3.201046    0.498843    2
6377    6.066013    0.120927    2
36842   8.829379    0.895657    3
81046   15.833048   1.568245    1
67736   13.516711   1.220153    1
32492   0.664284    1.116755    1
39299   6.325139    0.605109    3
77289   8.677499    0.344373    1
33835   8.188005    0.964896    3
71890   9.414263    0.384030    1
32054   9.196547    1.138253    3
38579   10.202968   0.452363    3
55984   2.119439    1.481661    1
72694   13.635078   0.858314    1
42299   0.083443    0.701669    1
26635   9.149096    1.051446    3
8579    1.933803    1.374388    2
37302   14.115544   0.676198    3
22878   8.933736    0.943352    3
4364    2.661254    0.946117    2
4985    0.988432    1.305027    2
37068   2.063741    1.125946    1
41137   2.220590    0.690754    1
67759   6.424849    0.806641    1
11831   1.156153    1.613674    2
34502   3.032720    0.601847    1
4088    3.076828    0.952089    2
15199   0.000000    0.318105    2
17309   7.750480    0.554015    3
42816   10.958135   1.482500    3
43751   10.222018   0.488678    3
58335   2.367988    0.435741    1
75039   7.686054    1.381455    1
42878   11.464879   1.481589    3
42770   11.075735   0.089726    3
8848    3.543989    0.345853    2
31340   8.123889    1.282880    3
41413   4.331769    0.754467    3
12731   0.120865    1.211961    2
22447   6.116109    0.701523    3
33564   7.474534    0.505790    3
48907   8.819454    0.649292    3
8762    6.802144    0.615284    2
46696   12.666325   0.931960    3
36851   8.636180    0.399333    3
67639   11.730991   1.289833    1
171 8.132449    0.039062    2
26674   10.296589   1.496144    3
8739    7.583906    1.005764    2
66668   9.777806    0.496377    1
68732   8.833546    0.513876    1
69995   4.907899    1.518036    1
82008   8.362736    1.285939    1
25054   9.084726    1.606312    3
33085   14.164141   0.560970    3
41379   9.080683    0.989920    3
39417   6.522767    0.038548    3
12556   3.690342    0.462281    2
39432   3.563706    0.242019    1
38010   1.065870    1.141569    1
69306   6.683796    1.456317    1
38000   1.712874    0.243945    1
46321   13.109929   1.280111    3
66293   11.327910   0.780977    1
22730   4.545711    1.233254    1
5952    3.367889    0.468104    2
72308   8.326224    0.567347    1
60338   8.978339    1.442034    1
13301   5.655826    1.582159    2
27884   8.855312    0.570684    3
11188   6.649568    0.544233    2
56796   3.966325    0.850410    1
8571    1.924045    1.664782    2
4914    6.004812    0.280369    2
10784   0.000000    0.375849    2
39296   9.923018    0.092192    3
13113   2.389084    0.119284    2
70204   13.663189   0.133251    1
46813   11.434976   0.321216    3
11697   0.358270    1.292858    2
44183   9.598873    0.223524    3
2225    6.375275    0.608040    2
29066   11.580532   0.458401    3
4245    5.319324    1.598070    2
34379   4.324031    1.603481    1
44441   2.358370    1.273204    1
2022    0.000000    1.182708    2
26866   12.824376   0.890411    3
57070   1.587247    1.456982    1
32932   8.510324    1.520683    3
51967   10.428884   1.187734    3
44432   8.346618    0.042318    3
67066   7.541444    0.809226    1
17262   2.540946    1.583286    2
79728   9.473047    0.692513    1
14259   0.352284    0.474080    2
6122    0.000000    0.589826    2
76879   12.405171   0.567201    1
11426   4.126775    0.871452    2
2493    0.034087    0.335848    2
19910   1.177634    0.075106    2
10939   0.000000    0.479996    2
17716   0.994909    0.611135    2
31390   11.053664   1.180117    3
20375   0.000000    1.679729    2
26309   2.495011    1.459589    1
33484   11.516831   0.001156    3
45944   9.213215    0.797743    3
4249    5.332865    0.109288    2
6089    0.000000    1.689771    2
7513    0.000000    1.126053    2
27862   12.640062   1.690903    3
39038   2.693142    1.317518    1
19218   3.328969    0.268271    2
62911   7.193166    1.117456    1
77758   6.615512    1.521012    1
27940   8.000567    0.835341    3
2194    4.017541    0.512104    2
37072   13.245859   0.927465    3
15585   5.970616    0.813624    2
25577   11.668719   0.886902    3
8777    4.283237    1.272728    2
29016   10.742963   0.971401    3
21910   12.326672   1.592608    3
12916   0.000000    0.344622    2
10976   0.000000    0.922846    2
79065   10.602095   0.573686    1
36759   10.861859   1.155054    3
50011   1.229094    1.638690    1
1155    0.410392    1.313401    2
71600   14.552711   0.616162    1
30817   14.178043   0.616313    3
54559   14.136260   0.362388    1
29764   0.093534    1.207194    1
69100   10.929021   0.403110    1
47324   11.432919   0.825959    3
73199   9.134527    0.586846    1
44461   5.071432    1.421420    1
45617   11.460254   1.541749    3
28221   11.620039   1.103553    3
7091    4.022079    0.207307    2
6110    3.057842    1.631262    2
79016   7.782169    0.404385    1
18289   7.981741    0.929789    3
43679   4.601363    0.268326    1
22075   2.595564    1.115375    1
23535   10.049077   0.391045    3
25301   3.265444    1.572970    2
32256   11.780282   1.511014    3
36951   3.075975    0.286284    1
31290   1.795307    0.194343    1
38953   11.106979   0.202415    3
35257   5.994413    0.800021    1
25847   9.706062    1.012182    3
32680   10.582992   0.836025    3
62018   7.038266    1.458979    1
9074    0.023771    0.015314    2
33004   12.823982   0.676371    3
44588   3.617770    0.493483    1
32565   8.346684    0.253317    3
38563   6.104317    0.099207    1
75668   16.207776   0.584973    1
9069    6.401969    1.691873    2
53395   2.298696    0.559757    1
28631   7.661515    0.055981    3
71036   6.353608    1.645301    1
71142   10.442780   0.335870    1
37653   3.834509    1.346121    1
76839   10.998587   0.584555    1
9916    2.695935    1.512111    2
38889   3.356646    0.324230    1
39075   14.677836   0.793183    3
48071   1.551934    0.130902    1
7275    2.464739    0.223502    2
41804   1.533216    1.007481    1
35665   12.473921   0.162910    3
67956   6.491596    0.032576    1
41892   10.506276   1.510747    3
38844   4.380388    0.748506    1
74197   13.670988   1.687944    1
14201   8.317599    0.390409    2
3908    0.000000    0.556245    2
2459    0.000000    0.290218    2
32027   10.095799   1.188148    3
12870   0.860695    1.482632    2
9880    1.557564    0.711278    2
72784   10.072779   0.756030    1
17521   0.000000    0.431468    2
50283   7.140817    0.883813    3
33536   11.384548   1.438307    3
9452    3.214568    1.083536    2
37457   11.720655   0.301636    3
17724   6.374475    1.475925    3
43869   5.749684    0.198875    3
264 3.871808    0.552602    2
25736   8.336309    0.636238    3
39584   9.710442    1.503735    3
31246   1.532611    1.433898    1
49567   9.785785    0.984614    3
7052    2.633627    1.097866    2
35493   9.238935    0.494701    3
10986   1.205656    1.398803    2
49508   3.124909    1.670121    1
5734    7.935489    1.585044    2
65479   12.746636   1.560352    1
77268   10.732563   0.545321    1
28490   3.977403    0.766103    1
13546   4.194426    0.450663    2
37166   9.610286    0.142912    3
16381   4.797555    1.260455    2
10848   1.615279    0.093002    2
35405   4.614771    1.027105    1
15917   0.000000    1.369726    2
6131    0.608457    0.512220    2
67432   6.558239    0.667579    1
30354   12.315116   0.197068    3
69696   7.014973    1.494616    1
33481   8.822304    1.194177    3
43075   10.086796   0.570455    3
38343   7.241614    1.661627    3
14318   4.602395    1.511768    2
5367    7.434921    0.079792    2
37894   10.467570   1.595418    3
36172   9.948127    0.003663    3
40123   2.478529    1.568987    1
10976   5.938545    0.878540    2
12705   0.000000    0.948004    2
12495   5.559181    1.357926    2
35681   9.776654    0.535966    3
46202   3.092056    0.490906    1
11505   0.000000    1.623311    2
22834   4.459495    0.538867    1
49901   8.334306    1.646600    3
71932   11.226654   0.384686    1
13279   3.904737    1.597294    2
49112   7.038205    1.211329    3
77129   9.836120    1.054340    1
37447   1.990976    0.378081    1
62397   9.005302    0.485385    1
0   1.772510    1.039873    2
15476   0.458674    0.819560    2
40625   10.003919   0.231658    3
36706   0.520807    1.476008    1
28580   10.678214   1.431837    3
25862   4.425992    1.363842    1
63488   12.035355   0.831222    1
33944   10.606732   1.253858    3
30099   1.568653    0.684264    1
13725   2.545434    0.024271    2
36768   10.264062   0.982593    3
64656   9.866276    0.685218    1
14927   0.142704    0.057455    2
43231   9.853270    1.521432    3
66087   6.596604    1.653574    1
19806   2.602287    1.321481    2
41081   10.411776   0.664168    3
10277   7.083449    0.622589    2
7014    2.080068    1.254441    2
17275   0.522844    1.622458    2
31600   10.362000   1.544827    3

Python version:

#!/usr/bin/env python3
import sys
import copy
import getopt

def usage():
  print('''Help Information:
  -h, --help: show help information;
  -r, --train: train file;
  -t, --test: test file;
  -k, --kst: k number;
  ''')

def getclass(nest_label):
  # count how often each label occurs among the k nearest neighbors
  lendic = {}
  for lb in nest_label:
    lendic[lb] = lendic.get(lb, 0) + 1

  # on a tie, return the tied label whose first occurrence is nearest
  # (nest_label is ordered by increasing distance)
  maxlen = max(lendic.values())
  for lb in nest_label:
    if lendic[lb] == maxlen:
      return lb

def classify(testdata,orgdata,maxdata,mindata,label,kst):
  # scale the test sample in place with the training min/max, clamping to [0,1]
  for i in range(len(testdata)):
    if testdata[i] > maxdata[i]:
      testdata[i] = 1.0
    elif testdata[i] < mindata[i]:
      testdata[i] = 0.0
    elif maxdata[i]!=mindata[i]:
      testdata[i] = float(testdata[i] - mindata[i]) / (maxdata[i] - mindata[i])

  # Euclidean distance from the test sample to every training sample
  dist_ls = []
  for i in range(len(orgdata)):
    dist = 0
    for k in range(len(orgdata[i])):
      dist += (orgdata[i][k] - testdata[k])**2
    dist_ls.append((i, dist**0.5))

  # sort by increasing distance and collect the labels of the kst nearest
  dist_ls.sort(key=lambda item: item[1])
  kst = min(kst, len(dist_ls))
  nest_label = [label[dist_ls[i][0]] for i in range(kst)]

  return getclass(nest_label)

def autoNorm(orgdata):
  # per-feature minimum and maximum over the training set
  mindata = copy.deepcopy(orgdata[0])
  maxdata = copy.deepcopy(orgdata[0])
  for i in range(1,len(orgdata)):
    for k in range(len(mindata)):
      if orgdata[i][k] > maxdata[k]:
        maxdata[k] = orgdata[i][k]
      if orgdata[i][k] < mindata[k]:
        mindata[k] = orgdata[i][k]

  # min-max scale the training data in place; constant features are left as-is
  for i in range(len(orgdata)):
    for k in range(len(orgdata[i])):
      if maxdata[k]!=mindata[k]:
        orgdata[i][k] = float(orgdata[i][k] - mindata[k]) / (maxdata[k] - mindata[k])
  return mindata,maxdata

def loadfile(path):
  # read tab-separated lines of three float features and an int label;
  # lines that do not parse are skipped
  data = []
  label = []
  with open(path,'r') as fin:
    for line in fin:
      ts = line.strip().split('\t')
      if len(ts)==4:
        try:
          feats = [float(ts[0].strip()), float(ts[1].strip()), float(ts[2].strip())]
          lb = int(ts[3].strip())
        except ValueError:
          continue
        data.append(feats)
        label.append(lb)
  return data,label

if __name__=="__main__":
  #set parameter
  try:
    opts, args = getopt.getopt(sys.argv[1:], "hr:t:k:", ["help", "train=","test=","kst="])
  except getopt.GetoptError as err:
    print(str(err))
    usage()
    sys.exit(1)

  sys.stderr.write("\ntrain.py : a python script for kNN classification.\n")
  sys.stderr.write("Copyright 2016 sxron, search, Sogou. \n")
  sys.stderr.write("Email: shixiang08abc@gmail.com \n\n")

  train = ''
  test = ''
  kst = 10
  for i, f in opts:
    if i in ("-h", "--help"):
      usage()
      sys.exit(1)
    elif i in ("-r", "--train"):
      train = f
    elif i in ("-t", "--test"):
      test = f
    elif i in ("-k", "--kst"):
      kst = int(f)
    else:
      assert False, "unknown option"

  print("start train parameter \ttrain:%s\ttest:%s\tkst:%d" % (train,test,kst))

  #read train file and test file
  orgdata,label = loadfile(train)
  testdata,rlabel = loadfile(test)

  mindata,maxdata = autoNorm(orgdata)

  errnum = 0
  for i in range(len(testdata)):
    result_label = classify(testdata[i],orgdata,maxdata,mindata,label,kst)
    if result_label!=rlabel[i]:
      print("index:%d\tresult_label:%d\trlabel:%d" % (i,result_label,rlabel[i]))
      errnum += 1
  # guard against an empty test file
  errratio = float(errnum)/len(testdata) if testdata else 0.0
  print("errnum:%d\terrratio:%f" % (errnum,errratio))

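PS: What the data means: a young woman looking for a boyfriend collected data on men from a dating site who had shown interest in her. The first column is the number of miles flown per year, the second is the time spent playing games per year (in units of 10 hours), and the third is the amount of ice cream eaten per year. The fourth column is the outcome: 3 means very interested, 2 means interested, and 1 means not interested.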
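
Because the mileage column is orders of magnitude larger than the other two, it would dominate a raw Euclidean distance, which is presumably why the script rescales every feature with `autoNorm` before classifying. The sketch below shows the same min-max idea on the first three rows of the training data; `min_max_normalize` is an illustrative name and not a function from the script.

```python
def min_max_normalize(rows):
    """Scale each column to [0, 1]; constant columns are left unchanged."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    scaled = [
        [(x - l) / (h - l) if h != l else x for x, l, h in zip(row, lo, hi)]
        for row in rows
    ]
    return scaled, lo, hi

# First three rows of the training data (features only).
rows = [
    [40920.0, 8.326976, 0.953952],
    [14488.0, 7.153469, 1.673904],
    [26052.0, 1.441871, 0.805124],
]
scaled, lo, hi = min_max_normalize(rows)
for row in scaled:
    print(row)
```

After scaling, every feature lies between 0 and 1, so each column contributes on a comparable scale to the distance.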

 

posted on 2016-05-02 09:47  sxron