# Predicting the Gender of Douban Movie Users with the kNN Algorithm

## Experimental Data

Each of the 274 samples consists of 37 feature values followed by a gender label:

```
X_{1,1},   X_{1,2},   ..., X_{1,37},   Y_1
X_{2,1},   X_{2,2},   ..., X_{2,37},   Y_2
...
X_{274,1}, X_{274,2}, ..., X_{274,37}, Y_274
```

For example, the first two records are:

```
0,0,0,3,1,34,5,0,0,0,11,31,0,0,38,40,0,0,15,8,3,9,14,2,3,0,4,1,1,15,0,0,1,13,0,0,1,1
0,1,0,2,2,24,8,0,0,0,10,37,0,0,44,34,0,0,3,0,4,10,15,5,3,0,0,7,2,13,0,0,2,12,0,0,0,0
```
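A single record can be parsed by splitting on commas; the first 37 values are the features and the last is the gender label. A minimal sketch using the first record above:

```python
# Split one record into its 37 feature values and its gender label.
sample = "0,0,0,3,1,34,5,0,0,0,11,31,0,0,38,40,0,0,15,8,3,9,14,2,3,0,4,1,1,15,0,0,1,13,0,0,1,1"
values = sample.split(',')
features = [float(v) for v in values[:-1]]  # the 37 feature values
label = int(values[-1])                     # the gender label
print(len(features), label)  # 37 1
```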

## The kNN Algorithm

The k-nearest neighbors algorithm (kNN) is one of the most basic classification algorithms. Its core idea is to classify a sample by measuring the distances between feature vectors.

Each feature column j is first min-max normalized to the range [0, 1]:

X'_j = (X_j - min_j) / (max_j - min_j)

The distance between samples i and j is the Euclidean distance over all 37 normalized features:

distance_{i,j} = sqrt((X_{i,1} - X_{j,1})^2 + (X_{i,2} - X_{j,2})^2 + ... + (X_{i,37} - X_{j,37})^2)
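The two formulas above can be sketched with NumPy on a toy matrix (made-up values with 3 features instead of 37, purely for illustration):

```python
import numpy as np

# Toy data: 4 samples with 3 features each (made-up values).
X = np.array([[0., 10., 5.],
              [2., 30., 1.],
              [4., 20., 3.],
              [1., 40., 2.]])

# Min-max normalization: rescale every feature column to [0, 1].
col_min = X.min(axis=0)
col_max = X.max(axis=0)
X_norm = (X - col_min) / (col_max - col_min)

# Euclidean distance from sample 0 to every sample.
dist = np.sqrt(((X_norm - X_norm[0]) ** 2).sum(axis=1))
print(dist)
```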

## Experimental Results

| k          | 1      | 3      | 5      | 7      |
|------------|--------|--------|--------|--------|
| Test set 1 | 62.96% | 81.48% | 70.37% | 77.78% |
| Test set 2 | 66.67% | 66.67% | 59.26% | 62.96% |
| Test set 3 | 62.96% | 74.07% | 70.37% | 74.07% |
| Average    | 64.20% | 74.07% | 66.67% | 71.60% |

## Python Code

Update (2016/03): I re-implemented the kNN code and simplified one small part of the previous version (selecting the most frequent label among the k nearest neighbors).

```python
from numpy import *

# Load a data file into a feature matrix and a label vector;
# the last column of each line is the class label.
def fileToMatrix(filename, sep=','):
    f = open(filename)
    content = f.readlines()
    f.close()

    first_line_list = content[0].strip().split(sep)

    data_matrix = zeros((len(content), len(first_line_list) - 1))
    label_vector = []

    index = 0
    for line in content:
        list_from_line = line.strip().split(sep)
        data_matrix[index, :] = list_from_line[0:-1]
        label_vector.append(int(list_from_line[-1]))
        index += 1

    return (data_matrix, label_vector)

# Classify inX by majority vote among its k nearest neighbors.
def classify(inX, data_matrix, label_vector, k):
    diff_matrix = inX - data_matrix
    square_diff_matrix = diff_matrix ** 2
    square_distances = square_diff_matrix.sum(axis=1)

    # Squared distances sort in the same order as distances,
    # so the square root can be skipped.
    sorted_indices = square_distances.argsort()

    label_count = {}
    for i in range(k):
        cur_label = label_vector[sorted_indices[i]]
        label_count[cur_label] = label_count.get(cur_label, 0) + 1

    # Pick the most frequent label among the k neighbors.
    max_count = 0
    nearest_label = None
    for label in label_count:
        count = label_count[label]
        if count > max_count:
            max_count = count
            nearest_label = label
    return nearest_label

# Hold out the last hold_ratio of the data as a test set and
# report the classification accuracy.
def test(filename, k=3, sep=',', hold_ratio=0.3):
    data_matrix, label_vector = fileToMatrix(filename, sep=sep)

    data_num = data_matrix.shape[0]
    test_num = int(hold_ratio * data_num)
    train_num = data_num - test_num

    train_matrix = data_matrix[0:train_num, :]
    test_matrix = data_matrix[train_num:, :]

    train_label_vector = label_vector[0:train_num]
    test_label_vector = label_vector[train_num:]

    right_count = 0
    for i in range(test_num):
        inX = test_matrix[i, :]

        classify_result = classify(inX, train_matrix, train_label_vector, k)
        if classify_result == test_label_vector[i]:
            right_count += 1
        print("  The classifier came back with: %d, the real answer is: %d"
              % (classify_result, test_label_vector[i]))

    accuracy = float(right_count) / float(test_num)
    print('The total accuracy is %f' % accuracy)
```
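As a quick sanity check, `classify` can be exercised on a tiny hand-made dataset. This is a self-contained restatement of the function above; the toy data and the tie-break via `max` are my own additions, not part of the original experiment:

```python
import numpy as np

def classify(inX, data_matrix, label_vector, k):
    # Same logic as above: squared Euclidean distances, then majority vote.
    square_distances = ((inX - data_matrix) ** 2).sum(axis=1)
    sorted_indices = square_distances.argsort()
    label_count = {}
    for i in range(k):
        cur_label = label_vector[sorted_indices[i]]
        label_count[cur_label] = label_count.get(cur_label, 0) + 1
    # Most frequent label among the k neighbors.
    return max(label_count, key=label_count.get)

# Tiny hand-made dataset: two clusters labeled 0 and 1.
train = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])
labels = [0, 0, 1, 1]
print(classify(np.array([0.15, 0.15]), train, labels, k=3))  # 0
print(classify(np.array([0.85, 0.85]), train, labels, k=3))  # 1
```

With a real data file in the format described above, the original driver would be invoked as `test(filename, k=3)`, where the filename depends on where the rating data is stored.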

## References

Peter Harrington. Machine Learning in Action. Translated by Li Rui, Li Peng, Qu Yadong, and Wang Bin. Beijing: Posts & Telecom Press, 2013.

posted @ 2015-10-07 10:44 夏方方