[Machine Learning] kNN近邻算法

KNN算法的原理很简单：

　　1. 物以类聚，人以群分：最直观的证据就是你离谁近，所以你们是一类的。

　　2. 为了防止异类（特殊情况）：取最近的N个点，算概率。

所以算法的大致过程：

　　计算预测数据与每一条训练集（其实并没有经过训练）的距离，然后对结果进行排序。取距离最小的N个点，统计这N歌点每个类出现的次数，对次数进行排序。预测结果就是在N个点中出现最多的类

from numpy import *
import operator

def createDataSet() :
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 1.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels


'''
    tile(array, (intR, intC): 对矩阵进行组合，纵向复制intR次， 横向复制intC次
    比如 : tile([1,2,3], (3, 2))
    输出
    [
        [1, 2, 3, 1, 2, 3],
        [1, 2, 3, 1, 2, 3],
        [1, 2, 3, 1, 2, 3]
    ]
    array减法， 两个 行列数相等的矩阵，对应位置做减法

    argsort(array, axis) 对矩阵进行排序，axis=0 按列排序， axis=1 按行排序  输出的是排序的索引。比如输出[0,2,1], 排序结果结果为 array[0],array[2].array[1]

    aorted(iteratorItems, key, reverse)  对可迭代的对象进行排序

'''
def classify0(intX, dataSet, labels, k) :    # 假设输入 intX = [0, 0], dataSet = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 1.1]]), labels = ['A', 'A', 'B', 'B'], k = 3
    dataSetSize = dataSet.shape[0];    # 行数 dataSetSize = 4
    diffMat = tile(intX, (dataSetSize, 1)) - dataSet    # 矩阵差  diffMat = array([[-1.0, -1.1], [-1.0, -1.0], [0, 0], [0, -1.1]])
    sqDiffMat = diffMat ** 2    # 平方   sqDiffMat = array([[1, 1.21], [1, 1], [0, 0], [0, 1.21]])
    sqDistances = sqDiffMat.sum(axis=1)    # 行和，axis=0时输出纵和  sqDistances = array([2.21, 2, 0, 1.21])
    distances = sqDistances ** 0.5    # 开平方 distances = array([1.41, 1.48, 0, 1.1])
    sortedDistIndicies = distances.argsort()    # 排序 sortedDistIndicies = array([2, 3, 0, 1])
    classCount = {}
    for i in range(k) :
        voteIlabel = labels[sortedDistIndicies[i]]   
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1    # {label:count}， 取距离最小的三个， 统计label出现的次数，最终 classCount = {'B': 2, 'A': 1}
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)    # 对dict按照v值（distance）进行倒序排序 sortedClassCount = [('B', 2), ('A', 1)]
    return sortedClassCount[0][0]  # 返回第一个tuple的第一个值，也就是出现次数最高的label， 这里返回‘B’


if __name__ == "__main__":
    group, labels = createDataSet()
    classify0([0, 0], group, labels, 3)

'''
    从txt文件中读取数据。 将前三个数据（输入）存储在 returnMat 中， 第四个数据存储在 classLabelVector 中
'''
def file2matrix(filename) : 
    fr = open(filename)
    arrayOLines = fr.readlines()
    numberOfLines = len(arrayOLines)    # 文本行数
    returnMat = zeros((numberOfLines, 3))    # 用0填充的 文本行数x3 的矩阵
    classLabelVector = []  
    index = 0
    for line in arrayOLines:    # 遍历每一行数据
        line = line.strip()    # 去掉换行符
        listFromLine = line.split('\t')    # 通过空格分割数据， 返回一个list
        returnMat[index, :] = listFromLine[0:3]    # 将前三个数据存储为 returnMat 矩阵的一行 
        classLabelVector.append(int(listFromLine[-1]))    # 将最后一个数据存储在classLabelVector
        index += 1
    return returnMat, classLabelVector

'''
    归一化处理，这个地方有一点问题，
    这样处理完所有的数据肯定是在0-1之间的，我觉得这样明显是改变矩阵的特征值和特征向量的，至少我现在还没证明这个处理过程是符合矩阵的恒等变换的。
'''
def autoNorm(dataSet) :
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxValue - minVals
    normDataSet = zeros(shape(dataSet))
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))
    return normDataSet, ranges, minVals

'''
    约会数据测试
'''
def datingClassTest() :
    hoRatio = 0.10    # 提取 0.10 也就是 10% 的数据作为测试集
    datingDataMat, datingLabels = file2matrix('datingTestSet.txt')    # 读取数据
    normMat, ranges, minVals = autoNorm(datingDataMat)    # 归一化处理
    m = normMat.shape[0]    # 总数据量
    numTestVecs = int(m * hoRatio)    # 测试集数据量
    errorCount = 0.0
    for i in range(numTestVecs):
        # 显然测试集的数据为前10%， 训练集的数据为剩下的90%， 取距离最小的3个数据作为预测集
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs : m, :], datingLabels[numTestVecs : m], 3)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i]))
        if (classifierResult != datingLabels[i]) : errorCount += 1.0
    print("the total error rate is: %f" % (errorCount/float(numTestVecs)))    # 错误率

测试

yeyeck@ubuntu:~/yeyeck$ python3
Python 3.6.6 (default, Sep 12 2018, 18:26:19) 
[GCC 8.0.1 20180414 (experimental) [trunk revision 259383]] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import kNN
>>> kNN.datingClassTest()
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 1
the total error rate is: 0.050000

posted @ 2018-11-03 15:28 早起的虫儿去吃鸟阅读(249) 评论(0) 收藏举报

刷新页面返回顶部

莫问今朝

每天学习一点点，进步一点点，不管是什么知识

[Machine Learning] kNN近邻算法

公告

莫问今朝

每天学习一点点，进步一点点， 不管是什么知识

[Machine Learning] kNN近邻算法

公告

每天学习一点点，进步一点点，不管是什么知识