K-Nearest Neighbors (kNN)

Simply put, the k-nearest neighbors algorithm classifies a sample by measuring the distances between its feature values and those of known samples.

Pros: high accuracy, insensitive to outliers, no assumptions about the input data.
Cons: high computational complexity, high space complexity.
Applicable data: numeric and nominal values.

General workflow of kNN:

For each point with an unknown class label, perform the following steps in order:

  • compute the distance between the current point and every point in the known-class dataset;
  • sort the distances in ascending order;
  • select the k points closest to the current point;
  • count the frequency of each class among those k points;
  • return the most frequent class among the k points as the predicted class of the current point.

The classification code:

import operator
import numpy as np

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # tile inX so it repeats dataSetSize times along the rows, matching
    # dataSet's shape, then take the element-wise differences
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    # indices that would sort the distances in ascending order
    sortedDistIndicies = distances.argsort()
    # map each class label to its vote count among the k nearest neighbors
    classCount = {}
    for i in range(k):
        # label of the i-th closest training point
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # sort the (label, count) pairs by count in descending order and
    # return the label with the highest frequency
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
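
As a quick sanity check, here is a minimal usage sketch on a toy two-feature dataset (the values are illustrative, in the spirit of the book's createDataSet example):

group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
# a point near the origin has B, B, A as its three nearest neighbors
print(classify0([0.0, 0.2], group, labels, 3))  # -> 'B'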

Supporting data-processing code:

Parsing data from a text file:

def file2matrix(filename):
    with open(filename) as fr:
        arrayOLines = fr.readlines()
    numberOfLines = len(arrayOLines)
    # one row per line, three feature columns
    returnMat = np.zeros((numberOfLines, 3))
    classLabelVector = []
    index = 0
    for line in arrayOLines:
        line = line.strip()
        listFromLine = line.split('\t')
        # the first three tab-separated fields are features, the last is the class label
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector
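
Assuming a tab-separated data file such as the book's datingTestSet2.txt (three numeric features per line followed by an integer class label; the filename comes from the book's dataset, not from the code above), usage looks like:

datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
print(datingDataMat[0:3])   # first three feature rows
print(datingLabels[0:3])    # first three class labels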

Normalizing feature values:

def autoNorm(dataSet):
    # the argument 0 takes the min/max over each column, i.e. per feature
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    # scale each feature to [0, 1]: (value - min) / (max - min)
    normDataSet = dataSet - np.tile(minVals, (m, 1))
    normDataSet = normDataSet / np.tile(ranges, (m, 1))
    return normDataSet, ranges, minVals
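
To see how the pieces fit together, here is a hold-out test in the spirit of the book's datingClassTest, assuming the datingTestSet2.txt file mentioned above; the 10% hold-out ratio and k=3 are conventional choices, not requirements:

def datingClassTest():
    hoRatio = 0.10  # fraction of samples held out for testing
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        # the first numTestVecs rows are test samples, the rest serve as training data
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 3)
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))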

Application: a handwritten digit recognition system

Preparing the data: converting images to test vectors.
The function below creates a 1x1024 NumPy array, opens the given file, loops over the first 32 lines of the file, stores the first 32 character values of each line in the NumPy array, and returns the array.

def img2vector(filename):
    # flatten a 32x32 text image of '0'/'1' characters into a 1x1024 row vector
    returnVect = np.zeros((1, 1024))
    with open(filename) as fr:
        for i in range(32):
            lineStr = fr.readline()
            for j in range(32):
                returnVect[0, 32*i + j] = int(lineStr[j])
    return returnVect
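
For example, assuming a test image file such as testDigits/0_13.txt from the book's dataset:

testVector = img2vector('testDigits/0_13.txt')
print(testVector[0, 0:31])   # first 31 values of the flattened image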

Test code for the handwritten digit recognition system:

import os

def handwritingClassTest():
    hwLabels = []
    # file names look like "9_45.txt": the digit class, an underscore, a sample index
    trainingFileList = os.listdir('trainingDigits')
    m = len(trainingFileList)
    trainingMat = np.zeros((m, 1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        hwLabels.append(classNumStr)
        trainingMat[i, :] = img2vector('trainingDigits/%s' % fileNameStr)
    testFileList = os.listdir('testDigits')
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print("the classifier came back with: %d, the real answer is: %d"
              % (classifierResult, classNumStr))
        if classifierResult != classNumStr:
            errorCount += 1.0
    print("\nthe total number of errors is: %d" % errorCount)
    print("\nthe total error rate is: %f" % (errorCount / float(mTest)))
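
With the book's trainingDigits and testDigits directories placed in the working directory (directory names as used above), the test is then run simply as:

handwritingClassTest()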

Summary:

kNN is one of the simplest and most effective algorithms for classifying data, and it is a form of supervised classification. It does have drawbacks:

  • It must store the entire training dataset, which can require a large amount of storage when the training set is big.
  • Because a distance must be computed to every sample in the dataset, classification can be very slow in practice.
  • It gives no information about the underlying structure of the data, so we cannot learn what an average or typical instance of each class looks like.