ML_inAction, Chapter 2 (3): A kNN Handwriting Recognition System

1. Preparing the data: converting images to vectors

First, a quick review of the file operations read(), readline(), and readlines().

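file.read():

　　Feature: reads the entire file and puts its contents into a single string variable.

　　Drawback: if the file is very large, especially if it is larger than available memory, read() cannot be used.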

file = open('兼职模特联系方式.txt', 'r')  # the file object returned here is also iterable

try:
    text = file.read()  # the result is a str
    print(type(text))
    print(text)
finally:
    file.close()
"""
<class 'str'>
吴迪 177 70 13888888
王思 170 50 13988888
白雪 167 48 13324434
黄蓉 166 46 13828382
"""

read() reads the file contents straight into a string, newline characters included.

>>> file = open('兼职模特联系方式.txt', 'r')
>>> a = file.read()
>>> a
'吴迪 177 70 13888888\n王思 170 50 13988888\n白雪 167 48 13324434\n黄蓉 166 46 13828382'

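file.readline():

　　Feature: readline() reads one line per call and returns a string, so only the current line is held in memory.

　　Drawback: usually much slower than readlines(), because it needs one call per line.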

file = open('兼职模特联系方式.txt', 'r')

try:
    while True:
        text_line = file.readline()
        if text_line:
            print(type(text_line), text_line)
        else:
            break
finally:
    file.close()
"""
<class 'str'> 吴迪 177 70 13888888

<class 'str'> 王思 170 50 13988888

<class 'str'> 白雪 167 48 13324434

<class 'str'> 黄蓉 166 46 13828382
"""

readline() reads a whole line, including the line terminator, and returns it as a string.

>>> file = open('兼职模特联系方式.txt', 'r')
>>> a = file.readline()
>>> a
'吴迪 177 70 13888888\n'

file.readlines():

　　Feature: reads the entire file in one go and splits its contents into a list of lines.

file = open('兼职模特联系方式.txt', 'r')

try:
    text_lines = file.readlines()
    print(type(text_lines), text_lines)
    for line in text_lines:
        print(type(line), line)
finally:
    file.close()
"""
<class 'list'> ['吴迪 177 70 13888888\n', '王思 170 50 13988888\n', '白雪 167 48 13324434\n', '黄蓉 166 46 13828382']
<class 'str'> 吴迪 177 70 13888888

<class 'str'> 王思 170 50 13988888

<class 'str'> 白雪 167 48 13324434

<class 'str'> 黄蓉 166 46 13828382
"""

readlines() reads all the lines and returns them as a list of strings.

>>> file = open('兼职模特联系方式.txt', 'r')
>>> a = file.readlines()
>>> a
['吴迪 177 70 13888888\n', '王思 170 50 13988888\n', '白雪 167 48 13324434\n', '黄蓉 166 46 13828382']
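
The three methods can be compared side by side in a short sketch. It assumes a small throwaway file (the name sample.txt and its contents are made up for illustration) and uses `with`, which closes the file automatically in place of try/finally. The last block iterates over the file object itself, which yields one line at a time and is the usual choice for large files.

```python
import os
import tempfile

# Hypothetical sample file, created only so the sketch is runnable.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("a 1\nb 2\nc 3")

# read(): the whole file as one string, newlines included.
with open(path, "r", encoding="utf-8") as f:
    whole = f.read()

# readline(): a single line, with its trailing '\n'.
with open(path, "r", encoding="utf-8") as f:
    first = f.readline()

# readlines(): every line, as a list of strings.
with open(path, "r", encoding="utf-8") as f:
    all_lines = f.readlines()

# Iterating over the file object reads one line at a time,
# so memory use stays small even when the file is large.
with open(path, "r", encoding="utf-8") as f:
    stripped = [line.rstrip("\n") for line in f]

print(repr(whole))
print(repr(first))
print(all_lines)
print(stripped)
```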

The book has already converted the images to text; now the text for each handwritten image has to be converted into vector form.

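# Listing: converting an image to a vector
from numpy import zeros

'''
    parameters:
    img_x: number of rows of the image in the text file
    img_y: number of columns of the image in the text file
'''
def img2vector(filename,img_x,img_y):
    returnVector = zeros((1,img_x*img_y))
    with open(filename) as f:
        for i in range(img_x):
            line_txt = f.readline()
            for j in range(img_y):
                # row i, column j maps to flat index img_y*i+j (each row is img_y wide)
                returnVector[0,img_y*i+j] = int(line_txt[j])
    return returnVector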


#2.3.1 test
import kNN
testVector = kNN.img2vector('testDigits/0_13.txt',32,32)
print(testVector[0,:31])

#result
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0.]
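
As a cross-check on the flattening, here is a minimal sketch of the same conversion done with a NumPy reshape instead of explicit index arithmetic. The function name img2vector_np and the 4x3 toy image are made up for illustration and are not from the book. A non-square image is used on purpose: the element at row i, column j ends up at flat index i*cols + j.

```python
import os
import tempfile

import numpy as np


def img2vector_np(filename, rows=32, cols=32):
    """Read a rows x cols grid of '0'/'1' characters into a 1 x (rows*cols) array."""
    with open(filename) as f:
        lines = [f.readline() for _ in range(rows)]
    # Slice each line to `cols` characters so the trailing newline is ignored.
    grid = np.array([[int(ch) for ch in line[:cols]] for line in lines], dtype=float)
    # reshape flattens row by row (row-major order).
    return grid.reshape(1, rows * cols)


# Hypothetical 4x3 image, written to a temp file so the sketch runs without the book's data.
path = os.path.join(tempfile.mkdtemp(), "toy_digit.txt")
with open(path, "w") as f:
    f.write("010\n111\n010\n000\n")

vec = img2vector_np(path, rows=4, cols=3)
print(vec.shape)
print(vec[0])
```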

2. Test code for the handwriting recognition system

First, a review of Python's os.listdir() function:

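os.listdir() returns a list of the names of the files and folders contained in the given directory. The list is in arbitrary order, not guaranteed to be alphabetical. It does not include '.' and '..', even if they are present in the directory.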
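
A small sketch of how os.listdir() might be used with digit files named like '0_13.txt'. The temporary directory and the three file names are made up for illustration. Sorting the result explicitly keeps the order reproducible across runs and platforms, and the class label is parsed from the part of the name before the underscore.

```python
import os
import tempfile

# Hypothetical directory with files named '<digit>_<index>.txt'.
folder = tempfile.mkdtemp()
for name in ["9_2.txt", "0_13.txt", "3_7.txt"]:
    open(os.path.join(folder, name), "w").close()

# sorted() fixes the order; the raw listing order is not something to rely on.
names = sorted(os.listdir(folder))

# The class label is the text before the underscore.
labels = [int(name.split("_")[0]) for name in names]

print(names)
print(labels)
```

Splitting on the underscore rather than taking the first character would also handle labels of more than one digit.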

# Listing 2.6: classifier test code for handwritten digit recognition
from os import listdir
from numpy import zeros
'''
    k: the k of kNN
'''
def handwriting_test(k):
    trainSet_filelist = listdir('trainingDigits')
    m = len(trainSet_filelist)
    train_data = zeros((m,1024))
    train_label = []
    for i in range(m):
        filename = trainSet_filelist[i]
        train_data[i,:] = img2vector('trainingDigits/%s'%filename,32,32)   # copy the 1x1024 vector into row i
        train_label.append(int(filename[0]))   # the label is the first character of the file name
    # training data and labels are ready
    # test
    testSet_filelist = listdir('testDigits')
    test_num = len(testSet_filelist)
    error_count = 0
    for i in range(test_num):
        filename = testSet_filelist[i]
        test_X = img2vector('testDigits/%s'%filename,32,32)
        test_label = int(filename[0])
        class_predict = classify0(test_X,train_data,train_label,k)
        if class_predict != test_label:
            error_count += 1
            print("the class_predict is: %d ,but the real class is: %d"
                  %(class_predict,test_label))

    # convert before dividing: float(error_count/test_num) truncates to 0.0 under Python 2
    error_rate = error_count/float(test_num)
    print(error_count,test_num,error_rate)


#test
import kNN 

k = 3
kNN.handwriting_test(k)


#result
'''
the class_predict is: 7 ,but the real class is: 1
the class_predict is: 9 ,but the real class is: 3
the class_predict is: 9 ,but the real class is: 3
the class_predict is: 3 ,but the real class is: 5
the class_predict is: 6 ,but the real class is: 5
the class_predict is: 6 ,but the real class is: 8
the class_predict is: 3 ,but the real class is: 8
the class_predict is: 1 ,but the real class is: 8
the class_predict is: 1 ,but the real class is: 8
the class_predict is: 1 ,but the real class is: 9
the class_predict is: 7 ,but the real class is: 9
11 946 0.0
'''
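# Why the error rate printed as 0.0: the run that produced it computed float(error_count/test_num) under Python 2, where 11/946 is integer division and is truncated to 0 before float() is applied. With true division the error rate is 11/946 ≈ 0.0116.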
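
The arithmetic behind the 0.0 can be checked directly. This is a small sketch using the counts from the run above; it relies on Python 3's `//` operator to mimic what Python 2's `/` does on two integers.

```python
error_count, test_num = 11, 946   # counts from the run above

# Floor division drops the fractional part, so converting afterwards is too late.
truncated = float(error_count // test_num)

# Converting one operand first gives true division in both Python 2 and Python 3.
error_rate = error_count / float(test_num)

print(truncated)
print(round(error_rate, 4))
```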

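3. Chapter summary

kNN is simple: once k, a distance metric, and an evaluation method are chosen, it can be implemented.

The classification result depends heavily on the value of k, yet I know of no principled way to choose it (there may well be one; as a beginner I just don't know it). In addition, when the data is high-dimensional and there are many samples, the computation becomes expensive. A kd-tree can help, but I don't understand tree structures yet, so my next step is to cram data structures and algorithms and then carry on studying!

I hope the code in the later chapters isn't too complicated; reproducing code takes a lot of time...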
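
One common way to compare values of k is to measure the error on held-out points. The sketch below is not the book's classify0; knn_classify and leave_one_out_error are hypothetical names, the eight 2D points are made-up toy data, and leave-one-out evaluation is just one possible choice. Each point is classified using all the others, and the error fraction is reported for several k.

```python
from collections import Counter

import numpy as np


def knn_classify(x, train_data, train_labels, k):
    """Majority vote among the k training points nearest to x (Euclidean distance)."""
    distances = np.sqrt(((train_data - x) ** 2).sum(axis=1))
    nearest = distances.argsort()[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


def leave_one_out_error(data, labels, k):
    """Classify each point using all the other points; return the error fraction."""
    errors = 0
    for i in range(len(data)):
        rest = np.delete(data, i, axis=0)
        rest_labels = labels[:i] + labels[i + 1:]
        if knn_classify(data[i], rest, rest_labels, k) != labels[i]:
            errors += 1
    return errors / len(data)


# Toy, made-up data: two small clusters in 2D, four points each.
data = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.1, 0.1],
                 [1.0, 1.0], [0.9, 1.1], [1.1, 0.9], [1.0, 0.9]])
labels = [0, 0, 0, 0, 1, 1, 1, 1]

results = {k: leave_one_out_error(data, labels, k) for k in (1, 3, 7)}
for k, err in results.items():
    print(k, err)
```

On this toy set the small values of k classify every held-out point correctly, while k = 7 uses all remaining points, so the other cluster always outvotes the held-out point's own cluster. Real data will be far less clear-cut.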

posted @ 2018-10-30 00:26 bo0814
