《机器学习系统设计》（1）

来自书籍《Building Machine Learning Systems with Python 》

本书主要在于如何实际的教用户来学习ml，其中简单的介绍了ml的原理，重点还是放在使用python和numpy、scipy、scikit-learn等包的使用上。通过简单的实例来讲解，还算是有趣。正如豆瓣上说的：

机器学习理论的经典教材很多，但讲经典的理论如何实现的好书就不那么多了。用python做机器学习的书，《集体智慧编程》《机器学习实战》算是佼佼者，但这些书都是讲的怎么自己造轮子。而造出来的轮子在实际工程中，几乎是没有实用价值的。实际做机器学习项目时，用的往往都是现成的高效模型，或在这些模型基础上做一些改进。如用python做机器学习，常会用到scikit-learn、numpy、scipy、matplotlib这些库，但除了官方文档，几乎没有书系统的阐述这些东东的工程级应用。这本书的出现，填补了这一空白。这本书是给工程师看的，典型的快餐式书。 看了太多细火慢炖的东西，吃顿快餐，能迅速止饿。
　　==================================================
　　优点：这本书告诉了你，python机器学习工业级的应用
　　缺点：凭空用了许多函数，却没有告诉你函数参数的意义

所以本书的真实意义就在于止饿，看过了就行了。希望借博客形式将里面有价值的一次提取，以后就不需要翻看那些冗余的部分了。

第一章引言

这一章主要介绍的就是本书的开始，最开始简单介绍了下numpy和scipy这两个python包，书中也说了，对于python来说，适合来做算法的建模和预测，等到了需要的时候，再将算法重写成c 或者cpp，不过随着python的一些代码包被重写成c形式，在速度上还是能够满足的。还介绍了ML的流程：

1、获取数据，然后洗数据；

2、探索和理解输入数据；

3、分析如何最好的将数据表达在学习算法中；

4、选择正确的模型和学习算法；

5、正确的测量模型的效果。

不过对于学习算法和模型的选择，并不是那么容易简单的，需要考虑的部分有很多，具体的就不说了，用到的时候自然会更加明白。并且介绍了几个查资料的好去处：

1、http://metaoptimize.com/qa – 该站点上几乎每一个问题，都是机器学习专家差不多认同的答案。即使你没有任何问题，也还是能够从中学到不少知识的。
2、 http://stats.stackexchange.com – 该问题查询站点,也叫做交叉验证( Cross Validated)，类似于元优化(MetaOptimized),上面的问题都主要偏向于统计学的问题。
3、http://stackoverflow.com –该站点和上面那个差不多，不过更多关注编程方面。
4、 Freenode上的#machinelearning频道 – 很小不过却很活跃，有不少ML社区的专家。（个人：玩不来）
5、http://www.TwoToReal.com – 这是本书作者们建立的站点。

6、书中推荐的博客http://blog.kaggle.com，强烈推荐的，上面有着比赛，还有很多冠军对自己算法的讲解，如果其他地方不去的话，这个地方强烈推荐去看看。

不过书中说到了，因为Numpy和scipy中一些算法是通过c重写的，所以很快，而python自带的一些操作很慢，所以在使用之前，最好先去scipy查查有没有对应的算法，直接使用。比如scipy是支持矩阵操作、线性代数、优化、聚类、空间操作、快速傅里叶变换等等。而且scipy是基于numpy的，所以下面的语句是正确的：

>>> scipy.dot is numpy.dot
True

不同的算法对应着有不同的工具箱：

本章的例子是假设网站新成立，需要通过每个小时的访问人数来决定什么时候更换更大的服务器，而较早更换太花钱，较晚更换又会使得网站瘫痪；所以如何很好的预测服务器的负载上限时间，还是很有意义的；原始数据显示结果为：

下面就是能够说明本章的代码完整版：

import os
import scipy as sp
import matplotlib.pyplot as plt

data_dir = os.path.join(
    os.path.dirname(os.path.realpath(__file__)), "..", "data")
data = sp.genfromtxt(os.path.join(data_dir, "web_traffic.tsv"), delimiter="\t")#这里的tsv是个txt格式文档，其中是个743*2的矩阵，其中第一列为1:1:743；第二列为每个小时的网站访问次数。该函数第一个为文件的路径及文件；第二个以‘\t'作为分隔符。
print(data[:10])

# all examples will have three classes in this file
colors = ['g', 'k', 'b', 'm', 'r']
linestyles = ['-', '-.', '--', ':', '-']

x = data[:, 0]#读取第一列
y = data[:, 1]#读取第二列
print("Number of invalid entries:", sp.sum(sp.isnan(y)))#其中isnan函数返回参数代表的数组中是否是非数值的bool矩阵
x = x[~sp.isnan(y)]#前面加~将返回的bool索引数组取反，这样就能将那些数值的元素给索引出来
y = y[~sp.isnan(y)]

# plot input data

#在python中可以通过matplotlib来模仿matlab中的figure的画图功能，下面是定义一个画数据的函数，具体细节可参阅matplotlib官网说明
def plot_models(x, y, models, fname, mx=None, ymax=None, xmin=None):
    plt.clf()
    plt.scatter(x, y, s=10)
    plt.title("Web traffic over the last month")#图的标题
    plt.xlabel("Time")
    plt.ylabel("Hits/hour")
    plt.xticks(
        [w * 7 * 24 for w in range(10)], ['week %i' % w for w in range(10)])#x轴的刻度位置：标签

    if models:
        if mx is None:
            mx = sp.linspace(0, x[-1], 1000)
        for model, style, color in zip(models, linestyles, colors):
            # print "Model:",model
            # print "Coeffs:",model.coeffs
            plt.plot(mx, model(mx), linestyle=style, linewidth=2, c=color)

        plt.legend(["d=%i" % m.order for m in models], loc="upper left")

    plt.autoscale(tight=True)
    plt.ylim(ymin=0)
    if ymax:
        plt.ylim(ymax=ymax)
    if xmin:
        plt.xlim(xmin=xmin)
    plt.grid(True, linestyle='-', color='0.75')
    plt.savefig(fname)

# 第一次看到数据的时候采用不同阶数的多项式进行拟合
plot_models(x, y, None, os.path.join("..", "1400_01_01.png"))

# create and plot models
fp1, res, rank, sv, rcond = sp.polyfit(x, y, 1, full=True)#多项式拟合函数，返回的fp1为按照第三个参数（多项式的阶数）指定的多项式的各个系数
print("Model parameters: %s" % fp1)
print("Error of the model:", res)##输出拟合后模型的残差
f1 = sp.poly1d(fp1)#为y = kx+b
f2 = sp.poly1d(sp.polyfit(x, y, 2))#拟合的为y =ax^2+bx+c 
f3 = sp.poly1d(sp.polyfit(x, y, 3))
f10 = sp.poly1d(sp.polyfit(x, y, 10))
f100 = sp.poly1d(sp.polyfit(x, y, 100))

plot_models(x, y, [f1], os.path.join("..", "1400_01_02.png"))
plot_models(x, y, [f1, f2], os.path.join("..", "1400_01_03.png"))
plot_models(
    x, y, [f1, f2, f3, f10, f100], os.path.join("..", "1400_01_04.png"))

#通过分析数据的走势，然后在决定使用两段直线来拟合，在发现week3,5的时候前后的数据呈现一个不同的走势
inflection = 3.5 * 7 * 24 #x轴的分段点
xa = x[:inflection]
ya = y[:inflection]
xb = x[inflection:]
yb = y[inflection:]

fa = sp.poly1d(sp.polyfit(xa, ya, 1))#直线拟合
fb = sp.poly1d(sp.polyfit(xb, yb, 1))#直线拟合

plot_models(x, y, [fa, fb], os.path.join("..", "1400_01_05.png"))

#定义计算残差的函数
def error(f, x, y):
    return sp.sum((f(x) - y) ** 2)

print("Errors for the complete data set:")
for f in [f1, f2, f3, f10, f100]:
    print("Error d=%i: %f" % (f.order, error(f, x, y)))

print("Errors for only the time after inflection point")
for f in [f1, f2, f3, f10, f100]:
    print("Error d=%i: %f" % (f.order, error(f, xb, yb)))

print("Error inflection=%f" % (error(fa, xa, ya) + error(fb, xb, yb)))


# extrapolating into the future
plot_models(
    x, y, [f1, f2, f3, f10, f100], os.path.join("..", "1400_01_06.png"),
    mx=sp.linspace(0 * 7 * 24, 6 * 7 * 24, 100),
    ymax=10000, xmin=0 * 7 * 24)

print("Trained only on data after inflection point")
fb1 = fb
fb2 = sp.poly1d(sp.polyfit(xb, yb, 2))
fb3 = sp.poly1d(sp.polyfit(xb, yb, 3))
fb10 = sp.poly1d(sp.polyfit(xb, yb, 10))
fb100 = sp.poly1d(sp.polyfit(xb, yb, 100))

print("Errors for only the time after inflection point")
for f in [fb1, fb2, fb3, fb10, fb100]:
    print("Error d=%i: %f" % (f.order, error(f, xb, yb)))

plot_models(
    x, y, [fb1, fb2, fb3, fb10, fb100], os.path.join("..", "1400_01_07.png"),
    mx=sp.linspace(0 * 7 * 24, 6 * 7 * 24, 100),
    ymax=10000, xmin=0 * 7 * 24)

# 通过将所有数据划分成训练集合测试集来选择多项式拟合的阶数
frac = 0.3
split_idx = int(frac * len(xb))
shuffled = sp.random.permutation(list(range(len(xb))))
test = sorted(shuffled[:split_idx])
train = sorted(shuffled[split_idx:])
fbt1 = sp.poly1d(sp.polyfit(xb[train], yb[train], 1))#一阶多项式拟合后得到的多项式形式
fbt2 = sp.poly1d(sp.polyfit(xb[train], yb[train], 2))#2阶多项式拟合后得到的多项式形式
fbt3 = sp.poly1d(sp.polyfit(xb[train], yb[train], 3))#3阶多项式
fbt10 = sp.poly1d(sp.polyfit(xb[train], yb[train], 10))
fbt100 = sp.poly1d(sp.polyfit(xb[train], yb[train], 100))

print("Test errors for only the time after inflection point")
for f in [fbt1, fbt2, fbt3, fbt10, fbt100]:
    print("Error d=%i: %f" % (f.order, error(f, xb[test], yb[test])))

plot_models(
    x, y, [fbt1, fbt2, fbt3, fbt10, fbt100], os.path.join("..",
                                                          "1400_01_08.png"),
    mx=sp.linspace(0 * 7 * 24, 6 * 7 * 24, 100),
    ymax=10000, xmin=0 * 7 * 24)

from scipy.optimize import fsolve#通过fsolve函数来寻找达到最大值时候的解，即得到什么时候是服务器负载上限的日子
print(fbt2)
print(fbt2 - 100000)
reached_max = fsolve(fbt2 - 100000, 800) / (7 * 24)#参数：函数形式；自变量初始值
print("100,000 hits/hour expected at week %f" % reached_max[0])

在上面的代码中，第一章的有价值部分是：

1、sp.ployfit()函数，该函数为scipy提供的多项式拟合函数：参数形式为：x，y，阶数，是否完整输出模型参数。

2、拿到数据之后最好先了解下，正如上面代码的第二部分，在观察数据的走势发现，使用分段形式，在3.5周前后分别使用直线拟合，最后得到的效果最好（这是在第四周之后的事情），即在真实世界中的影响因素很多，虽然采用多项式拟合100阶时候过拟合，1阶欠拟合，而且分段形式的error还大，但是根据作者后期对网站的数据观察，分段的效果最好，当然这不是说什么都是分段，只是在做数据预测的时候，最好多个模型都做评估，而不只要单单评估一个模型，这样在商业上，后期才能有很好的方案遇见，毕竟机器学习也只是基于当前的数据的预测。

参考资料：

1、python使用matplotlib绘图：http://www.cnblogs.com/qianlifeng/archive/2012/02/13/2350086.html

2、科学计算：最优化问题：http://blog.sina.com.cn/s/blog_5f234d4701013ln6.html

3、scipy 数值计算库：http://www.tuicool.com/articles/rmauqun

第二章

首先介绍了一个1930年的花朵数据集，Iris（https://archive.ics.uci.edu/ml/datasets/Iris），其中包含150个样本，每个样本4个特征（花萼长度，花萼宽度；花瓣长度，花瓣宽度）。第一步就是ML需要的可视化，因为很多时候，我们是需要通过肉眼来观察数据的内在走势的，然后如第一章一样挑选不同的模型。不过这个数据集被sklearn自带了，不需要去官网上下载。

import numpy as np
from sklearn.datasets import load_iris  #如果安装的是anaconda的话会自带
from matplotlib import pyplot as plt    #或者import matplotlib.pyplot as plt

data = load_iris()                      #直接装载iris数据集
features = data['data']                 #或者features = data.data（这即可以作为字典的键索取，也被定为该变量的属性）
feature_names = data['feature_names']
target = data['target']

#下面是画出6个子图的部分
pairs = [(0,1),(0,2),(0,3),(1,2),(1,3),(2,3)]
for i,(p0,p1) in enumerate(pairs):    #enumerate函数在《howto-functional》中有介绍
    plt.subplot(2,3,i+1)
    for t,marker,c in zip(range(3),">ox","rgb"):
        plt.scatter(features[target == t,p0], features[target == t,p1], marker=marker, c=c)
    plt.xlabel(feature_names[p0])
    plt.ylabel(feature_names[p1])
    plt.xticks([])
    plt.yticks([])
plt.savefig('../1400_02_01.png')

1、enumerate(iter)函数会计算可迭代元素的个数，然后返回一个2-元组，分别为：计数值；每个元素。

在可视化之后，肉眼发现，iris数据集中setosa花类别明显和其他两个分离，所以直接通过那个特征将该花分离出来，使得数据集变成了2分类形式：

import numpy as np
from sklearn.datasets import load_iris

data = load_iris()
features = data['data']                   #读取特征矩阵
labels = data['target_names'][data['target']]   #即name[target]，两个都是numpy.ndarray类型

plength = features[:,2]                   #肉眼观察可以直接区分的特征
is_setosa = (labels == 'setosa')          #生成标签等于‘setosa’的bool索引矩阵
print('Maximum of setosa: {0}.'.format(plength[is_setosa].max()))              #输出setosa的最大上限
print('Minimum of others: {0}.'.format(plength[~is_setosa].min()))             #输出其他两类的最小下限

只要满足下面的就是setosa类

if features[:,2] <2 : print 'setosa'

然后选择剩下的两类数据：

from matplotlib import pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from threshold import learn_model, apply_model, accuracy #从阈值模块中装载模型学习，使用，测试函数

data = load_iris()                               #装载数据
features = data['data']                          #读取特征部分
labels = data['target_names'][data['target']]    #将【0 1 2】标签换成名称


setosa = (labels == 'setosa')                   #生成bool索引矩阵
features = features[~setosa]                    #提取其余两类特征
labels = labels[~setosa]                        #提取其余两类标签
virginica = (labels == 'virginica')             #读取两类中为virginica的标签

testing = np.tile([True, False], 50)            #创建布尔矩阵，其中为【true，false，true，false...】一共50对
training = ~testing                             #上面取反，这样就是间隔的将一个做训练数据，一个测试

model = learn_model(features[training], virginica[training])  #使用写好的模型训练函数
train_error = accuracy(features[training], virginica[training], model)  #模型测试训练集的正确率
test_error = accuracy(features[testing], virginica[testing], model)

print('''\
Training error was {0:.1%}.
Testing error was {1:.1%} (N = {2}).
'''.format(train_error, test_error, testing.sum()))

上面的threshold模块内容为下面部分。该算法是只计算某一维上特征能否分类，只考虑单独的一维特征：

import numpy as np
def learn_model(features, labels):
    best_acc = -1.0
    for fi in range(features.shape[1]):  #特征矩阵的第二维，即每个样本特征维度
        thresh = features[:,fi].copy()   #读取某一维所有特征
        thresh.sort()                    #排序这维特征
        for t in thresh:                 #t为这一维特征的阈值
            pred = (features[:,fi] > t)  #生成大于阈值的bool索引
            acc = (pred == labels).mean()#读取大于阈值的作为正确类，生成bool索引，并计算准确度，这里就是为了找到最优分类线
            if acc > best_acc:           #如果当前准确度大于记录的，就更新
                best_acc = acc
                best_fi = fi
                best_t = t
    return best_t, best_fi              #返回能够最好分类的那一维及其阈值

def apply_model(features, model):
    t, fi = model
    return features[:,fi] > t           #将大于阈值的返回

def accuracy(features, labels, model):
    preds = apply_model(features, model)
    return np.mean(preds == labels)

最开始的的准确度为94%，相比来说有些过度优化了，因为这里是使用相同的数据来训练和预测，在分了训练集和测试集之后结果为96%和90%（训练50个，测试50个）。然而，这并没有充分利用到数据集内的信息。所以又产生了新的训练方法交叉验证。而且在生成几折的时候，记得要注意每一折的平衡，因为如果测试集中都是来自同一类，那结果就没有代表性了。交叉验证的结果会低于训练集的结果，不过这却是更有说服力的结果。

from load import load_dataset
import numpy as np
from threshold import learn_model, apply_model, accuracy

features,labels = load_dataset('seeds')    #什么数据集不重要，这里是为了说明原理
labels = labels == 'Canadian'

error = 0.0
for fold in range(10):   #10折交叉验证
    training = np.ones(len(features), bool)   #建立标签对应的布尔矩阵
    training[fold::10] = 0      #因为数据是顺序排序，所以通过跳10来提取不同的数据
    testing = ~training         #将训练集取反作为测试集
    model = learn_model(features[training], labels[training])   #模型训练
    test_error = accuracy(features[testing], labels[testing], model)   #模型测试
    error += test_error          #累积误差

error /= 10.0                    #计算10折交叉验证上平均误差，作为该模型的误差

print('Ten fold cross-validated error was {0:.1%}.'.format(error))

然后本书接着以更为复杂的例子种子，其中涉及到多个特征，而且第三个特征是依据前面的两个特征计算出来的。作者也认同一个观点，一个简单的算法在一个很好的特征上的分类结果要好于一个复杂的算法在不好的特征上的应用，而且好的特征应该是基于某些容易区分的变量上可变的，而基于那些比如噪音变量来说是不变的，这才是好特征。所以在你收集数据之前，应该要有一定的判断，什么样的特征才应该被收集。然后将所有的特征给机器来计算和得到最好的分类器。通常来说，有个问题就叫做特征选择，虽然有很多方法被提出就是为了解决这个问题的，不过在实际操作中，还是简单的想法最有效，因为在实际工程上，还有个实时性需要考虑，所以很多时候是要在准确度和实时性上做权衡的。

接下来就是最近邻分类了，

from load import load_dataset
import numpy as np
from knn import learn_model, apply_model, accuracy

features,labels = load_dataset('seeds')   #装载数据集的特征和标签

def cross_validate(features, labels):      #这里使用的是10折交叉验证
    error = 0.0
    for fold in range(10):
        training = np.ones(len(features), bool)
        training[fold::10] = 0
        testing = ~training
        model = learn_model(1, features[training], labels[training])  #训练集进行模型训练
        test_error = accuracy(features[testing], labels[testing], model)  #测试集测试结果
        error += test_error                                               #累加误差

    return error/ 10.0

error = cross_validate(features, labels)    #调用上面的交叉验证函数
print('Ten fold cross-validated error was {0:.1%}.'.format(error))
#下面是对特征的每一维度进行归一化，也就是减均值，除标准方差
features -= features.mean(0)
features /= features.std(0)
error = cross_validate(features, labels)    #调用上面的交叉验证函数
print('Ten fold cross-validated error after z-scoring was {0:.1%}.'.format(error))

其中为什么要做第二步的每一维归一化，是因为当某一维度很多大，其他维度很小的时候，所谓的最近邻，或者说很多其他分类都是这个维度上的特征在起作用，所以需要将每一维的特征都进行归一化，从而使得每一维度的特征都有着相同的贡献能力。
上面的knn模块为：

import numpy as np
def learn_model(k, features, labels):       #Knn，在本文中实例化的时候k =1，见上面的代码，
    return k, features.copy(),labels.copy() #该函数只是复制下训练集

def plurality(xs):                #该函数创建个字典，然后统计K个中多少个投票为同一类
    from collections import defaultdict
    counts = defaultdict(int)
    for x in xs:
        counts[x] += 1    #对同一类投票
    maxv = max(counts.values())    #找到最大可能的
    for k,v in counts.items():    #循环找到最大那个可能对应的类别标签，然后返回
        if v == maxv:
            return k

def apply_model(features, model):
    k, train_feats, labels = model   #解析训练集的K 、特征、标签
    results = []
    for f in features:               #测试集特征一个一个的测试
        label_dist = []
        for t,ell in zip(train_feats, labels):    #训练集的特征和标签
            label_dist.append( (np.linalg.norm(f-t), ell) )  #用到了numpy的范数函数norm（），默认为2范数：平方开根号
        label_dist.sort(key=lambda d_ell: d_ell[0])     #将每个训练样本与当前测试样本计算然后排序
        label_dist = label_dist[:k]                      #取前k个完美匹配的
        results.append(plurality([ell for _,ell in label_dist])) #提取前K个匹配的标签放入plurality（）函数
    return np.array(results)        #返回整个测试集的结果标签

def accuracy(features, labels, model):  #计算准确率的函数
    preds = apply_model(features, model)   
    return np.mean(preds == labels)    #返回预测结果与标签之间相等的均值，也就是正确率

上面norm（）函数官网地址：http://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.norm.html#numpy.linalg.norm

所谓的KNN就是不只是针对一个最近的样本来计算当前样本的归属，而是通过周围的K个样本一起来计算当前样本的归属，通常K可以为5，或者对于很大的数据集来说，K可以更大。

2015/07/24，第0次修改。

posted @ 2015-07-24 17:30 仙守阅读(267) 评论(0) 收藏举报

刷新页面返回顶部

仙守

最美的不是下雨天,是曾与你躲过雨的屋檐!

《机器学习系统设计》（1）

公告