ML_inAction: Chapter 8 (1) Linear Regression

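I. Data loading and the standard LR function

First, a look at the NumPy matrix operations used below.

NumPy arrays can also carry out matrix operations, but they behave differently from the matrix class:

1. mat() and array() accept different input formats

(1) mat() accepts either a string, with rows separated by semicolons (;) and entries within a row separated by spaces or commas, or a nested list.

(2) array() builds a matrix only from a nested list; it does not parse the string form.

2. Matrices created by mat() and array() multiply differently

(1) For a mat() matrix, the matrix product can be written with the asterisk (*) or with .dot(); both give the same result. Element-wise multiplication requires numpy.multiply().

(2) For an array() array, the matrix product is written with .dot() (or the @ operator). The asterisk (*) means element-wise multiplication, the same as numpy.multiply().

For example, create the following matrices: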

a = numpy.mat([[1, 3], [5, 7]])

b = numpy.mat([[2, 4], [6, 8]])

c = numpy.array([[1, 3], [5, 7]])

d = numpy.array([[2, 4], [6, 8]])

Then a * b = a.dot(b) = c.dot(d), all of which are the matrix product. (c.dot(d) can also be written numpy.dot(c, d).)

And numpy.multiply(a, b) = c * d = numpy.multiply(c, d), all of which are the element-wise product.
---------------------
Source: https://blog.csdn.net/Build_Tiger/article/details/79848808
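The difference between the two products is easy to check by hand on the 2x2 matrices above. The sketch below is only an illustration; it uses np.matrix directly because, as far as I know, the np.mat alias was removed in NumPy 2.0 (np.asmatrix and np.matrix remain). The variable names are my own.

```python
import numpy as np

a = np.array([[1, 3], [5, 7]])
b = np.array([[2, 4], [6, 8]])

# Matrix product: rows of a times columns of b.
matrix_product = a @ b            # same as a.dot(b) or np.dot(a, b)

# Element-wise product: multiplies entries at matching positions.
elementwise_product = a * b       # same as np.multiply(a, b)

# The matrix class accepts a nested list, or a string with ';' between rows.
m = np.matrix('1 3; 5 7')
n = np.matrix([[2, 4], [6, 8]])
mat_product = m * n               # for np.matrix, * is the matrix product

print(matrix_product)
print(elementwise_product)
```

Here the matrix product is [[20, 28], [52, 76]] and the element-wise product is [[2, 12], [30, 56]].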

3. Matrix inverse, transpose, and determinant

Matrix inverse: mat.I

Array inverse:

import numpy as np
from numpy.linalg import inv
a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 19]])
b = inv(a)

Matrix transpose: mat.T

Determinant of a matrix (or array): numpy.linalg.det(mat)
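A quick way to convince yourself that these three operations fit together is to run them on the 3x3 array from the inverse example. This is just a sanity-check sketch with names of my own choosing, not part of the book's listing.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 19.0]])

det_A = np.linalg.det(A)      # non-zero, so A is invertible
A_inv = np.linalg.inv(A)
A_T = A.T                     # transpose: rows become columns

# A times its inverse should be the identity up to rounding error.
identity_check = np.allclose(A @ A_inv, np.eye(3))

print(det_A, identity_check)
```

Expanding the determinant by hand gives -30, so the inverse exists; the printed value may differ from -30 by rounding error.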

2. Code implementation

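# Listing 8.1: data loading function and standard regression function
from numpy import *

def loadDataSet(filename):
    # Each line holds tab-separated features followed by the target value.
    with open(filename) as f:
        lines = f.readlines()
    # Count the features from the first line without consuming it,
    # so the first data row is kept.
    numFeature = len(lines[0].split('\t')) - 1
    matX = []
    matY = []
    for line in lines:
        curLine = line.strip().split('\t')
        lineArr = []
        for i in range(numFeature):
            lineArr.append(float(curLine[i]))
        matX.append(lineArr)
        matY.append(float(curLine[-1]))
    return matX,matY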

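def standRegres(xArr,yArr):
    Xmat = mat(xArr)
    Ymat = mat(yArr).T    # column vector of targets
    # X^T X must be invertible for the normal equation to have a unique solution
    if linalg.det(Xmat.T*Xmat) == 0:
        print('xTx is singular, cannot be inverted!')
        return
    best_w = (Xmat.T*Xmat).I*Xmat.T*Ymat    # w = (X^T X)^-1 X^T y
    return best_w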

# Test 8.1: data loading function and standard regression function
import regression
from numpy import *

xArr,yArr = regression.loadDataSet('ex0.txt')
print(xArr[-1])

ws = regression.standRegres(xArr,yArr)
print(ws)

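###
Note: this run skipped the first data row. Counting the features with f.readline() consumes the first line, so a following f.readlines() starts at the second row; reading all lines first avoids that. The values below come from the run with the first row missing.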
[1.0, 0.116163]
[[3.00681047]
 [1.69667188]]

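Plotting:

For flatten(), see: https://blog.csdn.net/taotiezhengfeng/article/details/72303127

The function returns a copy of the data collapsed into a single row:

Array: arr.flatten() returns a 1-D array of shape (n,), where n is the number of elements.

Matrix: mat.flatten() returns a 1xn matrix, where n is the number of elements in the matrix.

mat.flatten().A returns a 1xn array (2-D), e.g. array([[ 1, 2, 3, 5, 6, 7, 8, 9, 10]]), with 1 row and 9 columns.

mat.flatten().A[0] takes the first row of that array and returns a 1-D array of n numbers, array([ 1, 2, 3, 5, 6, 7, 8, 9, 10]). It has no notion of rows and columns; it is just a sequence of numbers with shape (9,).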
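The shapes are the part that trips people up, so a small sketch that prints them may help. The 3x3 values are taken from the example above; the variable names and the use of the .A1 shortcut are my own additions.

```python
import numpy as np

arr = np.array([[1, 2, 3], [5, 6, 7], [8, 9, 10]])
m = np.matrix(arr)

flat_arr = arr.flatten()          # ndarray, shape (9,)
flat_mat = m.flatten()            # matrix, shape (1, 9): a matrix is always 2-D
flat_mat_2d = m.flatten().A       # ndarray, shape (1, 9)
flat_mat_1d = m.flatten().A[0]    # ndarray, shape (9,)
also_1d = m.A1                    # shortcut for the same 1-D array

for name, value in [("arr.flatten()", flat_arr),
                    ("mat.flatten()", flat_mat),
                    ("mat.flatten().A", flat_mat_2d),
                    ("mat.flatten().A[0]", flat_mat_1d)]:
    print(name, value.shape)
```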

3-D array:

a=np.array([[[1],[2]] ,[[3],[4]] ,[[5],[6]]])

print(a.shape)

>>(3,2,1) i.e. three 2x1 arrays

For more on array operations, see: https://www.cnblogs.com/Lee-yl/p/8625402.html

https://www.cnblogs.com/xzcfightingup/p/7598293.html

https://www.jb51.net/article/130651.htm

ax.scatter(x,y,...): the arguments x and y are sequences of equal length with shape (n,), i.e. 1-D arrays.

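mat.sort(0) sorts along axis 0: each column is sorted independently and in place, so the rows of the matrix or array may no longer line up.

mat.sort(1) sorts along axis 1: each row is sorted independently.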
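Because sort(0) works in place and treats every column separately, it is worth seeing next to argsort, which keeps whole rows (and their targets) together. The tiny data set below is made up for illustration.

```python
import numpy as np

X = np.array([[1.0, 0.9],
              [1.0, 0.1],
              [1.0, 0.5]])
y = np.array([9.0, 1.0, 5.0])

# Sorting a copy column by column leaves the original untouched ...
X_sorted_cols = X.copy()
X_sorted_cols.sort(0)

# ... whereas argsort on one column yields an index that reorders whole rows,
# so X and y stay paired.
order = X[:, 1].argsort()
X_by_row = X[order]
y_by_row = y[order]

print(X_sorted_cols)
print(order, y_by_row)
```

With a constant first column both approaches give the same sorted X, which is why the plotting code gets away with sort(0); with more than one varying feature only the argsort version keeps rows intact.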

For plot(), see https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html?highlight=plot#matplotlib.pyplot.plot

>>> plot(x, y)        # plot x and y using default line style and color
>>> plot(x, y, 'bo') # plot x and y using blue circle markers
>>> plot(y)           # plot y using x as index array 0..N-1
>>> plot(y, 'r+')     # ditto, but with red plusses

# Test 8.1: plot the data and the fitted line
import regression
from numpy import *

xArr,yArr = regression.loadDataSet('ex0.txt')
print(xArr[-1])

ws = regression.standRegres(xArr,yArr)
print(ws)

xMat = mat(xArr)
yMat = mat(yArr)
yHat = xMat*ws

import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)

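# scatter() expects 1-D arrays, hence flatten().A[0]
ax.scatter(xMat[:,1].flatten().A[0],yMat[0,:].flatten().A[0])

# Sort a copy so xMat keeps its original row order;
# plain assignment (xCopy = xMat) would only alias xMat.
xCopy = xMat.copy()
xCopy.sort(0)
yHat = xCopy*ws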

ax.plot(xCopy[:,1],yHat)

plt.show()

II. Locally weighted linear regression (LWLR)

NumPy's ones, zeros, and eye functions: https://blog.csdn.net/XWQsharp/article/details/79964175

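# Listing 8.2: locally weighted linear regression function (LWLR)
# k is the width (bandwidth) of the Gaussian kernel
def lwlr(testPoint,xArr,yArr,k):
    Xmat = mat(xArr)
    Ymat = mat(yArr).T
    m = shape(Xmat)[0]
    weight = eye(m)    # diagonal (symmetric) matrix; left-multiplying by it scales each row
    for i in range(m):    # uses every training sample, slightly unlike kNN
        diff_x = Xmat[i,:] - testPoint
        # squared distance in the numerator; the Gaussian kernel printed in the book omits the square
        weight[i,i] = exp((diff_x*diff_x.T)[0,0]/(-2*k**2))
    xTx = Xmat.T*weight*Xmat
    if linalg.det(xTx) == 0.0:
        print('xTx is singular, cannot be inverted!')
        return
    best_w = xTx.I*Xmat.T*weight*Ymat
    return testPoint*best_w

def lwlrTest(testArr,xArr,yArr,k):
    m = shape(testArr)[0]
    yHat = zeros(m)
    for i in range(m):
        # lwlr returns a 1x1 matrix; take its single entry
        yHat[i] = lwlr(testArr[i],xArr,yArr,k)[0,0]
    return yHat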

# Test 8.2: locally weighted linear regression
import regression
from numpy import *
xArr,yArr = regression.loadDataSet('ex0.txt')
print(yArr[0],'\n')

k = 1
yHat = regression.lwlrTest(xArr,xArr,yArr,k)  # run lwlr once for every training point

# Sort
xMat = mat(xArr)
strInd = xMat[:,1].argsort(0)    # indices that sort the feature column (column 0 is all ones)
xSort = xMat[strInd][:,0,:]
# Plot
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(xSort[:,1],yHat[strInd])
ax.scatter(xMat[:,1].flatten().A[0],mat(yArr).T.flatten().A[0],s=2,c='red')
plt.show()
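For comparison, the same idea can be written with plain ndarrays. The sketch below is my own variant, not the book's listing: lwlr_predict is a hypothetical name, the data is synthetic (a line plus a sine wiggle, not ex0.txt), and the weighted normal equation is solved with np.linalg.solve instead of an explicit inverse.

```python
import numpy as np

def lwlr_predict(test_point, X, y, k=1.0):
    """Predict y at test_point with locally weighted linear regression."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    test_point = np.asarray(test_point, dtype=float)
    # Squared Euclidean distance from every training row to the query point.
    sq_dist = np.sum((X - test_point) ** 2, axis=1)
    w = np.exp(-sq_dist / (2.0 * k ** 2))      # Gaussian kernel weights
    XtW = X.T * w                               # same as X.T @ diag(w)
    theta = np.linalg.solve(XtW @ X, XtW @ y)  # weighted normal equation
    return float(test_point @ theta)

# Synthetic data (hypothetical): a straight line plus a small sine wiggle.
x = np.linspace(0.0, 1.0, 100)
X = np.column_stack([np.ones_like(x), x])
y = 3.0 + 1.7 * x + 0.1 * np.sin(30.0 * x)

y_hat_wide = np.array([lwlr_predict(p, X, y, k=1.0) for p in X])
y_hat_narrow = np.array([lwlr_predict(p, X, y, k=0.01) for p in X])

print(np.mean((y_hat_wide - y) ** 2), np.mean((y_hat_narrow - y) ** 2))
```

Multiplying X.T by the weight vector avoids building the full m x m diagonal matrix, which matters once m grows. On this noise-free data the narrow kernel has the smaller training error, though training error alone says nothing about overfitting.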

With k=1, the bias is large: underfitting.

With k=0.01, the fit is reasonable.

With k=0.003, the curve follows the noise: overfitting.
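One way to see why k has this effect is to count how many training points still carry a noticeable weight for a given query point. The sketch below assumes 200 evenly spaced feature values, a query at 0.5, and an arbitrary threshold of 1e-3; all three are made-up choices for illustration.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 200)      # hypothetical 1-D feature values
query = 0.5

def significant_weights(k, threshold=1e-3):
    """Count training points whose Gaussian kernel weight exceeds threshold."""
    w = np.exp(-(x - query) ** 2 / (2.0 * k ** 2))
    return int(np.sum(w > threshold))

counts = {k: significant_weights(k) for k in (1.0, 0.01, 0.003)}
print(counts)
```

With k=1 every point counts, so the fit is close to ordinary linear regression; as k shrinks, only a handful of neighbours matter and the fit starts chasing individual points.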

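III. Summary

1. Standard linear regression is fairly simple: minimizing the mean squared error and differentiating with respect to the weight vector yields the normal equation. It can also be understood more intuitively from a linear algebra viewpoint.

2. The Pearson correlation coefficient can be used to assess how well the predictions match the true values, i.e. how correlated they are.

3. Linear regression tends to underfit. Locally weighted linear regression pays more attention to local structure and gives a smooth curve that tracks the data more closely, but every prediction requires a pass over the whole training set, which is computationally expensive. When k is small, many samples have negligible weight and can be ignored, which reduces the computation.

4. The idea behind locally weighted linear regression resembles kNN: the points near the input are used for the linear regression. It can likewise be viewed as a system of linear equations: the weight matrix acts on the error between the projection and the original vector, and discounts the error of points far from the input (the distance enters through the kernel function).

5. When k is too small, overfitting occurs.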
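The correlation check mentioned in point 2 takes one call to np.corrcoef. The sketch below assumes synthetic data and a least-squares fit of my own, so the number it prints says nothing about ex0.txt.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=200)
y = 3.0 + 1.7 * x + rng.normal(0.0, 0.05, size=200)

X = np.column_stack([np.ones_like(x), x])
w = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ w

# corrcoef returns the 2x2 correlation matrix of (y_hat, y);
# the off-diagonal entry is the Pearson correlation between them.
corr_matrix = np.corrcoef(y_hat, y)
r = corr_matrix[0, 1]

print(corr_matrix)
```

Both inputs must be 1-D (or row vectors), which is why the matrix-based code in the book transposes yHat before passing it in.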

posted @ 2018-11-11 00:21 bo0814
