
Data Analysis: Simple Linear Regression and Multiple Linear Regression

Today's overview

  • Mathematical and statistical analysis

Today's content in detail

Mathematical and statistical analysis

  • Advanced mathematics
  • Statistics

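Essential basic models in statistics

Determining whether variables are related

  • Plot the data (scatter plot)

  • Compute the correlation coefficient between the variables (common rule-of-thumb thresholds for |r| are 0.8, 0.5 and 0.3; a value below 0.3 only indicates that the two variables have no linear relationship)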
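The caveat that a small coefficient only rules out a linear relationship can be checked with a short sketch. The data below is hypothetical (it does not come from this post): y is fully determined by x, but through a curve rather than a straight line.

```python
import numpy as np

# Hypothetical data: y depends on x exactly, but along a parabola.
x = np.linspace(-3, 3, 61)
y = x ** 2

# Pearson's r measures only the strength of a linear association.
r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.3f}")
```

Here r comes out close to zero even though y is a function of x, which is why the scatter plot in the first bullet is worth drawing before trusting the coefficient.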

Computing the correlation coefficient in code

import pandas as pd
import numpy as np

# Use arrays so that element-wise arithmetic is explicit
x = np.array([52, 19, 7, 33, 2])
y = np.array([162, 61, 22, 100, 6])

# Mean
xmean = np.mean(x)
# xmean = 22.6
ymean = np.mean(y)
# ymean = 70.2
# Standard deviation (SD); np.std returns the population SD by default
xsd = np.std(x)
# xsd = 18.183509012289132
ysd = np.std(y)
# ysd = 56.29351650057047
# Z-scores
zx = (x - xmean) / xsd
# zx = array([ 1.61684964, -0.19798159, -0.85792022,  0.57194681, -1.13289465])
zy = (y - ymean) / ysd
# zy = array([ 1.63073842, -0.16342912, -0.85622649,  0.52936824, -1.14045105])
# Correlation coefficient: the mean of the products of the z-scores
r = np.sum(zx * zy) / len(zx)
# r = 0.999674032661831
# Compute it directly with numpy's corrcoef function
t = np.corrcoef(x, y)
# t = array([[1.        , 0.99967403],
#            [0.99967403, 1.        ]])
# Compute it directly with pandas' corr method
data = pd.DataFrame({'x': x, 'y': y})
t2 = data.corr()
t2
Output:
x	y
x	1.000000	0.999674
y	0.999674	1.000000
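One detail of the z-score approach is worth spelling out: dividing the sum of products by n is only correct when the standard deviation is the population one (np.std's default, ddof=0). With the sample standard deviation (ddof=1) the divisor becomes n - 1. The sketch below, using the same x and y as above, suggests both conventions lead to the same r; the helper name zscores is made up for illustration.

```python
import numpy as np

x = np.array([52, 19, 7, 33, 2])
y = np.array([162, 61, 22, 100, 6])
n = len(x)

def zscores(values, ddof):
    """Standardize values using the SD computed with the given ddof."""
    return (values - values.mean()) / values.std(ddof=ddof)

# Population SD (ddof=0): average the products over n.
r_population = np.sum(zscores(x, 0) * zscores(y, 0)) / n

# Sample SD (ddof=1): average the products over n - 1.
r_sample = np.sum(zscores(x, 1) * zscores(y, 1)) / (n - 1)

print(r_population, r_sample)
```

Mixing the two conventions (for example ddof=1 with a divisor of n) would give a slightly different, incorrect value.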

Formula derivation (for reference)

y = a + bx + ε
ε = y - (a + bx)
Least squares chooses a and b to minimize the sum of squared errors: Σ ε² = Σ (y - (a + bx))²
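Minimizing the sum of squared residuals Σ (y - (a + bx))² over a and b has a well-known closed-form solution: b = Σ (x - x̄)(y - ȳ) / Σ (x - x̄)² and a = ȳ - b·x̄. The following is a minimal sketch of that calculation, reusing the x and y values from the correlation example as illustrative data; the variable names are my own.

```python
import numpy as np

x = np.array([52, 19, 7, 33, 2], dtype=float)
y = np.array([162, 61, 22, 100, 6], dtype=float)

# Deviations from the means
x_dev = x - x.mean()
y_dev = y - y.mean()

# Closed-form least-squares estimates of the slope and intercept
b = np.sum(x_dev * y_dev) / np.sum(x_dev ** 2)
a = y.mean() - b * x.mean()

# Residuals and the sum of squared errors that the estimates minimize
residuals = y - (a + b * x)
sse = np.sum(residuals ** 2)

print(f"a = {a:.4f}, b = {b:.4f}, SSE = {sse:.4f}")
```

With an intercept in the model, the residuals sum to zero, which is a quick sanity check on the arithmetic.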

Solving in code

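# Import the third-party module
import statsmodels.api as sm
# Signature: sm.formula.ols(formula, data, subset=None, drop_cols=None)
#   formula: the regression model given as a string; 'y~x' denotes a simple linear regression
#   data: the dataset used for modelling
#   subset: a boolean array that selects the subset of data used for modelling
#   drop_cols: the variables to drop from data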
    
# Import the third-party modules
import pandas as pd
import statsmodels.api as sm
income = pd.read_csv('Salary_Data.csv')
# income
# Build the regression model from the income dataset
fit = sm.formula.ols('Salary~YearsExperience', data=income).fit()
# Intercept and slope of the fitted line
fit.params
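Since Salary_Data.csv is not included here, the sketch below fits the same kind of formula on a small made-up dataset to show how the fitted object can be used afterwards. The numbers are hypothetical and only the column names mirror the example above; statsmodels.formula.api is used as an equivalent entry point to sm.formula.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data standing in for Salary_Data.csv (values are made up).
years = list(range(1, 11))
noise = [500, -500] * 5
demo = pd.DataFrame({
    "YearsExperience": years,
    "Salary": [25000 + 9000 * yr + e for yr, e in zip(years, noise)],
})

fit = smf.ols("Salary ~ YearsExperience", data=demo).fit()

# params is a Series indexed by term name; the intercept is added automatically.
intercept = fit.params["Intercept"]
slope = fit.params["YearsExperience"]

# Predict for new observations by passing a DataFrame with the same column name.
new_hires = pd.DataFrame({"YearsExperience": [3.5, 12]})
predicted = fit.predict(new_hires)

print(fit.params)
print("R-squared:", fit.rsquared)
print(predicted)
```

The prediction for each new row is simply intercept + slope × YearsExperience, so fit.params is all that is needed to reproduce it by hand.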

Multiple linear regression in code

# Import modules
import pandas as pd
import statsmodels.api as sm
from sklearn import model_selection
# Load the data
Profit = pd.read_excel(r'C:\Users\Administrator\Desktop\Predict to Profit.xlsx')
Profit
Output:
RD_Spend	Administration	Marketing_Spend	State	Profit
0	165349.20	136897.80	471784.10	New York	192261.83
1	162597.70	151377.59	443898.53	California	191792.06
2	153441.51	101145.55	407934.54	Florida	191050.39
3	144372.41	118671.85	383199.62	New York	182901.99
4	142107.34	91391.77	366168.42	Florida	166187.94
5	131876.90	99814.71	362861.36	New York	156991.12
6	134615.46	147198.87	127716.82	California	156122.51
7	130298.13	145530.06	323876.68	Florida	155752.60
8	120542.52	148718.95	311613.29	New York	152211.77
9	123334.88	108679.17	304981.62	California	149759.96
10	101913.08	110594.11	229160.95	Florida	146121.95
11	100671.96	91790.61	249744.55	California	144259.40
12	93863.75	127320.38	249839.44	Florida	141585.52
13	91992.39	135495.07	252664.93	California	134307.35
14	119943.24	156547.42	256512.92	Florida	132602.65
15	114523.61	122616.84	261776.23	New York	129917.04
16	78013.11	121597.55	264346.06	California	126992.93
17	94657.16	145077.58	282574.31	New York	125370.37
18	91749.16	114175.79	294919.57	Florida	124266.90
19	86419.70	153514.11	0.00	New York	122776.86
20	76253.86	113867.30	298664.47	California	118474.03
21	78389.47	153773.43	299737.29	New York	111313.02
22	73994.56	122782.75	303319.26	Florida	110352.25
23	67532.53	105751.03	304768.73	Florida	108733.99
24	77044.01	99281.34	140574.81	New York	108552.04
25	64664.71	139553.16	137962.62	California	107404.34
26	75328.87	144135.98	134050.07	Florida	105733.54
27	72107.60	127864.55	353183.81	New York	105008.31
28	66051.52	182645.56	118148.20	Florida	103282.38
29	65605.48	153032.06	107138.38	New York	101004.64
30	61994.48	115641.28	91131.24	Florida	99937.59
31	61136.38	152701.92	88218.23	New York	97483.56
32	63408.86	129219.61	46085.25	California	97427.84
33	55493.95	103057.49	214634.81	Florida	96778.92
34	46426.07	157693.92	210797.67	California	96712.80
35	46014.02	85047.44	205517.64	New York	96479.51
36	28663.76	127056.21	201126.82	Florida	90708.19
37	44069.95	51283.14	197029.42	California	89949.14
38	20229.59	65947.93	185265.10	New York	81229.06
39	38558.51	82982.09	174999.30	California	81005.76
40	28754.33	118546.05	172795.67	California	78239.91
41	27892.92	84710.77	164470.71	Florida	77798.83
42	23640.93	96189.63	148001.11	California	71498.49
43	15505.73	127382.30	35534.17	New York	69758.98
44	22177.74	154806.14	28334.72	California	65200.33
45	1000.23	124153.04	1903.93	New York	64926.08
46	1315.46	115816.21	297114.46	Florida	49490.75
47	0.00	135426.92	0.00	California	42559.73
48	542.05	51743.15	0.00	New York	35673.41

# Split the dataset into a training set and a test set
train,test = model_selection.train_test_split(Profit,test_size = 0.2,random_state=1234)
train,test
Output:
(     RD_Spend  Administration  Marketing_Spend       State     Profit
 36   28663.76       127056.21        201126.82     Florida   90708.19
 43   15505.73       127382.30         35534.17    New York   69758.98
 17   94657.16       145077.58        282574.31    New York  125370.37
 10  101913.08       110594.11        229160.95     Florida  146121.95
 21   78389.47       153773.43        299737.29    New York  111313.02
 20   76253.86       113867.30        298664.47  California  118474.03
 22   73994.56       122782.75        303319.26     Florida  110352.25
 1   162597.70       151377.59        443898.53  California  191792.06
 32   63408.86       129219.61         46085.25  California   97427.84
 46    1315.46       115816.21        297114.46     Florida   49490.75
 27   72107.60       127864.55        353183.81    New York  105008.31
 34   46426.07       157693.92        210797.67  California   96712.80
 25   64664.71       139553.16        137962.62  California  107404.34
 33   55493.95       103057.49        214634.81     Florida   96778.92
 0   165349.20       136897.80        471784.10    New York  192261.83
 11  100671.96        91790.61        249744.55  California  144259.40
 7   130298.13       145530.06        323876.68     Florida  155752.60
 3   144372.41       118671.85        383199.62    New York  182901.99
 37   44069.95        51283.14        197029.42  California   89949.14
 6   134615.46       147198.87        127716.82  California  156122.51
 2   153441.51       101145.55        407934.54     Florida  191050.39
 35   46014.02        85047.44        205517.64    New York   96479.51
 45    1000.23       124153.04          1903.93    New York   64926.08
 9   123334.88       108679.17        304981.62  California  149759.96
 16   78013.11       121597.55        264346.06  California  126992.93
 5   131876.90        99814.71        362861.36    New York  156991.12
 28   66051.52       182645.56        118148.20     Florida  103282.38
 40   28754.33       118546.05        172795.67  California   78239.91
 39   38558.51        82982.09        174999.30  California   81005.76
 30   61994.48       115641.28         91131.24     Florida   99937.59
 26   75328.87       144135.98        134050.07     Florida  105733.54
 41   27892.92        84710.77        164470.71     Florida   77798.83
 23   67532.53       105751.03        304768.73     Florida  108733.99
 15  114523.61       122616.84        261776.23    New York  129917.04
 24   77044.01        99281.34        140574.81    New York  108552.04
 12   93863.75       127320.38        249839.44     Florida  141585.52
 38   20229.59        65947.93        185265.10    New York   81229.06
 19   86419.70       153514.11             0.00    New York  122776.86
 47       0.00       135426.92             0.00  California   42559.73,
      RD_Spend  Administration  Marketing_Spend       State     Profit
 8   120542.52       148718.95        311613.29    New York  152211.77
 48     542.05        51743.15             0.00    New York   35673.41
 14  119943.24       156547.42        256512.92     Florida  132602.65
 42   23640.93        96189.63        148001.11  California   71498.49
 29   65605.48       153032.06        107138.38    New York  101004.64
 44   22177.74       154806.14         28334.72  California   65200.33
 4   142107.34        91391.77        366168.42     Florida  166187.94
 31   61136.38       152701.92         88218.23    New York   97483.56
 13   91992.39       135495.07        252664.93  California  134307.35
 18   91749.16       114175.79        294919.57     Florida  124266.90)
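The State column holds text, so it cannot enter a regression as a number. In a statsmodels formula, wrapping a column as C(State) marks it as categorical, and by default it is expanded into 0/1 indicator columns with one level held back as the baseline. The sketch below imitates that expansion with pandas on a few hypothetical rows; it is an illustration of the idea, not the exact internal code path.

```python
import pandas as pd

# A few hypothetical rows with the same three states as the dataset.
states = pd.Series(["New York", "California", "Florida", "New York"], name="State")

# drop_first=True removes the first level in sorted order (California),
# which then acts as the baseline that the other states are compared against.
dummies = pd.get_dummies(states, drop_first=True, dtype=int)
print(dummies)
```

A California row is all zeros in this encoding, so the coefficients for Florida and New York are read as differences in Profit relative to California, holding the other variables fixed.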

# Build the model on the train dataset
model = sm.formula.ols('Profit~RD_Spend+Administration+Marketing_Spend+C(State)', data=train).fit()
# print('Partial regression coefficients of the model:\n', model.params)
# Drop the Profit variable from the test dataset and predict with the remaining independent variables
test_X = test.drop(labels='Profit', axis=1)

pred = model.predict(exog=test_X)
print('Predicted vs. actual values:\n', pd.DataFrame({'Prediction': pred, 'Real': test.Profit}))
Output:
Predicted vs. actual values:
        Prediction       Real
8   150621.345802  152211.77
48   55513.218079   35673.41
14  150369.022458  132602.65
42   74057.015562   71498.49
29  103413.378282  101004.64
44   67844.850378   65200.33
4   173454.059691  166187.94
31   99580.888895   97483.56
13  128147.138396  134307.35
18  130693.433835  124266.90
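Reading the differences row by row is hard, so a single error figure helps. As a rough sketch, the code below takes the ten prediction and actual pairs printed above (rounded to two decimals) and computes the mean absolute error and the root mean squared error; in the original session the same could be done directly on pred and test.Profit.

```python
import numpy as np

# Prediction and Real columns copied from the output above (rounded).
prediction = np.array([150621.35, 55513.22, 150369.02, 74057.02, 103413.38,
                       67844.85, 173454.06, 99580.89, 128147.14, 130693.43])
real = np.array([152211.77, 35673.41, 132602.65, 71498.49, 101004.64,
                 65200.33, 166187.94, 97483.56, 134307.35, 124266.90])

errors = prediction - real

# Mean absolute error: the typical size of a miss, in the units of Profit.
mae = np.mean(np.abs(errors))

# Root mean squared error: penalizes large misses more heavily than MAE.
rmse = np.sqrt(np.mean(errors ** 2))

print(f"MAE  = {mae:.2f}")
print(f"RMSE = {rmse:.2f}")
```

RMSE is noticeably larger than MAE here because two rows (index 48 and 14) are off by far more than the rest.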

posted @ 2020-10-21 16:14  Satan—yuan