Linear Regression in Python

 

Analysis: the relationship between women's height and weight

The dataset comes from The World Almanac and Book of Facts (1975).
It gives the height and weight of 15 women aged 30 to 39.
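The code below reads a local file named women.csv, which is not included with the post. As a minimal sketch for readers who do not have it, the file can be rebuilt from the values of R's built-in `women` dataset. This assumes the author's file holds those same values (they agree with the summary statistics shown in section 1.1); the helper name `write_women_csv` and the one-based index column are illustrative choices.

```python
import csv

# Assumption: these are the values of R's built-in `women` dataset
# (height in inches, weight in pounds), not read from the author's file.
HEIGHTS = list(range(58, 73))
WEIGHTS = [115, 117, 120, 123, 126, 129, 132, 135,
           139, 142, 146, 150, 154, 159, 164]


def write_women_csv(path="women.csv"):
    """Write the data with a leading index column, so that
    pd.read_csv(path, index_col=0) yields 'height' and 'weight' columns."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["", "height", "weight"])  # empty name for the index column
        for i, (h, w) in enumerate(zip(HEIGHTS, WEIGHTS), start=1):
            writer.writerow([i, h, w])
```

Calling `write_women_csv()` once in the working directory creates the file that the later `pd.read_csv("women.csv", index_col=0)` call expects.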

1. Linear regression

# packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import statsmodels.api as sm

1.1 Data preparation

data = pd.read_csv("women.csv",index_col = 0)
X = data["height"]
X = sm.add_constant(X) # add the intercept column
y = data["weight"]
data.describe() # descriptive statistics
 
           height      weight
count   15.000000   15.000000
mean    65.000000  136.733333
std      4.472136   15.498694
min     58.000000  115.000000
25%     61.500000  124.500000
50%     65.000000  135.000000
75%     68.500000  148.000000
max     72.000000  164.000000
 
plt.scatter(data["height"],data["weight"])
plt.show()

1.2 Model fitting

model1 = sm.OLS(y,X) # ordinary least squares model
result = model1.fit() # fit the model
print(result.summary()) # print the fit summary
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 weight   R-squared:                       0.991
Model:                            OLS   Adj. R-squared:                  0.990
Method:                 Least Squares   F-statistic:                     1433.
Date:                Wed, 01 Apr 2020   Prob (F-statistic):           1.09e-14
Time:                        21:40:44   Log-Likelihood:                -26.541
No. Observations:                  15   AIC:                             57.08
Df Residuals:                      13   BIC:                             58.50
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -87.5167      5.937    -14.741      0.000    -100.343     -74.691
height         3.4500      0.091     37.855      0.000       3.253       3.647
==============================================================================
Omnibus:                        2.396   Durbin-Watson:                   0.315
Prob(Omnibus):                  0.302   Jarque-Bera (JB):                1.660
Skew:                           0.789   Prob(JB):                        0.436
Kurtosis:                       2.596   Cond. No.                         982.
==============================================================================
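The coefficients and R-squared in the summary can be cross-checked without statsmodels. The following is a minimal sketch using plain NumPy least squares; it assumes the data are the values of R's `women` dataset (heights 58 to 72, which match the design matrix shown later in section 2.1), and the variable names are illustrative.

```python
import numpy as np

# Assumption: values of R's built-in `women` dataset; load women.csv instead if available.
height = np.arange(58, 73, dtype=float)
weight = np.array([115, 117, 120, 123, 126, 129, 132, 135,
                   139, 142, 146, 150, 154, 159, 164], dtype=float)

# Design matrix with an intercept column, as sm.add_constant builds it.
X = np.column_stack((np.ones_like(height), height))

# Least-squares solution of X @ beta ~ weight.
beta, *_ = np.linalg.lstsq(X, weight, rcond=None)
intercept, slope = beta

# R-squared = 1 - SS_res / SS_tot.
fitted = X @ beta
ss_res = np.sum((weight - fitted) ** 2)
ss_tot = np.sum((weight - weight.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(f"const  = {intercept:.4f}")
print(f"height = {slope:.4f}")
print(f"R^2    = {r_squared:.3f}")
```

Under that assumption the three printed values should agree with the `const`, `height`, and `R-squared` entries of the summary table.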
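# Individual quantities can be read directly from the results object:
result.params # regression coefficients
result.rsquared # goodness of fit (R-squared)
result.f_pvalue # p-value of the F-statistic
sm.stats.stattools.durbin_watson(result.resid) # Durbin-Watson statistic, tests for autocorrelation in the residuals
sm.stats.stattools.jarque_bera(result.resid) # Jarque-Bera test of residual normality; returns (JB, JB p-value, skewness, kurtosis)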
(1.6595730644309838, 0.4361423787323849, 0.7893583826332282, 2.596304225738997)
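To make the two residual diagnostics less of a black box, here is a hedged sketch that recomputes them from their textbook definitions with NumPy. It assumes the `women` data values from R and uses biased (population) moments for skewness and kurtosis; the names `dw`, `jb`, and `jb_pvalue` are illustrative.

```python
import math
import numpy as np

# Assumption: values of R's built-in `women` dataset.
height = np.arange(58, 73, dtype=float)
weight = np.array([115, 117, 120, 123, 126, 129, 132, 135,
                   139, 142, 146, 150, 154, 159, 164], dtype=float)

# Residuals of the straight-line fit.
slope, intercept = np.polyfit(height, weight, 1)
resid = weight - (intercept + slope * height)

# Durbin-Watson: squared successive differences over squared residuals.
dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

# Jarque-Bera: built from sample skewness and kurtosis (biased moments).
n = resid.size
centered = resid - resid.mean()
m2 = np.mean(centered ** 2)
skew = np.mean(centered ** 3) / m2 ** 1.5
kurt = np.mean(centered ** 4) / m2 ** 2
jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)

# The chi-square survival function with 2 degrees of freedom is exp(-x / 2).
jb_pvalue = math.exp(-jb / 2.0)

print(f"DW = {dw:.3f}, JB = {jb:.4f}, p = {jb_pvalue:.4f}, "
      f"skew = {skew:.4f}, kurtosis = {kurt:.4f}")
```

A Durbin-Watson value near 2 indicates no first-order autocorrelation, so a value as low as the one in the summary suggests that neighbouring residuals (ordered by height) tend to share the same sign.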

1.3 Prediction

y_pre = result.predict()
y_pre
array([112.58333333, 116.03333333, 119.48333333, 122.93333333,
       126.38333333, 129.83333333, 133.28333333, 136.73333333,
       140.18333333, 143.63333333, 147.08333333, 150.53333333,
       153.98333333, 157.43333333, 160.88333333])
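Called without arguments, `predict()` returns fitted values for the training data only. For new heights the design matrix needs the same columns as the one used for fitting, including the constant. The sketch below does this with NumPy under the assumption that the data are R's `women` values; the new heights are made-up illustration inputs. With statsmodels, the same `new_X` array could be passed to `result.predict(new_X)`.

```python
import numpy as np

# Assumption: values of R's built-in `women` dataset.
height = np.arange(58, 73, dtype=float)
weight = np.array([115, 117, 120, 123, 126, 129, 132, 135,
                   139, 142, 146, 150, 154, 159, 164], dtype=float)

# Fit weight = b0 + b1 * height by least squares.
X = np.column_stack((np.ones_like(height), height))
beta, *_ = np.linalg.lstsq(X, weight, rcond=None)

# Hypothetical new heights, inside the observed range of 58 to 72.
new_height = np.array([60.5, 66.5, 71.5])

# Same column layout as the training design matrix: [constant, height].
new_X = np.column_stack((np.ones_like(new_height), new_height))
pred = new_X @ beta

for h, p in zip(new_height, pred):
    print(f"height {h:.1f} -> predicted weight {p:.2f}")
```

Predictions outside the observed height range are extrapolations and deserve less trust than those inside it.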

1.4 Model evaluation

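# Visualize the result
plt.rcParams['font.family']="simHei" # font with Chinese glyph support (only needed for Chinese labels)
plt.plot(data["height"], data["weight"],"o")
plt.plot(data["height"], y_pre)
plt.title('Linear regression of weight on height (women)')
Text(0.5, 1.0, 'Linear regression of weight on height (women)')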
 

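Although R-squared is 0.991, the plot above shows that a straight line is not an adequate fit: the residuals are positive at both ends of the height range and negative in the middle, and the Durbin-Watson statistic of 0.315 points to the same systematic pattern. We therefore try polynomial regression.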
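The curvature is easiest to see in the signs of the residuals. The following minimal sketch prints one sign per observation, ordered by height; it assumes the data are the values of R's `women` dataset.

```python
import numpy as np

# Assumption: values of R's built-in `women` dataset.
height = np.arange(58, 73, dtype=float)
weight = np.array([115, 117, 120, 123, 126, 129, 132, 135,
                   139, 142, 146, 150, 154, 159, 164], dtype=float)

# Residuals of the straight-line fit, ordered by height.
slope, intercept = np.polyfit(height, weight, 1)
resid = weight - (intercept + slope * height)

# One character per observation: '+' above the line, '-' below it.
signs = "".join("+" if r > 0 else "-" for r in resid)
print(signs)  # ++++--------+++
```

Random scatter around a well-specified line would mix the signs; a long run of one sign in the middle flanked by the other sign at both ends is the typical footprint of a missing curved term.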

2. Polynomial regression

2.1 Data preparation

data = pd.read_csv("women.csv",index_col = 0)
X = data["height"]
y = data["weight"]
X = np.column_stack((X,np.power(X,2),np.power(X,3))) # build the cubic polynomial terms: height, height^2, height^3
X = sm.add_constant(X) # add the intercept column
X
array([[1.00000e+00, 5.80000e+01, 3.36400e+03, 1.95112e+05],
       [1.00000e+00, 5.90000e+01, 3.48100e+03, 2.05379e+05],
       [1.00000e+00, 6.00000e+01, 3.60000e+03, 2.16000e+05],
       [1.00000e+00, 6.10000e+01, 3.72100e+03, 2.26981e+05],
       [1.00000e+00, 6.20000e+01, 3.84400e+03, 2.38328e+05],
       [1.00000e+00, 6.30000e+01, 3.96900e+03, 2.50047e+05],
       [1.00000e+00, 6.40000e+01, 4.09600e+03, 2.62144e+05],
       [1.00000e+00, 6.50000e+01, 4.22500e+03, 2.74625e+05],
       [1.00000e+00, 6.60000e+01, 4.35600e+03, 2.87496e+05],
       [1.00000e+00, 6.70000e+01, 4.48900e+03, 3.00763e+05],
       [1.00000e+00, 6.80000e+01, 4.62400e+03, 3.14432e+05],
       [1.00000e+00, 6.90000e+01, 4.76100e+03, 3.28509e+05],
       [1.00000e+00, 7.00000e+01, 4.90000e+03, 3.43000e+05],
       [1.00000e+00, 7.10000e+01, 5.04100e+03, 3.57911e+05],
       [1.00000e+00, 7.20000e+01, 5.18400e+03, 3.73248e+05]])
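The same design matrix can be built in one call with `np.vander`, which is convenient when the polynomial degree changes. This is an alternative sketch, not what the post uses; it assumes heights 58 to 72 as shown in the matrix above.

```python
import numpy as np

# Heights as they appear in the second column of the matrix above.
height = np.arange(58, 73, dtype=float)

# Construction used in the post: stack the powers, then prepend a constant.
X_stack = np.column_stack((np.ones_like(height), height,
                           np.power(height, 2), np.power(height, 3)))

# Vandermonde matrix with increasing powers: columns 1, x, x^2, x^3.
X_vander = np.vander(height, N=4, increasing=True)

print(np.allclose(X_stack, X_vander))  # True
```

With `increasing=True` the column order matches the `const, x1, x2, x3` order in the regression summary; the default order of `np.vander` is the reverse.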

2.2 Model fitting

model2 = sm.OLS(y,X)
result = model2.fit()
print(result.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 weight   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 1.679e+04
Date:                Wed, 01 Apr 2020   Prob (F-statistic):           2.07e-20
Time:                        22:09:27   Log-Likelihood:                 1.3441
No. Observations:                  15   AIC:                             5.312
Df Residuals:                      11   BIC:                             8.144
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       -896.7476    294.575     -3.044      0.011   -1545.102    -248.393
x1            46.4108     13.655      3.399      0.006      16.356      76.466
x2            -0.7462      0.211     -3.544      0.005      -1.210      -0.283
x3             0.0043      0.001      3.940      0.002       0.002       0.007
==============================================================================
Omnibus:                        0.028   Durbin-Watson:                   2.388
Prob(Omnibus):                  0.986   Jarque-Bera (JB):                0.127
Skew:                           0.049   Prob(JB):                        0.939
Kurtosis:                       2.561   Cond. No.                     1.25e+09
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.25e+09. This might indicate that there are
strong multicollinearity or other numerical problems.
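The second warning arises because height, height squared, and height cubed are almost collinear over the range 58 to 72. One common remedy is to standardize height before taking powers. The sketch below compares the condition numbers and checks that the fitted values are unchanged. It assumes the data are R's `women` values, and the helper `cubic_design` is a hypothetical name.

```python
import numpy as np

# Assumption: values of R's built-in `women` dataset.
height = np.arange(58, 73, dtype=float)
weight = np.array([115, 117, 120, 123, 126, 129, 132, 135,
                   139, 142, 146, 150, 154, 159, 164], dtype=float)


def cubic_design(x):
    """Design matrix with columns 1, x, x^2, x^3."""
    return np.vander(x, N=4, increasing=True)


# Raw powers of height versus powers of standardized height.
X_raw = cubic_design(height)
z = (height - height.mean()) / height.std(ddof=1)
X_std = cubic_design(z)

cond_raw = np.linalg.cond(X_raw)
cond_std = np.linalg.cond(X_std)

# Both matrices span the same column space, so the fitted values agree.
fit_raw = X_raw @ np.linalg.lstsq(X_raw, weight, rcond=None)[0]
fit_std = X_std @ np.linalg.lstsq(X_std, weight, rcond=None)[0]

print(f"condition number, raw powers:          {cond_raw:.3g}")
print(f"condition number, standardized powers: {cond_std:.3g}")
print("same fitted values:", np.allclose(fit_raw, fit_std, atol=1e-4))
```

Standardizing changes the coefficients and their interpretation, but not the fitted curve, so it addresses the numerical warning without altering the model's predictions.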

2.3 Prediction

y_pre = result.predict()
y_pre
array([114.63856209, 117.40676937, 120.18801264, 123.00780722,
       125.89166846, 128.86511168, 131.95365223, 135.18280543,
       138.57808662, 142.16501113, 145.9690943 , 150.01585147,
       154.33079796, 158.93944911, 163.86732026])

2.4 Model evaluation

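# Visualize the result
plt.rcParams['font.family']="simHei" # font with Chinese glyph support (only needed for Chinese labels)
plt.plot(data["height"], data["weight"],"o")
plt.plot(data["height"], y_pre)
plt.title('Cubic polynomial regression of weight on height (women)')
Text(0.5, 1.0, 'Cubic polynomial regression of weight on height (women)')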
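Beyond comparing the plots, the two fits can be compared numerically through the AIC values reported in the summaries (lower is better). The following sketch recomputes AIC from the Gaussian log-likelihood, counting the regression coefficients as parameters. It assumes the data are R's `women` values; `gaussian_aic` and `poly_fitted` are illustrative helper names.

```python
import numpy as np

# Assumption: values of R's built-in `women` dataset.
height = np.arange(58, 73, dtype=float)
weight = np.array([115, 117, 120, 123, 126, 129, 132, 135,
                   139, 142, 146, 150, 154, 159, 164], dtype=float)


def poly_fitted(x, y, degree):
    """Fitted values of a least-squares polynomial of the given degree."""
    X = np.vander(x, N=degree + 1, increasing=True)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta


def gaussian_aic(y, fitted, n_params):
    """AIC = 2k - 2 * log-likelihood, with the error variance set to SSE / n."""
    n = y.size
    sse = np.sum((y - fitted) ** 2)
    llf = -0.5 * n * (np.log(2.0 * np.pi) + np.log(sse / n) + 1.0)
    return 2.0 * n_params - 2.0 * llf


aic_linear = gaussian_aic(weight, poly_fitted(height, weight, 1), n_params=2)
aic_cubic = gaussian_aic(weight, poly_fitted(height, weight, 3), n_params=4)

print(f"AIC linear: {aic_linear:.2f}")
print(f"AIC cubic:  {aic_cubic:.2f}")
```

The cubic model pays a penalty for two extra coefficients yet still has a much lower AIC, which supports preferring it over the straight line for these data.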
 

 

posted @ 2020-04-01 22:25  从前有座山,山上