Predicting Medical Charges from Multiple Factors with Multiple Linear Regression

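Main steps:

1. Import packages
2. Import the dataset
3. Data preprocessing
- 3.1 Check for missing values
- 3.2 Label encoding & one-hot encoding
- 3.3 Get the independent and dependent variables
- 3.4 Split into training and test sets
4. Build the multiple linear regression model
5. Get the model expression
6. Predict on the test set
7. Compute the model MSE
8. Draw a violin plot of smoking versus medical charges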

Dataset link: https://www.cnblogs.com/ojbtospark/p/16005626.html

1. Import packages

In [1]:

# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

2. Import the dataset

In [2]:

# Import the dataset
data = pd.read_csv('insurance.csv')
data.head(5)

Out[2]:

	age	sex	bmi	children	smoker	region	charges
0	19	female	27.900	0	yes	southwest	16884.92400
1	18	male	33.770	1	no	southeast	1725.55230
2	28	male	33.000	3	no	southeast	4449.46200
3	33	male	22.705	0	no	northwest	21984.47061
4	32	male	28.880	0	no	northwest	3866.85520

3. Data preprocessing

3.1 Check for missing values

In [3]:

# Count missing values in each column
null_df = data.isnull().sum()
null_df

Out[3]:

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

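3.2 Label encoding & one-hot encoding

In [4]:

# One-hot encode the categorical columns (sex, smoker, region).
# drop_first=True drops one level per column to avoid redundant dummy columns;
# dtype=int keeps the dummies as 0/1 integers (recent pandas versions default to bool).
data = pd.get_dummies(data, drop_first=True, dtype=int)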
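To make the effect of drop_first concrete, here is a minimal sketch on a toy frame. The four rows are made up for illustration and only mimic the categorical columns of insurance.csv; they are not the real data.

```python
import pandas as pd

# Toy frame with the same categorical columns as insurance.csv (illustrative values)
toy = pd.DataFrame({
    'age': [19, 18, 28, 33],
    'sex': ['female', 'male', 'male', 'male'],
    'smoker': ['yes', 'no', 'no', 'no'],
    'region': ['southwest', 'southeast', 'northwest', 'northeast'],
})

# drop_first=True drops the alphabetically first level of each categorical column,
# so 'female', 'no' and 'northeast' become the all-zero baseline
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

The remaining dummy columns (sex_male, smoker_yes, region_northwest, region_southeast, region_southwest) are the same names that appear later in the model expression; each coefficient is then read relative to the dropped baseline level.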

3.3 Get the independent and dependent variables

In [5]:

# Get the independent variables (x) and the dependent variable (y)
y = data['charges'].values
data = data.drop(['charges'], axis = 1)
x = data.values

3.4 Split into training and test sets

In [6]:

# Split into training and test sets (80% / 20%)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 1)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(1070, 8)
(268, 8)
(1070,)
(268,)

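4. Build the multiple linear regression model

In [7]:

# Build the multiple linear regression model
from sklearn.linear_model import LinearRegression
# The `normalize` argument was removed in scikit-learn 1.2; for ordinary least
# squares it does not change the fitted coefficients or predictions, so it is omitted.
regressor = LinearRegression(fit_intercept=True)
regressor.fit(x_train, y_train)

Out[7]:

LinearRegression()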
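LinearRegression fits the coefficients by ordinary least squares. As a rough illustration, the sketch below checks this on synthetic data (not the insurance data; the true coefficients 3, -2 and 5 are arbitrary choices) by comparing the fitted model with a direct least-squares solve in NumPy.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (not the insurance data): y = 3*x0 - 2*x1 + 5 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + 5 + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)

# Ordinary least squares by hand: append a column of ones for the intercept
X1 = np.column_stack([X, np.ones(len(X))])
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)

print('sklearn:', model.coef_, model.intercept_)
print('lstsq:  ', beta[:2], beta[2])
```

Both approaches should agree up to floating-point error, which is why the fitted model can be written out as a single linear expression in the next step.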

5. Get the model expression

In [8]:

# Print the model expression: one "name * coefficient" term per column, then the intercept
print('The model expression is:\n Charges = ', end='')
columns = data.columns
coefs = regressor.coef_
for i in range(len(columns)):
    print('%s * %.2f + ' %(columns[i], coefs[i]), end='')
print(regressor.intercept_)

The model expression is:
 Charges = age * 257.49 + bmi * 321.62 + children * 408.06 + sex_male * -242.15 + smoker_yes * 23786.49 + region_northwest * -396.10 + region_southeast * -1038.38 + region_southwest * -903.03 + -11297.610008539417

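From the expression above, the smoker_yes variable has a large effect on the dependent variable: with the other variables held fixed, being a smoker raises the predicted charges by about 23,786.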
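As a worked example, the expression can be evaluated by hand for row 0 of the data.head(5) output (a 19-year-old female smoker from the southwest, bmi 27.9, no children). This sketch uses the rounded coefficients as printed, so the result can differ slightly from regressor.predict.

```python
# Rounded coefficients copied from the printed model expression
coefs = {
    'age': 257.49, 'bmi': 321.62, 'children': 408.06, 'sex_male': -242.15,
    'smoker_yes': 23786.49, 'region_northwest': -396.10,
    'region_southeast': -1038.38, 'region_southwest': -903.03,
}
intercept = -11297.610008539417

# Row 0 of data.head(5) after one-hot encoding
row0 = {
    'age': 19, 'bmi': 27.9, 'children': 0, 'sex_male': 0,
    'smoker_yes': 1, 'region_northwest': 0,
    'region_southeast': 0, 'region_southwest': 1,
}

# Sum of coefficient * value over all columns, plus the intercept
predicted = intercept + sum(coefs[name] * row0[name] for name in coefs)
print('predicted charges: %.2f' % predicted)
```

The prediction comes out at roughly 25,451 against a recorded charge of 16,884.92 for that row, and the smoker_yes term alone contributes 23,786.49 of it.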

6. Predict on the test set

In [9]:

# Predict on the test set
y_pred = regressor.predict(x_test)

In [10]:

# Put the true and predicted charges side by side
compare_df = pd.DataFrame(y_test, columns=['truth'])
compare_df['pred'] = y_pred
compare_df.head(10)

Out[10]:

	truth	pred
0	1646.42970	4383.680900
1	11353.22760	12885.038922
2	8798.59300	12589.216532
3	10381.47870	13286.229192
4	2103.08000	544.728328
5	38746.35510	32117.584008
6	9304.70190	12919.042372
7	11658.11505	12318.621830
8	3070.80870	3784.291456
9	19539.24300	29468.457254

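7. Compute the model MSE

In [11]:

# Compute the MSE of the model on the test set
from sklearn.metrics import mean_squared_error
mse_score = mean_squared_error(y_test, y_pred)
print('MSE of the multiple linear regression model: %.2f' %(mse_score))

MSE of the multiple linear regression model: 35479352.81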
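The MSE is in squared units of charges, which makes it hard to read directly. A minimal sketch of two easier-to-interpret figures follows: the RMSE derived from the reported MSE, and the mean absolute error on the ten (truth, pred) pairs displayed in Out[10]. Those ten rows are only a small subset of the 268 test rows, so the second number is illustrative rather than a score for the model.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# MSE reported for the full test set
reported_mse = 35479352.81

# RMSE is the square root of the MSE and is in the same units as charges
rmse = np.sqrt(reported_mse)
print('RMSE: %.2f' % rmse)

# The ten (truth, pred) pairs displayed in Out[10], a subset of the test set
truth = np.array([1646.42970, 11353.22760, 8798.59300, 10381.47870, 2103.08000,
                  38746.35510, 9304.70190, 11658.11505, 3070.80870, 19539.24300])
pred = np.array([4383.680900, 12885.038922, 12589.216532, 13286.229192, 544.728328,
                 32117.584008, 12919.042372, 12318.621830, 3784.291456, 29468.457254])
print('MAE on the ten displayed rows: %.2f' % mean_absolute_error(truth, pred))
```

An RMSE of about 5,956 means a typical prediction error on the order of a few thousand in charge units, which matches the gaps visible in the comparison table.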

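8. Draw a violin plot of smoking versus medical charges

In [12]:

# Draw a violin plot of smoking versus medical charges
# (charges was dropped from data in step 3.3, so it is added back first)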
data['charges'] = y
sns.violinplot(x='smoker_yes', y='charges', data=data)
sns.stripplot(x='smoker_yes', y='charges', jitter=True, color='red', data=data)

Out[12]:

<matplotlib.axes._subplots.AxesSubplot at 0x1f84b105108>

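Conclusions: 1) In the violin plot above, the charges of non-smokers (left violin) are densely concentrated at the low end, and the points above the median are comparatively sparse;

2) For smokers (right violin), the distribution is roughly symmetric and more evenly spread, and even its minimum reaches about the median charge of non-smokers;

3) Comparing the two violins shows that the average medical charges of smokers are far higher than those of non-smokers;

4) This is consistent with the model expression, in which smoker_yes has a large coefficient: whether a person smokes strongly affects medical charges.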
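The conclusions above are read off the plot; they could also be checked numerically with a per-group summary. The sketch below runs on just the five rows shown by data.head(5), which is far too few to draw conclusions from, so treat it as a template. On the full dataset, the same groupby would be applied to the frame read from insurance.csv before encoding.

```python
import pandas as pd

# The five rows shown by data.head(5); a stand-in for the full dataset
sample = pd.DataFrame({
    'smoker': ['yes', 'no', 'no', 'no', 'no'],
    'charges': [16884.92400, 1725.55230, 4449.46200, 21984.47061, 3866.85520],
})

# Mean, median and minimum charges per smoker group
summary = sample.groupby('smoker')['charges'].agg(['mean', 'median', 'min'])
print(summary)
```

Comparing the 'min' of the smoker group with the 'median' of the non-smoker group is a direct check of conclusion 2, and comparing the two 'mean' values checks conclusion 3.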


Machine Learning: Regression 2-2 (Multiple Linear Regression)