Linear Regression: An Introductory Example

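  • Predict house prices with a public dataset (California housing) that has 8 features and 1 target value
  • Features are expanded into polynomial terms of degree at most 2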
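
To see what a degree-2 expansion of 8 features contains, the terms can be enumerated by hand. The sketch below uses only the standard library and is an illustration, not how scikit-learn builds the terms internally; the helper name `degree2_terms` is made up for this example.

```python
from itertools import combinations_with_replacement

# Feature names of the California housing dataset
FEATURES = ["MedInc", "HouseAge", "AveRooms", "AveBedrms",
            "Population", "AveOccup", "Latitude", "Longitude"]


def degree2_terms(names):
    """List the linear, squared and pairwise-product terms (no bias column)."""
    terms = list(names)  # degree-1 terms
    for a, b in combinations_with_replacement(names, 2):
        # A feature paired with itself is a square; otherwise a cross term
        terms.append(f"{a}^2" if a == b else f"{a} {b}")
    return terms


terms = degree2_terms(FEATURES)
squares = [t for t in terms if t.endswith("^2")]
n_cross = len(terms) - len(FEATURES) - len(squares)
print(len(FEATURES), len(squares), n_cross, len(terms))  # 8 8 28 44
```

The 28 cross terms are the C(8, 2) unordered pairs of distinct features, so the model ends up with 8 + 8 + 28 = 44 inputs.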

Code example

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# 1. Load the public dataset
data = fetch_california_housing()
print('California housing dataset description:')
print(data.DESCR)  # 20,640 rows, 8 features; the target is the median house value
np.set_printoptions(threshold=1000)
print('California housing features:')
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
X = pd.DataFrame(data.data, columns=data.feature_names)  # wrap the features in a DataFrame
print(X)
print('California housing target values:')
y = data.target  # target: the median house value for each row, in units of $100,000
print(y)

# 2. Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

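# 3. Build a polynomial regression Pipeline: standardization, polynomial expansion, linear regression
model = Pipeline([
    ("scaler", StandardScaler()),  # rescale each feature to mean 0, variance 1
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),  # terms of degree at most 2
    ("linear", LinearRegression())  # ordinary least squares linear regression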
])

# 4. Fit the model
model.fit(X_train, y_train)

# 5. Predict
y_pred = model.predict(X_test)

# 6. Evaluate
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean squared error (MSE): {mse:.4f}")
print(f"Coefficient of determination (R²): {r2:.4f}")

# 7. Inspect the generated polynomial features
poly_feature_names = model.named_steps["poly"].get_feature_names_out(X.columns)
print("Polynomial features:")
print(poly_feature_names)  # 8 (original) + 8 (squares) + 28 (pairwise products) = 44

# 8. Inspect the fitted coefficients
linear = model.named_steps['linear']
print("Polynomial coefficients:")
print(linear.coef_)  # also 44 coefficients, one per polynomial feature
print(linear.intercept_)
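
The two metrics in step 6 can also be written out directly, which makes their definitions explicit. This is a minimal plain-Python sketch of the textbook formulas, not scikit-learn's implementation; the function names and the toy values are invented for illustration.

```python
import math


def mse(y_true, y_pred):
    """Mean of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """1 - SS_res / SS_tot, with SS_tot measured around the mean of y_true."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values, invented for illustration only
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]

print(mse(y_true, y_pred))             # 0.125
print(r2(y_true, y_pred))              # 0.9, up to floating-point rounding
print(math.sqrt(mse(y_true, y_pred)))  # RMSE, in the same unit as the target
```

Taking the square root of the MSE gives an error in the unit of the target, which for this dataset is $100,000.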

Output

California housing dataset description:
.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

:Number of Instances: 20640

:Number of Attributes: 8 numeric, predictive attributes and the target

:Attribute Information:
    - MedInc        median income in block group
    - HouseAge      median house age in block group
    - AveRooms      average number of rooms per household
    - AveBedrms     average number of bedrooms per household
    - Population    block group population
    - AveOccup      average number of household members
    - Latitude      block group latitude
    - Longitude     block group longitude

:Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bureau publishes sample data (a block group typically has a population
of 600 to 3,000 people).

A household is a group of people residing within a home. Since the average
number of rooms and bedrooms in this dataset are provided per household, these
columns may take surprisingly large values for block groups with few households
and many empty houses, such as vacation resorts.

It can be downloaded/loaded using the
:func:`sklearn.datasets.fetch_california_housing` function.

.. rubric:: References

- Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
  Statistics and Probability Letters, 33:291-297, 1997.

California housing features:
       MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude
0      8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88    -122.23
1      8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86    -122.22
2      7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85    -122.24
3      5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85    -122.25
4      3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85    -122.25
...       ...       ...       ...        ...         ...       ...       ...        ...
20635  1.5603      25.0  5.045455   1.133333       845.0  2.560606     39.48    -121.09
20636  2.5568      18.0  6.114035   1.315789       356.0  3.122807     39.49    -121.21
20637  1.7000      17.0  5.205543   1.120092      1007.0  2.325635     39.43    -121.22
20638  1.8672      18.0  5.329513   1.171920       741.0  2.123209     39.43    -121.32
20639  2.3886      16.0  5.254717   1.162264      1387.0  2.616981     39.37    -121.24

[20640 rows x 8 columns]
California housing target values:
[4.526 3.585 3.521 ... 0.923 0.847 0.894]
Mean squared error (MSE): 0.4643
Coefficient of determination (R²): 0.6457
Polynomial features:
['MedInc' 'HouseAge' 'AveRooms' 'AveBedrms' 'Population' 'AveOccup'
 'Latitude' 'Longitude' 'MedInc^2' 'MedInc HouseAge' 'MedInc AveRooms'
 'MedInc AveBedrms' 'MedInc Population' 'MedInc AveOccup'
 'MedInc Latitude' 'MedInc Longitude' 'HouseAge^2' 'HouseAge AveRooms'
 'HouseAge AveBedrms' 'HouseAge Population' 'HouseAge AveOccup'
 'HouseAge Latitude' 'HouseAge Longitude' 'AveRooms^2'
 'AveRooms AveBedrms' 'AveRooms Population' 'AveRooms AveOccup'
 'AveRooms Latitude' 'AveRooms Longitude' 'AveBedrms^2'
 'AveBedrms Population' 'AveBedrms AveOccup' 'AveBedrms Latitude'
 'AveBedrms Longitude' 'Population^2' 'Population AveOccup'
 'Population Latitude' 'Population Longitude' 'AveOccup^2'
 'AveOccup Latitude' 'AveOccup Longitude' 'Latitude^2'
 'Latitude Longitude' 'Longitude^2']
Polynomial coefficients:
[ 0.93594011  0.13205802 -0.38759869  0.53020674  0.04051346 -1.78126342
 -1.27267893 -1.1676299  -0.11222558  0.03784584  0.17978116 -0.1201516
  0.11142996 -0.09883978 -0.66721635 -0.58616928  0.0332914  -0.01624672
  0.05234485  0.0360252  -0.27866746 -0.2767792  -0.25281254  0.06040245
 -0.10958604 -0.15473981  0.57792376  0.54353082  0.47907069  0.04954482
  0.24209969 -0.40169311 -0.48876332 -0.4228783   0.00195178  0.32361526
  0.03280047  0.01523969  0.00769438  0.50676749  0.36713809  0.2632096
  0.4351273   0.15301617]
1.956590491804413
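
The feature names and the coefficients are printed as two separate arrays in the same order, so reading them together takes some effort. A small sketch of one way to pair them and rank the terms by absolute coefficient follows; `rank_terms` is a hypothetical helper, and the eight values are the first eight names and coefficients copied from the output above.

```python
def rank_terms(names, coefs, k=3):
    """Pair each term with its coefficient; return the k largest by absolute value."""
    pairs = zip(names, coefs)
    return sorted(pairs, key=lambda p: abs(p[1]), reverse=True)[:k]


# First eight names and coefficients, copied from the printed output
names = ["MedInc", "HouseAge", "AveRooms", "AveBedrms",
         "Population", "AveOccup", "Latitude", "Longitude"]
coefs = [0.93594011, 0.13205802, -0.38759869, 0.53020674,
         0.04051346, -1.78126342, -1.27267893, -1.1676299]

for name, coef in rank_terms(names, coefs):
    print(f"{name:10s} {coef:+.4f}")
```

In the script itself the same helper could be called as `rank_terms(poly_feature_names, linear.coef_, k=10)`. The ranking is only a rough guide: the inputs are standardized before expansion, but the squared and cross terms are not rescaled afterwards, so coefficient sizes are not strictly comparable across term types.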
posted @ 2025-09-18 15:47  java拌饭