11.4 Experiment 1: Data Preparation and Model Evaluation

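I. Objectives

Become familiar with basic Python operations, and learn to implement dataset reading and writing as well as model performance evaluation;

Deepen the understanding of training sets, test sets, N-fold cross-validation, and model evaluation criteria.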
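To make the idea of N-fold cross-validation concrete, the sketch below splits ten toy samples into five folds and prints which indices fall into the training and test sets each time. The array `X_toy` is illustrative data invented for this example, not part of the experiment.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples with indices 0..9 (illustrative data, not the iris set).
X_toy = np.arange(10).reshape(-1, 1)

# Without shuffling, each test fold is a contiguous block of indices.
kf = KFold(n_splits=5)
folds = [(train_idx.tolist(), test_idx.tolist())
         for train_idx, test_idx in kf.split(X_toy)]

for i, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"fold {i}: train={train_idx} test={test_idx}")
```

Every sample lands in the test set exactly once, and in the training set for the other four folds.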

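II. Tasks

(1) Read the iris dataset from a local file with the pandas library;

(2) Load the iris dataset directly from the scikit-learn library;

(3) Train the model using five-fold cross-validation;

(4) Compute and print the model's accuracy, precision, recall, and F1 score.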

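III. Algorithm Steps, Code, and Results

1. Algorithm pseudocode

2. Main algorithm code

(1) Complete source code

[Algorithm code written by yourself, without calling an existing library implementation of the algorithm]

Reading the iris dataset from a local file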

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import classification_report

# (1) Read the iris dataset from a local file
iris_data = pd.read_csv('iris.csv')  # make sure this path is correct and the file exists

# Extract the features and the target variable
X = iris_data.iloc[:, :-1].values  # features: every column except the last (four columns, assuming the file has no ID column)
y = iris_data.iloc[:, -1].values   # target: the last column

# Convert the target variable to numeric labels
y = pd.factorize(y)[0]  # map species names to integer labels (0, 1, 2)

# (3) Train the model with five-fold cross-validation
model = RandomForestClassifier(random_state=42)

# Use KFold to generate the cross-validation splits
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Collect the true and predicted labels from all folds
all_y_true = []
all_y_pred = []

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Train the model (fit() refits from scratch on each fold)
    model.fit(X_train, y_train)

    # Predict on the held-out fold
    y_pred = model.predict(X_test)

    # Save the results
    all_y_true.extend(y_test)
    all_y_pred.extend(y_pred)

# (4) Compute the evaluation metrics over the pooled predictions
report = classification_report(all_y_true, all_y_pred, zero_division=0, output_dict=True)
accuracy = report['accuracy']
precision = report['weighted avg']['precision']
recall = report['weighted avg']['recall']
f1 = report['weighted avg']['f1-score']

# Print the results
print(f'Accuracy: {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'F1 score: {f1:.4f}')
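The label-encoding step in the code above can be checked in isolation. The following is a minimal sketch that assumes a small inline CSV standing in for `iris.csv`; the column names and the four rows are hypothetical, and the real file's layout may differ. It shows that `pd.factorize` returns both the integer codes and the unique labels, so predictions can be mapped back to species names.

```python
import io
import pandas as pd

# Hypothetical four-row CSV standing in for iris.csv; the real file's
# column names and layout may differ.
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
5.0,3.6,1.4,0.2,setosa
"""
df = pd.read_csv(io.StringIO(csv_text))

# Features: every column except the last; target: the last column.
X = df.iloc[:, :-1].to_numpy()
codes, uniques = pd.factorize(df.iloc[:, -1])

print(X.shape)        # (4, 4)
print(codes)          # integer label of each row, in order of first appearance
print(list(uniques))  # label names; position i corresponds to code i

# Map integer codes back to the original species names.
label_names = [uniques[c] for c in codes]
print(label_names)
```

If a local CSV also carries an ID column, `iloc[:, :-1]` would include it as a feature, so it is worth checking `df.columns` before training.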

Loading the iris dataset directly from the scikit-learn library

from sklearn import datasets
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# (2) Load the iris dataset directly from the scikit-learn library
iris = datasets.load_iris()
X = iris.data
y = iris.target

# (3) Train the model with five-fold cross-validation
model = RandomForestClassifier(random_state=42)

# Use StratifiedKFold so that each fold keeps the class proportions
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Per-fold evaluation metrics
accuracy_list = []
precision_list = []
recall_list = []
f1_list = []

for train_index, test_index in kf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Train the model (fit() refits from scratch on each fold)
    model.fit(X_train, y_train)

    # Predict on the held-out fold
    y_pred = model.predict(X_test)

    # (4) Compute the evaluation metrics for this fold
    report = classification_report(y_test, y_pred, output_dict=True)
    accuracy = report['accuracy']
    precision = report['weighted avg']['precision']
    recall = report['weighted avg']['recall']
    f1 = report['weighted avg']['f1-score']

    accuracy_list.append(accuracy)
    precision_list.append(precision)
    recall_list.append(recall)
    f1_list.append(f1)

# Average the metrics over the five folds
avg_accuracy = sum(accuracy_list) / len(accuracy_list)
avg_precision = sum(precision_list) / len(precision_list)
avg_recall = sum(recall_list) / len(recall_list)
avg_f1 = sum(f1_list) / len(f1_list)

# Print the results
print(f'Accuracy: {avg_accuracy:.4f}')
print(f'Precision: {avg_precision:.4f}')
print(f'Recall: {avg_recall:.4f}')
print(f'F1 score: {avg_f1:.4f}')
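The two scripts use different splitters: `KFold` for the local file and `StratifiedKFold` for the built-in dataset. As a rough illustration of the difference, the sketch below counts how many samples of each class land in every test fold. The helper `class_counts_per_test_fold` is a name introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, StratifiedKFold

X, y = load_iris(return_X_y=True)

def class_counts_per_test_fold(splitter):
    """Return, for each test fold, the sample count of each of the 3 classes."""
    return [np.bincount(y[test_idx], minlength=3).tolist()
            for _, test_idx in splitter.split(X, y)]

plain = class_counts_per_test_fold(
    KFold(n_splits=5, shuffle=True, random_state=42))
stratified = class_counts_per_test_fold(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=42))

print("KFold:          ", plain)
print("StratifiedKFold:", stratified)
```

With 50 samples per class and five folds, the stratified splitter places 10 samples of each class in every test fold, while plain `KFold` only guarantees 30 samples per fold with no control over the class mix.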

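(2) Library methods used

[For library implementations of the algorithm, state which parameters are passed, what each one is, what it represents, and what effect it has]

1. Loading the dataset

Method: load_iris()

Return value: a dictionary-like Bunch object holding the data, with the following attributes:

data: the feature data (input features), a NumPy array of shape (150, 4).

target: the target data (labels), a NumPy array of shape (150,).

target_names: the names of the target classes ('setosa', 'versicolor', 'virginica').

feature_names: the names of the features ('sepal length (cm)', 'sepal width (cm)', and so on).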
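A short sketch of inspecting these attributes, using only `load_iris()` as described above:

```python
from sklearn.datasets import load_iris

iris = load_iris()

print(iris.data.shape)           # (150, 4)
print(iris.target.shape)         # (150,)
print(list(iris.target_names))   # ['setosa', 'versicolor', 'virginica']
print(iris.feature_names)

# The Bunch object also supports dictionary-style access.
same_array = iris["data"] is iris.data
print(same_array)                # True
```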

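2. Splitting the dataset

Method: train_test_split()

Parameters:

*arrays: the feature and target arrays to split; they must have the same length.

test_size: the proportion of samples assigned to the test set, for example 0.2 (20%).

random_state: the random seed, which makes the split reproducible.

Return value: the training and test portions of each input array, in the order X_train, X_test, y_train, y_test.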
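A minimal sketch of a single hold-out split with these parameters. The `stratify=y` argument is an addition not mentioned above; it is included here on the assumption that keeping the class proportions equal in both parts is desirable.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 80% training, 20% test; stratify=y (an assumption of this sketch) keeps
# the three classes equally represented in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print(X_train.shape, X_test.shape)   # (120, 4) (30, 4)
print(np.bincount(y_test))           # samples per class in the test set
```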

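3. Training the model

Method: RandomForestClassifier()

Parameters:

n_estimators: the number of trees in the forest (for example 100, which is also the default in recent scikit-learn versions).

random_state: the random seed, which makes training reproducible.

Usage:

fit(X_train, y_train): trains the model.

predict(X_test): predicts labels for new samples.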
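The fit/predict cycle can be sketched on a single hold-out split as follows. The split itself is only there to make the example self-contained; the experiment scripts use cross-validation instead.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 100 trees, with a fixed seed so repeated runs build the same forest.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print(len(model.estimators_))   # number of fitted trees: 100
print(y_pred[:5])               # first five predicted labels
print(y_test[:5])               # first five true labels
```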

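4. Evaluating the model

Methods:

accuracy_score(y_true, y_pred): computes accuracy.

Parameters:

y_true: the true labels.

y_pred: the predicted labels.

precision_score(y_true, y_pred, average='weighted'): computes precision.

Parameters:

y_true: the true labels.

y_pred: the predicted labels.

average: the averaging method for multi-class problems (such as 'micro', 'macro', 'weighted'). 'weighted' averages the per-class scores weighted by the number of true samples in each class.

recall_score(y_true, y_pred, average='weighted'): computes recall.

Parameters:

average: same as above.

f1_score(y_true, y_pred, average='weighted'): computes the F1 score.

Parameters:

average: same as above.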
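To see how the `average` options differ, the sketch below scores six hand-made labels with unequal class sizes. The label lists are illustrative values chosen so the results can be verified by hand; they are not outputs of the experiment.

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# Toy labels: class 0 has four true samples, class 1 has two.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1]

# Per class: precision is 4/5 for class 0 and 1/1 for class 1;
# recall is 4/4 for class 0 and 1/2 for class 1.
acc = accuracy_score(y_true, y_pred)                          # 5/6
p_macro = precision_score(y_true, y_pred, average='macro')    # (0.8 + 1.0) / 2
p_weighted = precision_score(y_true, y_pred, average='weighted')  # (4*0.8 + 2*1.0) / 6
p_micro = precision_score(y_true, y_pred, average='micro')    # 5/6, same as accuracy
r_weighted = recall_score(y_true, y_pred, average='weighted') # (4*1.0 + 2*0.5) / 6
f1_weighted = f1_score(y_true, y_pred, average='weighted')

print(f'accuracy           {acc:.4f}')
print(f'precision macro    {p_macro:.4f}')
print(f'precision weighted {p_weighted:.4f}')
print(f'precision micro    {p_micro:.4f}')
print(f'recall weighted    {r_weighted:.4f}')
print(f'F1 weighted        {f1_weighted:.4f}')
```

Note that weighted recall always equals accuracy in single-label classification, which is why the two values coincide in the experiment's output as well.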

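3. Result screenshots (including: accuracy; precision, recall, F1)

(1) Accuracy: ***

[Screenshot goes here]

Loading the iris dataset directly from the scikit-learn library

Reading the iris dataset from a local file

(2) Precision: ***, recall: ***, F1: ***

[Screenshot goes here]

Loading the iris dataset directly from the scikit-learn library

Reading the iris dataset from a local file

[All later experiments follow this format]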

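IV. Reflections

Through this experiment I learned how to read a local data file with the pandas library and how to load a common dataset directly from scikit-learn. I also practiced five-fold cross-validation, an important model evaluation technique, to check that the model performs stably and reliably across different subsets of the data. By computing accuracy, precision, recall, and F1 score, I evaluated the model's performance from several angles.

During the experiment I came to appreciate how cross-validation helps reveal overfitting and underfitting, and how different evaluation metrics reflect different aspects of model performance. This hands-on experience deepened my understanding of machine learning model evaluation and gives me a useful reference for later project work.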

posted @ 2024-12-21 16:15 The-rich
