11.6

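Experiment 3: Implementing and Testing the C4.5 Algorithm (with Pre-pruning and Post-pruning)

I. Objectives

Gain a solid understanding of decision trees and of the principles behind pre-pruning and post-pruning; implement training and testing of a C4.5 decision tree with pre-pruning and post-pruning in Python; and use five-fold cross-validation to train and evaluate the model.

II. Tasks

(1) Load the iris dataset from scikit-learn and hold out 1/3 of the samples as the test set (use stratified sampling so that both sets follow the same class distribution);

(2) Train a C4.5 classifier with pre-pruning and post-pruning on the training set;

(3) Use five-fold cross-validation to evaluate and select the model by accuracy, precision, recall, and F1 score;

(4) Measure the model's performance on the test set, analyze the test results, and complete the Experiment 3 part of the lab report.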

III. Algorithm Steps, Code, and Results

1. Algorithm pseudocode

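Algorithm C4.5(data, attributes, validation_data):
    // data: training samples
    // attributes: features available for splitting
    // validation_data: validation set, used for pruning
    tree = BuildTree(data, attributes, validation_data)
    // Post-pruning runs once, bottom-up, on the fully grown tree
    PostPrune(tree, validation_data)
    Return tree

Function BuildTree(data, attributes, validation_data):
    // If every instance at this node has the same class label, return a leaf
    If all instances in data have the same class:
        Return Leaf(the class label of the instances)

    // If no features remain, return a leaf with the most frequent class label
    If attributes is empty:
        Return Leaf(MajorityClass(data))

    // Choose the best splitting attribute
    best_attribute = ChooseBestAttribute(data, attributes)

    // Pre-pruning: decide before the node is expanded
    If best_attribute is None or PrePruneCondition(data, best_attribute, validation_data):
        Return Leaf(MajorityClass(data))

    tree = CreateNode(best_attribute)

    // Recurse on every possible value of the best attribute
    For each value v of best_attribute:
        subset = Filter(data, best_attribute, v)
        If subset is empty:
            // Empty subset: use the parent's majority class as the child leaf
            Attach Leaf(MajorityClass(data)) to tree under branch v
        Else:
            child_node = BuildTree(subset, attributes - {best_attribute},
                                   Filter(validation_data, best_attribute, v))
            Attach child_node to tree under branch v

    Return tree

Function ChooseBestAttribute(data, attributes):
    best_gain_ratio = 0
    best_attribute = None
    For each attribute in attributes:
        gain_ratio = CalculateGainRatio(data, attribute)
        If gain_ratio > best_gain_ratio:
            best_gain_ratio = gain_ratio
            best_attribute = attribute
    Return best_attribute

Function CalculateGainRatio(data, attribute):
    // C4.5 ranks attributes by gain ratio rather than raw information gain:
    // gain ratio = information gain / split information
    // Return the gain ratio

Function PrePruneCondition(data, best_attribute, validation_data):
    // Decide whether to stop splitting at this node, for example when the split
    // does not improve accuracy on validation_data over a majority-class leaf
    // Return a Boolean

Function PostPrune(tree, validation_data):
    // Traverse the tree bottom-up and try to prune every non-leaf node
    For each non-leaf node in PostOrder(tree):
        If CanPrune(node, validation_data):
            Replace node with a leaf node using majority class label

Function CanPrune(node, validation_data):
    // True if replacing node with a leaf does not lower accuracy on validation_data
    // Return a Boolean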
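The attribute-selection step in the pseudocode can be made concrete with a small sketch. The helper names `entropy` and `gain_ratio` and the toy `outlook`/`play` data are illustrative assumptions, not part of the original experiment, and the sketch handles discrete attributes only (C4.5 additionally searches thresholds for continuous features such as those in iris).

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D collection of discrete values."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def gain_ratio(values, labels):
    """Gain ratio obtained by splitting `labels` on the discrete attribute `values`."""
    values = np.asarray(values)
    labels = np.asarray(labels)
    n = len(labels)
    # Weighted entropy of the class labels inside each branch of the split
    conditional = 0.0
    for v in np.unique(values):
        mask = values == v
        conditional += mask.sum() / n * entropy(labels[mask])
    info_gain = entropy(labels) - conditional
    # Split information: entropy of the branch sizes themselves
    split_info = entropy(values)
    return info_gain / split_info if split_info > 0 else 0.0

# Toy data, made up for illustration
outlook = ["sunny", "sunny", "rain", "rain"]
windy = ["yes", "no", "yes", "no"]
play = ["no", "no", "yes", "yes"]

print("gain ratio of outlook:", gain_ratio(outlook, play))
print("gain ratio of windy:", gain_ratio(windy, play))
```

Here `outlook` separates the classes perfectly while `windy` carries no information about `play`, so an attribute chooser based on gain ratio would pick `outlook`.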

2. Main code

Full source code / library methods called (with parameter descriptions)

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score, f1_score, classification_report

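# 1. Load the Iris dataset and split it with the hold-out method
iris = load_iris()
X = iris.data
y = iris.target

# Hold out 1/3 of the samples as the test set; stratify=y keeps the class proportions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=42, stratify=y)

print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)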

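# 2. Train decision trees on the training set
# Note: scikit-learn's DecisionTreeClassifier is an optimized CART implementation,
# not C4.5; it stands in for C4.5 in this experiment.
# Pre-pruned tree (max_depth limited to 3)
clf_preprune = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_preprune.fit(X_train, y_train)

# Unpruned tree, grown to full depth as the baseline
clf_full = DecisionTreeClassifier(random_state=42)
clf_full.fit(X_train, y_train)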

# 3. Evaluate the models with five-fold cross-validation
# One scorer per metric; precision, recall and F1 are weighted averages over the classes
scoring = {
    'accuracy': make_scorer(accuracy_score),
    'precision': make_scorer(precision_score, average='weighted'),
    'recall': make_scorer(recall_score, average='weighted'),
    'f1': make_scorer(f1_score, average='weighted'),
}

# Evaluate the pre-pruned model
scores_preprune = cross_validate(clf_preprune, X_train, y_train, cv=5, scoring=scoring)

# Evaluate the unpruned model
scores_full = cross_validate(clf_full, X_train, y_train, cv=5, scoring=scoring)

# Print the mean over the five folds (the keys also include fit_time and score_time)
print("Pre-pruned model, cross-validation results:")
for key in scores_preprune.keys():
    print(f"{key}: {scores_preprune[key].mean()}")

print("\nUnpruned model, cross-validation results:")
for key in scores_full.keys():
    print(f"{key}: {scores_full[key].mean()}")

# 4. Measure model performance on the test set
# Predict the test-set labels
y_pred_preprune = clf_preprune.predict(X_test)
y_pred_full = clf_full.predict(X_test)

# Print the test-set reports
print("\nPre-pruned model, test-set performance:")
print(classification_report(y_test, y_pred_preprune, target_names=iris.target_names, digits=4))

print("Unpruned model, test-set performance:")
print(classification_report(y_test, y_pred_full, target_names=iris.target_names, digits=4))
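The script above compares a pre-pruned tree with an unpruned one but does not train a post-pruned tree. One way to add that step with scikit-learn is minimal cost-complexity pruning through the `ccp_alpha` parameter. This is a sketch under assumptions: cost-complexity pruning is CART's post-pruning method rather than the pruning used by C4.5, and the variable names (`clf_postprune`, `best_alpha`) and the choice of selecting alpha by cross-validated accuracy are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=42, stratify=y)

# Candidate alphas come from the pruning path of the fully grown tree.
# The last alpha is dropped because it prunes the tree down to a single node.
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)
alphas = [max(float(a), 0.0) for a in path.ccp_alphas[:-1]]  # guard against tiny negative values

# Pick the alpha with the best mean five-fold accuracy on the training set
cv_means = [
    cross_val_score(DecisionTreeClassifier(random_state=42, ccp_alpha=a),
                    X_train, y_train, cv=5).mean()
    for a in alphas
]
best_alpha = alphas[int(np.argmax(cv_means))]

# Grow the full tree and let scikit-learn prune it back with the chosen alpha
clf_postprune = DecisionTreeClassifier(random_state=42, ccp_alpha=best_alpha)
clf_postprune.fit(X_train, y_train)

print("chosen ccp_alpha:", best_alpha)
print("number of leaves:", clf_postprune.get_n_leaves())
print("test accuracy:", clf_postprune.score(X_test, y_test))
```

A larger `ccp_alpha` removes more of the tree, so selecting it by cross-validation keeps the test set untouched until the final evaluation.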

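Library methods used

1. load_iris

Loads the Iris dataset.

from sklearn.datasets import load_iris

Parameters:

return_X_y: if True, returns the tuple (data, target); if False, returns a Bunch object containing the data (default: False).

Returns:

With the default setting, a Bunch object whose features and labels are available as iris.data and iris.target.

2. train_test_split

Randomly splits the data into a training set and a test set.

from sklearn.model_selection import train_test_split

Parameters:

test_size: size of the test set, either a float between 0 and 1 (a proportion) or an int (an absolute number of samples).

random_state: random seed (makes the split reproducible).

stratify: class labels to stratify on (keeps the class distribution the same in the training and test sets).

Returns:

The training and test subsets of each input array; for inputs (X, y) the order is X_train, X_test, y_train, y_test.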
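A quick way to confirm that `stratify` keeps the class distribution is to count the labels on each side of the split. The sketch below reuses the same split arguments as the experiment script; the variable names `train_counts` and `test_counts` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=42, stratify=y)

# Number of samples per class (0 = setosa, 1 = versicolor, 2 = virginica)
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)

print("train class counts:", train_counts)
print("test class counts:", test_counts)
```

Because iris has 50 samples per class, a stratified split gives per-class counts that differ by at most one within each set.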

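3. DecisionTreeClassifier

Usage: DecisionTreeClassifier(max_depth, random_state)

Parameters:

max_depth: maximum depth of the tree. Limiting it acts as pre-pruning; with the default None the tree grows until the leaves are pure or too small to split.

random_state: random seed, which makes the fitted tree reproducible.

4. fit

Usage: clf.fit(X_train, y_train)

Purpose: trains the model on the training features and labels.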
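To see how a pre-pruning parameter limits tree growth, the sketch below fits three trees and prints their size. `min_samples_leaf` is a second pre-pruning parameter of `DecisionTreeClassifier` that the experiment script does not use; it appears here only for comparison, and the trees are fitted on the whole dataset purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Fully grown tree, depth-limited tree, and a tree whose leaves need at least 5 samples
full = DecisionTreeClassifier(random_state=42).fit(X, y)
shallow = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)
big_leaves = DecisionTreeClassifier(min_samples_leaf=5, random_state=42).fit(X, y)

for name, clf in [("unpruned", full),
                  ("max_depth=3", shallow),
                  ("min_samples_leaf=5", big_leaves)]:
    print(f"{name}: depth={clf.get_depth()}, leaves={clf.get_n_leaves()}")
```

A tree of depth 3 can have at most 8 leaves, which is why `max_depth` is a simple and predictable way to cap model complexity.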

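5. cross_validate

Usage: cross_validate(estimator, X, y, cv, scoring)

Parameters:

estimator: the model to evaluate.

X: feature data.

y: class labels.

cv: number of cross-validation folds.

scoring: evaluation metrics (such as accuracy and precision); a dict maps a metric name to a scorer.

6. make_scorer

Purpose: wraps a metric function into a scorer that cross-validation can use.

Parameters: the metric function (such as accuracy_score), plus keyword arguments that are passed on to it (such as average='weighted').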
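The dictionary returned by `cross_validate` holds one array per metric with one entry per fold, plus timing arrays. The sketch below shows the keys and reports the mean and standard deviation per metric; it runs on the whole iris dataset for brevity, which is an assumption that differs from the experiment script (which cross-validates on the training set only).

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, f1_score, make_scorer
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

scoring = {
    "accuracy": make_scorer(accuracy_score),
    "f1": make_scorer(f1_score, average="weighted"),
}
scores = cross_validate(DecisionTreeClassifier(max_depth=3, random_state=42),
                        X, y, cv=5, scoring=scoring)

# Each metric name is prefixed with "test_"; fit_time and score_time are added automatically
print(sorted(scores.keys()))
for name in scoring:
    fold_scores = scores[f"test_{name}"]
    print(f"{name}: mean={fold_scores.mean():.4f}, std={fold_scores.std():.4f}")
```

Reporting the standard deviation next to the mean shows how much the score varies between folds, which matters on a dataset this small.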

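7. classification_report

Usage: classification_report(y_true, y_pred, target_names, digits)

Purpose: builds a text report with per-class precision, recall, F1 and support. It returns a string, which the script then prints.

Parameters:

y_true: true labels.

y_pred: predicted labels.

target_names: class names shown in the report.

digits: number of decimal places.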
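When the numbers are needed for further processing rather than for display, `classification_report` can return a nested dictionary through its `output_dict` parameter. The labels below are made up for illustration.

```python
from sklearn.metrics import classification_report

# Toy labels, made up for illustration: one versicolor sample is predicted as virginica
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
names = ["setosa", "versicolor", "virginica"]

# Text form, as used in the experiment script
text_report = classification_report(y_true, y_pred, target_names=names, digits=4)
print(text_report)

# Dictionary form, convenient for comparing models programmatically
report = classification_report(y_true, y_pred, target_names=names, output_dict=True)
print("versicolor recall:", report["versicolor"]["recall"])
print("accuracy:", report["accuracy"])
```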

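8. Metric functions

Accuracy:

from sklearn.metrics import accuracy_score

accuracy_score(y_true, y_pred)

Precision:

from sklearn.metrics import precision_score

precision_score(y_true, y_pred, average='macro')

Recall:

from sklearn.metrics import recall_score

recall_score(y_true, y_pred, average='macro')

F1 score:

from sklearn.metrics import f1_score

f1_score(y_true, y_pred, average='macro')

Parameters:

y_true: true labels.

y_pred: predicted labels.

average: how the per-class scores are combined for multi-class data. 'macro' computes the score for each class and takes the unweighted mean; 'weighted', which the scorers in the script above use, weights each class by its number of true samples.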
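The difference between 'macro' and 'weighted' averaging only shows up when the classes have different sizes. The sketch below uses a made-up, imbalanced two-class example to illustrate it.

```python
from sklearn.metrics import precision_score, recall_score

# Toy labels, made up for illustration: 4 samples of class 0, 2 of class 1,
# with one class-1 sample misclassified as class 0
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0]

# Per-class recall is 1.0 for class 0 and 0.5 for class 1
macro_recall = recall_score(y_true, y_pred, average="macro")        # (1.0 + 0.5) / 2
weighted_recall = recall_score(y_true, y_pred, average="weighted")  # (4*1.0 + 2*0.5) / 6

# Per-class precision is 0.8 for class 0 and 1.0 for class 1
macro_precision = precision_score(y_true, y_pred, average="macro")        # (0.8 + 1.0) / 2
weighted_precision = precision_score(y_true, y_pred, average="weighted")  # (4*0.8 + 2*1.0) / 6

print("recall    macro:", macro_recall, " weighted:", weighted_recall)
print("precision macro:", macro_precision, " weighted:", weighted_precision)
```

Weighted recall always equals overall accuracy, so on the nearly balanced iris test set the two averaging modes give very similar numbers.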

3. Training results screenshot (accuracy, precision, recall, F1)

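IV. Analysis of Results

1. Test results screenshot (accuracy, precision, recall, F1)

Comparative analysis

Overall performance:

On the test set, the pre-pruned model beats the unpruned model on accuracy, precision, recall, and F1. The gap is clearest in overall accuracy: 0.9800 for the pre-pruned model versus 0.9400 for the unpruned model.

Per-class performance:

Setosa: both models are perfect, with every metric at 1.0000.

Versicolor: the pre-pruned model has the higher recall (0.9412 versus 0.8235 for the unpruned model), so it identifies this class better.

Virginica: the pre-pruned model also does better on precision and F1 (0.9444 and 0.9714 respectively), while the unpruned model's precision is 0.8500.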
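The figures quoted above can be cross-checked against a confusion pattern that reproduces them. The sketch below is a reconstruction under assumptions, not the experiment's actual output: it assumes the 50 test samples contain 16 setosa, 17 versicolor and 17 virginica, and that every error is a versicolor sample predicted as virginica (one error for the pre-pruned model, three for the unpruned model). The function name `metrics_for` is hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Assumed test-set composition: 16 setosa (0), 17 versicolor (1), 17 virginica (2)
y_true = np.repeat([0, 1, 2], [16, 17, 17])

def metrics_for(versicolor_errors):
    """Metrics when `versicolor_errors` versicolor samples are predicted as virginica."""
    y_pred = y_true.copy()
    wrong = np.flatnonzero(y_true == 1)[:versicolor_errors]
    y_pred[wrong] = 2
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "versicolor_recall": recall_score(y_true, y_pred, average=None)[1],
        "virginica_precision": precision_score(y_true, y_pred, average=None)[2],
        "virginica_f1": f1_score(y_true, y_pred, average=None)[2],
    }

pre_pruned = metrics_for(1)  # 49 of 50 correct
unpruned = metrics_for(3)    # 47 of 50 correct

for name, result in [("pre-pruned", pre_pruned), ("unpruned", unpruned)]:
    print(name, {k: round(float(v), 4) for k, v in result.items()})
```

Under these assumptions the whole gap between the two models comes down to two extra versicolor samples that the unpruned tree assigns to virginica.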

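Fit and score time:

The pre-pruned model's fit and score times (the fit_time and score_time values reported by cross_validate) are slightly lower than the unpruned model's, which suggests that pre-pruning also has a small advantage in computational cost.

Conclusion

In this experiment the pre-pruned model outperforms the unpruned model in both accuracy and efficiency, and it identifies some classes, versicolor in particular, more reliably. This indicates that a suitable pruning strategy helps a decision tree generalize better.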

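V. Reflections

Through this experiment I gained a deeper understanding of pre-pruning and post-pruning in decision trees and of how they apply to the C4.5 algorithm. Working on a decision tree with pre-pruning and post-pruning improved my practical programming skills and made clearer to me how much these techniques matter for model performance.

During five-fold cross-validation I came to see that this evaluation method helps detect overfitting and underfitting, and that it gives a fairly accurate and reliable estimate of model performance. By adjusting the model's parameters I could watch the performance change, which deepened my understanding of parameter tuning.

Testing the model on the test set let me see directly how it performs on unseen data, compare that with the five-fold cross-validation results, and judge how well the model generalizes. Overall, this experiment improved my programming ability and deepened my understanding of machine learning algorithms.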

Posted 2024-12-21 16:16 by The-rich
