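Three Decision Tree Algorithms

This walkthrough uses the classic watermelon dataset (17 samples) to compute, by hand, the information gain (ID3 decision tree algorithm), the resulting decision tree, the gain ratio (C4.5 decision tree algorithm), and the Gini index (CART decision tree algorithm). Attribute names and values are kept as they appear in the dataset: 色泽 (color), 根蒂 (root), 敲声 (knock sound), 纹理 (texture), 脐部 (navel), 触感 (touch). The class label is 好瓜 (good melon) or 坏瓜 (bad melon).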

[Image: the watermelon dataset table]


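Information gain (ID3 algorithm)

Two quantities matter here: information entropy and information gain.

[Information entropy]{.underline} measures how mixed a set of samples is: the larger the entropy, the more mixed the class labels. The goal of splitting is to make each subset as pure as possible, so we want the entropy to be small.

[Information gain]{.underline} is the drop in entropy produced by a split. The entropy before the split is fixed, so the smaller the entropy after the split, the larger the gain, which is what we want.

entropy(D) denotes the entropy of dataset D before any split: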

$$
entropy(D) = - \sum_{i = 1}^{k}p(c_{i})\log_{2}p(c_{i})
$$

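where $p(c_{i})$ is the proportion of samples in D that belong to class $c_{i}$, and k is the number of classes.

Back to the watermelon dataset: we first compute entropy(D) before any split. The formula needs the class proportions, so we count the labels. There are two classes, 好瓜 (good) and 坏瓜 (bad), with proportions 8/17 and 9/17: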

$$
entropy(D) = - \frac{8}{17}\log_{2}\frac{8}{17} - \frac{9}{17}\log_{2}\frac{9}{17}
$$
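There are two convenient ways to evaluate this expression.

① Excel

Excel is a capable data-processing tool, and its built-in formulas can do the calculation. Put the proportion X in column A and let another column hold the result Y. For a two-class problem the formula is:

$$
Y = IF(OR(A2 = 0,A2 = 1),0, - A2 \ast LOG(A2,2) - (1 - A2) \ast LOG(1 - A2,2))
$$

[Image: Excel sheet evaluating the formula]

② Code

Python is used as the example here; other languages are similar.

[Image: Python implementation]

Writing code is slightly more work, so the spreadsheet is recommended for quick hand calculations.

Either way, the entropy of the unsplit dataset is entropy(D) = 0.998.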
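Since the original Python screenshot is not reproduced in the text, here is a minimal sketch of what such a calculation might look like. The function name `entropy` and its count-based interface are assumptions made for illustration, not the author's code.

```python
import math


def entropy(counts):
    """Shannon entropy (base 2) of a class distribution given as counts."""
    total = sum(counts)
    # Classes with zero count are skipped, following the convention
    # 0 * log2(0) = 0. log2(total / c) equals -log2(c / total).
    return sum(c / total * math.log2(total / c) for c in counts if c > 0)


# 8 good melons and 9 bad melons in the full dataset.
h_root = entropy([8, 9])
print(round(h_root, 3))
```

Passing counts rather than probabilities avoids rounding the fractions 8/17 and 9/17 before taking logarithms.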

Next we compute the entropy after splitting on an attribute A, written entropy(D,A):

$$
entropy(D,A) = \sum_{i = 1}^{m}\frac{|D_{i}|}{|D|} \cdot entropy(D_{i})
$$
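Here A is the attribute used for the split, $D_{i}$ is the subset of samples that take the i-th value of A, m is the number of distinct values of A, and $\frac{|D_{i}|}{|D|}$ is the fraction of samples that fall into $D_{i}$.

In practice:

The attribute set of the watermelon dataset is {色泽, 根蒂, 敲声, 纹理, 脐部, 触感}.

Take 色泽 first. It has three values: {青绿, 乌黑, 浅白}.

Splitting on 色泽 gives the entropy:

$$
\begin{aligned}
entropy(D,色泽) & = \frac{6}{17}\left( - \frac{3}{6}\log_{2}\frac{3}{6} - \frac{3}{6}\log_{2}\frac{3}{6} \right) + \frac{6}{17}\left( - \frac{4}{6}\log_{2}\frac{4}{6} - \frac{2}{6}\log_{2}\frac{2}{6} \right) \\
& \quad + \frac{5}{17}\left( - \frac{1}{5}\log_{2}\frac{1}{5} - \frac{4}{5}\log_{2}\frac{4}{5} \right) = 0.889
\end{aligned}
$$

The weights 6/17, 6/17 and 5/17 are the fractions of samples with 色泽 = 青绿, 乌黑 and 浅白. Each parenthesized term is the entropy of that subset, entropy(Di), computed from the proportions of good and bad melons within it: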
$$
entropy(D_{i}) = - \frac{好瓜}{i特征}\log_{2}\frac{好瓜}{i特征} - \frac{坏瓜}{i特征}\log_{2}\frac{坏瓜}{i特征}
$$

$$
entropy(青绿) = - \frac{好瓜3}{青绿6}\log_{2}\frac{好瓜3}{青绿6} - \frac{坏瓜3}{青绿6}\log_{2}\frac{坏瓜3}{青绿6}
$$


The information gain is defined as:
$$
Gain(D,A) = entropy(D) - entropy(D,A)
$$

The information gain of 色泽 is:

$$
Gain(D,色泽) = entropy(D) - entropy(D,色泽)=0.998-0.889=0.109
$$
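The 色泽 calculation can be checked with a short sketch. The helper names `entropy` and `conditional_entropy`, and the (good, bad) count pairs, are illustrative assumptions; the counts themselves are the ones used in the hand calculation.

```python
import math


def entropy(counts):
    """Base-2 entropy of a class distribution given as counts."""
    total = sum(counts)
    return sum(c / total * math.log2(total / c) for c in counts if c > 0)


def conditional_entropy(partition):
    """Weighted entropy after a split.

    partition maps each attribute value to its (good, bad) counts.
    """
    n = sum(sum(c) for c in partition.values())
    return sum(sum(c) / n * entropy(c) for c in partition.values())


# (good, bad) counts for each value of 色泽.
color = {"青绿": (3, 3), "乌黑": (4, 2), "浅白": (1, 4)}

h_after = conditional_entropy(color)
gain = entropy([8, 9]) - h_after
print(round(h_after, 3), round(gain, 3))
```

Working with unrounded intermediates gives a gain of about 0.108; the hand calculation subtracts the already rounded values 0.998 and 0.889, which is why it reports 0.109.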


The information gain of the other attributes is computed the same way:

{根蒂}
$$
\begin{aligned}
entropy(D,根蒂)
& = \frac{8}{17}\left( - \frac{5}{8}\log_{2}\frac{5}{8} - \frac{3}{8}\log_{2}\frac{3}{8} \right) + \frac{7}{17}\left( - \frac{3}{7}\log_{2}\frac{3}{7} - \frac{4}{7}\log_{2}\frac{4}{7} \right) \\
& \quad + \frac{2}{17}\left( - \frac{0}{2}\log_{2}\frac{0}{2} - \frac{2}{2}\log_{2}\frac{2}{2} \right) \\
& = \frac{8}{17} \times 0.954 + \frac{7}{17} \times 0.985 + 0 = 0.855
\end{aligned}
$$

$$
\begin{aligned}
Gain(D,根蒂) = entropy(D) - entropy(D,根蒂)
= 0.998 - 0.855 = 0.143
\end{aligned}
$$

{敲声}

$$
\begin{aligned}
& entropy(D,敲声) \\
& = \frac{10}{17}\left( - \frac{6}{10}\log_{2}\frac{6}{10} - \frac{4}{10}\log_{2}\frac{4}{10} \right) + \frac{5}{17}\left( - \frac{3}{5}\log_{2}\frac{3}{5} - \frac{2}{5}\log_{2}\frac{2}{5} \right) \\
& \quad + \frac{2}{17}\left( - \frac{0}{2}\log_{2}\frac{0}{2} - \frac{2}{2}\log_{2}\frac{2}{2} \right) \\
& = \frac{10}{17} \times 0.971 + \frac{5}{17} \times 0.971 + 0 = 0.857
\end{aligned}
$$

$$
\begin{aligned}
Gain(D,敲声) & = entropy(D) - entropy(D,敲声) = 0.998 - 0.857 = 0.141
\end{aligned}
$$

{纹理}

The three values of 纹理 cover 9 samples (清晰: 7 good, 2 bad), 3 samples (模糊: all bad) and 5 samples (稍糊: 1 good, 4 bad):

$$
\begin{aligned}
& entropy(D,纹理) \\
& = \frac{9}{17}\left( - \frac{7}{9}\log_{2}\frac{7}{9} - \frac{2}{9}\log_{2}\frac{2}{9} \right) + \frac{3}{17}\left( - \frac{0}{3}\log_{2}\frac{0}{3} - \frac{3}{3}\log_{2}\frac{3}{3} \right) \\
& \quad + \frac{5}{17}\left( - \frac{1}{5}\log_{2}\frac{1}{5} - \frac{4}{5}\log_{2}\frac{4}{5} \right) \\
& = \frac{9}{17} \times 0.764 + 0 + \frac{5}{17} \times 0.722 = 0.617
\end{aligned}
$$

$$
\begin{aligned}
Gain(D,纹理) & = entropy(D) - entropy(D,纹理)= 0.998 - 0.617 = 0.381
\end{aligned}
$$

{脐部}

$$
\begin{aligned}
& entropy(D,脐部) \\
& = \frac{7}{17}\left( - \frac{5}{7}\log_{2}\frac{5}{7} - \frac{2}{7}\log_{2}\frac{2}{7} \right) + \frac{6}{17}\left( - \frac{3}{6}\log_{2}\frac{3}{6} - \frac{3}{6}\log_{2}\frac{3}{6} \right) \\
& \quad + \frac{4}{17}\left( - \frac{0}{4}\log_{2}\frac{0}{4} - \frac{4}{4}\log_{2}\frac{4}{4} \right) \\
& = \frac{7}{17} \times 0.863 + \frac{6}{17} \times 1 + 0 = 0.708
\end{aligned}
$$

$$
\begin{aligned}
Gain(D,脐部) & = entropy(D) - entropy(D,脐部) = 0.998 - 0.708 = 0.290
\end{aligned}
$$

{触感}
$$
\begin{aligned}
& entropy(D,触感) \\
& = \frac{12}{17}\left( - \frac{6}{12}\log_{2}\frac{6}{12} - \frac{6}{12}\log_{2}\frac{6}{12} \right) + \frac{5}{17}\left( - \frac{3}{5}\log_{2}\frac{3}{5} - \frac{2}{5}\log_{2}\frac{2}{5} \right) \\
& = \frac{12}{17} \times 1 + \frac{5}{17} \times 0.971 = 0.991
\end{aligned}
$$

$$
\begin{aligned}
Gain(D,触感) & = entropy(D) - entropy(D,触感) = 0.998 - 0.991 = 0.007
\end{aligned}
$$


In summary:

Gain(D,色泽) = 0.109

Gain(D,根蒂) = 0.143

Gain(D,敲声) = 0.141

Gain(D,纹理) = 0.381

Gain(D,脐部) = 0.290

Gain(D,触感) = 0.007

纹理 has the largest information gain, so it becomes the root node of the decision tree.
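The root-level comparison can also be sketched in code. The dataset table is only available as an image here, so the rows below are an assumption: they follow the commonly published 17-sample watermelon dataset 2.0, which matches the sample numbers and counts used in this walkthrough. The names `ROWS`, `entropy` and `gain` are illustrative.

```python
import math
from collections import Counter, defaultdict

ATTRS = ["色泽", "根蒂", "敲声", "纹理", "脐部", "触感"]

# Assumed dataset: samples 1..17 in order, last column is the label
# (是 = good melon, 否 = bad melon).
RAW = """
青绿 蜷缩 浊响 清晰 凹陷 硬滑 是
乌黑 蜷缩 沉闷 清晰 凹陷 硬滑 是
乌黑 蜷缩 浊响 清晰 凹陷 硬滑 是
青绿 蜷缩 沉闷 清晰 凹陷 硬滑 是
浅白 蜷缩 浊响 清晰 凹陷 硬滑 是
青绿 稍蜷 浊响 清晰 稍凹 软粘 是
乌黑 稍蜷 浊响 稍糊 稍凹 软粘 是
乌黑 稍蜷 浊响 清晰 稍凹 硬滑 是
乌黑 稍蜷 沉闷 稍糊 稍凹 硬滑 否
青绿 硬挺 清脆 清晰 平坦 软粘 否
浅白 硬挺 清脆 模糊 平坦 硬滑 否
浅白 蜷缩 浊响 模糊 平坦 软粘 否
青绿 稍蜷 浊响 稍糊 凹陷 硬滑 否
浅白 稍蜷 沉闷 稍糊 凹陷 硬滑 否
乌黑 稍蜷 浊响 清晰 稍凹 软粘 否
浅白 蜷缩 浊响 模糊 平坦 硬滑 否
青绿 蜷缩 沉闷 稍糊 稍凹 硬滑 否
"""
ROWS = [line.split() for line in RAW.strip().splitlines()]


def entropy(labels):
    """Base-2 entropy of a list of class labels."""
    n = len(labels)
    return sum(c / n * math.log2(n / c) for c in Counter(labels).values())


def gain(rows, col):
    """Information gain of splitting rows on column index col."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[col]].append(row[-1])
    after = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - after


gains = {name: gain(ROWS, i) for i, name in enumerate(ATTRS)}
best = max(gains, key=gains.get)
for name, value in gains.items():
    print(name, round(value, 3))
print("best attribute:", best)
```

Because the code keeps full precision, individual gains can differ from the hand-rounded figures in the last digit, but the ranking and the choice of 纹理 as the root are unaffected.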

[Image: decision tree with 纹理 as the root node]


The subset with 纹理 = 清晰 is D1 = [1,2,3,4,5,6,8,10,15], nine samples.

Since 纹理 has already been used, the remaining attributes are {色泽, 根蒂, 敲声, 脐部, 触感}.

[Image: rows of subset D1]

$$
\begin{aligned}
\text{entropy}(D1) = - \frac{7}{9}\log_{2}\frac{7}{9} - \frac{2}{9}\log_{2}\frac{2}{9} = 0.764
\end{aligned}
$$
{色泽}
$$
\begin{aligned}
& \text{entropy}(D1,\text{色泽}) \\
& = \frac{4}{9}\left( - \frac{3}{4}\log_{2}\frac{3}{4} - \frac{1}{4}\log_{2}\frac{1}{4} \right) + \frac{4}{9}\left( - \frac{3}{4}\log_{2}\frac{3}{4} - \frac{1}{4}\log_{2}\frac{1}{4} \right) \\
& \quad + \frac{1}{9}\left( - \frac{1}{1}\log_{2}\frac{1}{1} - \frac{0}{1}\log_{2}\frac{0}{1} \right) \\
& = \frac{4}{9} \times 0.811 + \frac{4}{9} \times 0.811 + 0 = 0.721
\end{aligned}
$$

$$
\begin{aligned}
\text{Gain}(D1,\text{色泽}) & = \text{entropy}(D1) - \text{entropy}(D1,\text{色泽}) \\
& = 0.764 - 0.721 = 0.043
\end{aligned}
$$

{根蒂}

$$
\begin{aligned}
& \text{entropy}(D1,\text{根蒂}) \\
& = \frac{5}{9}\left( - \frac{5}{5}\log_{2}\frac{5}{5} - \frac{0}{5}\log_{2}\frac{0}{5} \right) + \frac{3}{9}\left( - \frac{2}{3}\log_{2}\frac{2}{3} - \frac{1}{3}\log_{2}\frac{1}{3} \right) \\
& \quad + \frac{1}{9}\left( - \frac{0}{1}\log_{2}\frac{0}{1} - \frac{1}{1}\log_{2}\frac{1}{1} \right) \\
& = 0 + \frac{3}{9} \times 0.918 + 0 = 0.306
\end{aligned}
$$

$$
\begin{aligned}
\text{Gain}(D1,\text{根蒂}) & = \text{entropy}(D1) - \text{entropy}(D1,\text{根蒂}) \\
& = 0.764 - 0.306 = 0.458
\end{aligned}
$$

{敲声}
$$
\begin{aligned}
& \text{entropy}(D1,\text{敲声}) \\
& = \frac{6}{9}\left( - \frac{5}{6}\log_{2}\frac{5}{6} - \frac{1}{6}\log_{2}\frac{1}{6} \right) + \frac{2}{9}\left( - \frac{2}{2}\log_{2}\frac{2}{2} - \frac{0}{2}\log_{2}\frac{0}{2} \right) \\
& \quad + \frac{1}{9}\left( - \frac{0}{1}\log_{2}\frac{0}{1} - \frac{1}{1}\log_{2}\frac{1}{1} \right) \\
& = \frac{6}{9} \times 0.65 + 0 + 0 = 0.433
\end{aligned}
$$

$$
\begin{aligned}
\text{Gain}(D1,\text{敲声}) & = \text{entropy}(D1) - \text{entropy}(D1,\text{敲声}) \\
& = 0.764 - 0.433 = 0.331
\end{aligned}
$$

{脐部}
$$
\begin{aligned}
& \text{entropy}(D1,\text{脐部}) \\
& = \frac{5}{9}\left( - \frac{5}{5}\log_{2}\frac{5}{5} - \frac{0}{5}\log_{2}\frac{0}{5} \right) + \frac{3}{9}\left( - \frac{2}{3}\log_{2}\frac{2}{3} - \frac{1}{3}\log_{2}\frac{1}{3} \right) \\
& \quad + \frac{1}{9}\left( - \frac{0}{1}\log_{2}\frac{0}{1} - \frac{1}{1}\log_{2}\frac{1}{1} \right) \\
& = 0 + \frac{3}{9} \times 0.918 + 0 = 0.306
\end{aligned}
$$

$$
\begin{aligned}
\text{Gain}(D1,\text{脐部}) & = \text{entropy}(D1) - \text{entropy}(D1,\text{脐部}) \\
& = 0.764 - 0.306 = 0.458
\end{aligned}
$$

{触感}
$$
\begin{aligned}
& \text{entropy}(D1,\text{触感}) \\
& = \frac{6}{9}\left( - \frac{6}{6}\log_{2}\frac{6}{6} - \frac{0}{6}\log_{2}\frac{0}{6} \right) + \frac{3}{9}\left( - \frac{1}{3}\log_{2}\frac{1}{3} - \frac{2}{3}\log_{2}\frac{2}{3} \right) \\
& = 0 + \frac{3}{9} \times 0.918 = 0.306
\end{aligned}
$$

$$
\begin{aligned}
\text{Gain}(D1,\text{触感}) & = \text{entropy}(D1) - \text{entropy}(D1,\text{触感}) \\
& = 0.764 - 0.306 = 0.458
\end{aligned}
$$


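In summary:

$Gain(D1,色泽) = 0.043$

$Gain(D1,根蒂) = 0.458$

$Gain(D1,敲声) = 0.331$

$Gain(D1,脐部) = 0.458$

$Gain(D1,触感) = 0.458$

{根蒂, 脐部, 触感} share the largest information gain, so the child of the root under 纹理 = 清晰 is one of these three. Because the gains are tied, any of them may be chosen.

Here we choose 根蒂 as the child node.

The branches 根蒂 = 蜷缩 and 根蒂 = 硬挺 are already pure (entropy 0), so they become leaves. The subset with 纹理 = 清晰 and 根蒂 = 稍蜷 is D1_1 = [6,8,15], three samples.

Since 纹理 and 根蒂 have already been used, the remaining attributes are {色泽, 敲声, 脐部, 触感}.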


$$\begin{matrix}
entropy(D1\_1) = - \frac{2}{3}\log_{2}\frac{2}{3} - \frac{1}{3}\log_{2}\frac{1}{3} = 0.918
\end{matrix}$$

{色泽}

$$\begin{matrix}
& entropy(D1\_1,色泽) \\
& = \frac{2}{3}\left( - \frac{1}{2}\log_{2}\frac{1}{2} - \frac{1}{2}\log_{2}\frac{1}{2} \right) + \frac{1}{3}\left( - \frac{1}{1}\log_{2}\frac{1}{1} - \frac{0}{1}\log_{2}\frac{0}{1} \right) \\
& = \frac{2}{3} \times 1 + 0 = 0.667
\end{matrix}$$

$$\begin{matrix}
Gain(D1\_1,色泽) & = entropy(D1\_1) - entropy(D1\_1,色泽) \\
& = 0.918 - 0.667 = 0.251
\end{matrix}$$

{敲声}

$$\begin{matrix}
& entropy(D1\_1,敲声) \\
& = \frac{3}{3}\left( - \frac{2}{3}\log_{2}\frac{2}{3} - \frac{1}{3}\log_{2}\frac{1}{3} \right) \\
& = 1 \times 0.918 = 0.918
\end{matrix}$$

$$\begin{matrix}
Gain(D1\_1,敲声) & = entropy(D1\_1) - entropy(D1\_1,敲声) \\
& = 0.918 - 0.918 = 0
\end{matrix}$$

{脐部}

$$\begin{matrix}
& entropy(D1\_1,脐部) \\
& = \frac{3}{3}\left( - \frac{2}{3}\log_{2}\frac{2}{3} - \frac{1}{3}\log_{2}\frac{1}{3} \right) \\
& = 1 \times 0.918 = 0.918
\end{matrix}$$

$$\begin{matrix}
Gain(D1\_1,脐部) & = entropy(D1\_1) - entropy(D1\_1,脐部) \\
& = 0.918 - 0.918 = 0
\end{matrix}$$

{触感}

$$\begin{matrix}
& entropy(D1\_1,触感) \\
& = \frac{2}{3}\left( - \frac{1}{2}\log_{2}\frac{1}{2} - \frac{1}{2}\log_{2}\frac{1}{2} \right) + \frac{1}{3}\left( - \frac{1}{1}\log_{2}\frac{1}{1} - \frac{0}{1}\log_{2}\frac{0}{1} \right) \\
& = \frac{2}{3} \times 1 + 0 = 0.667
\end{matrix}$$

$$\begin{matrix}
Gain(D1\_1,触感) & = entropy(D1\_1) - entropy(D1\_1,触感) \\
& = 0.918 - 0.667 = 0.251
\end{matrix}$$

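In summary:

$Gain(D1\_1,色泽) = 0.251$

$Gain(D1\_1,敲声) = 0$

$Gain(D1\_1,脐部) = 0$

$Gain(D1\_1,触感) = 0.251$

{色泽, 触感} share the largest information gain, so the child node under 根蒂 = 稍蜷 is one of these two. Because the gains are tied, either may be chosen.

Here we choose 色泽 as the child node.

In the same way, the subset with 纹理 = 清晰, 根蒂 = 稍蜷 and 色泽 = 乌黑 is D1_1_1 = [8,15], two samples.

Since 纹理, 根蒂 and 色泽 have already been used, the remaining attributes are {敲声, 脐部, 触感}.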


$$\begin{matrix}
entropy(D1\_1\_1) = - \frac{1}{2}\log_{2}\frac{1}{2} - \frac{1}{2}\log_{2}\frac{1}{2} = 1
\end{matrix}$$

{敲声}

$$\begin{matrix}
& entropy(D1\_1\_1,敲声) \\
& = \frac{2}{2}\left( - \frac{1}{2}\log_{2}\frac{1}{2} - \frac{1}{2}\log_{2}\frac{1}{2} \right) \\
& = 1 \times 1 = 1
\end{matrix}$$

$$\begin{matrix}
Gain(D1\_1\_1,敲声) & = entropy(D1\_1\_1) - entropy(D1\_1\_1,敲声) \\
& = 1 - 1 = 0
\end{matrix}$$

{脐部}

$$\begin{matrix}
& entropy(D1\_1\_1,脐部) \\
& = \frac{2}{2}\left( - \frac{1}{2}\log_{2}\frac{1}{2} - \frac{1}{2}\log_{2}\frac{1}{2} \right) \\
& = 1 \times 1 = 1
\end{matrix}$$

$$\begin{matrix}
Gain(D1\_1\_1,脐部) & = entropy(D1\_1\_1) - entropy(D1\_1\_1,脐部) \\
& = 1 - 1 = 0
\end{matrix}$$

{触感}

$$\begin{matrix}
& entropy(D1\_1\_1,触感) \\
& = \frac{1}{2}\left( - \frac{1}{1}\log_{2}\frac{1}{1} - \frac{0}{1}\log_{2}\frac{0}{1} \right) + \frac{1}{2}\left( - \frac{0}{1}\log_{2}\frac{0}{1} - \frac{1}{1}\log_{2}\frac{1}{1} \right) \\
& = 0 + 0 = 0
\end{matrix}$$

$$\begin{matrix}
Gain(D1\_1\_1,触感) & = entropy(D1\_1\_1) - entropy(D1\_1\_1,触感) \\
& = 1 - 0 = 1
\end{matrix}$$

In summary:

$Gain(D1\_1\_1,敲声) = 0$

$Gain(D1\_1\_1,脐部) = 0$

$Gain(D1\_1\_1,触感) = 1$

触感 has the largest information gain, so it becomes the child node under 色泽 = 乌黑.

Every path under 纹理 = 清晰 now ends in a leaf, so this branch is finished. The decision tree so far is shown in the figure.

[Image: ID3 decision tree after finishing the 纹理 = 清晰 branch]

The subset with 纹理 = 稍糊 is D2 = [7,9,13,14,17], five samples.

Since 纹理 has already been used, the remaining attributes are {色泽, 根蒂, 敲声, 脐部, 触感}.

$$\begin{matrix}
entropy(D2) = - \frac{1}{5}\log_{2}\frac{1}{5} - \frac{4}{5}\log_{2}\frac{4}{5} = 0.722
\end{matrix}$$

{色泽}

$$\begin{matrix}
& entropy(D2,色泽) \\
& = \frac{2}{5}\left( - \frac{1}{2}\log_{2}\frac{1}{2} - \frac{1}{2}\log_{2}\frac{1}{2} \right) + \frac{2}{5}\left( - \frac{2}{2}\log_{2}\frac{2}{2} - \frac{0}{2}\log_{2}\frac{0}{2} \right) \\
& \quad + \frac{1}{5}\left( - \frac{0}{1}\log_{2}\frac{0}{1} - \frac{1}{1}\log_{2}\frac{1}{1} \right) \\
& = \frac{2}{5} \times 1 + 0 + 0 = 0.4
\end{matrix}$$

$$\begin{matrix}
Gain(D2,色泽) & = entropy(D2) - entropy(D2,色泽) \
& = 0.722 - 0.4 = 0.322
\end{matrix}$$

{根蒂}

$$\begin{matrix}
& entropy(D2,根蒂) \
& = \frac{4}{5}\left( - \frac{3}{4}\log_{2}\frac{3}{4} - \frac{1}{4}\log_{2}\frac{1}{4} \right) + \frac{1}{5}\left( - \frac{0}{1}\log_{2}\frac{0}{1} - \frac{1}{1}\log_{2}\frac{1}{1} \right) \
& \
& = \frac{4}{5} \times 0.811 + 0 = 0.649
\end{matrix}$$

$$\begin{matrix}
Gain(D2,根蒂) & = entropy(D2) - entropy(D2,根蒂) \
& = 0.722 - 0.649 = 0.073
\end{matrix}$$

{敲声}

$$\begin{matrix}
& entropy(D2,敲声) \\
& = \frac{2}{5}\left( - \frac{1}{2}\log_{2}\frac{1}{2} - \frac{1}{2}\log_{2}\frac{1}{2} \right) + \frac{3}{5}\left( - \frac{3}{3}\log_{2}\frac{3}{3} - \frac{0}{3}\log_{2}\frac{0}{3} \right) \\
& = \frac{2}{5} \times 1 + 0 = 0.4
\end{matrix}$$

$$\begin{matrix}
Gain(D2,敲声) & = entropy(D2) - entropy(D2,敲声) \
& = 0.722 - 0.4 = 0.322
\end{matrix}$$

{脐部}

$$\begin{matrix}
& entropy(D2,脐部) \
& = \frac{3}{5}\left( - \frac{1}{3}\log_{2}\frac{1}{3} - \frac{2}{3}\log_{2}\frac{2}{3} \right) + \frac{2}{5}\left( - \frac{2}{2}\log_{2}\frac{2}{2} - \frac{0}{2}\log_{2}\frac{0}{2} \right) \
& \
& = \frac{3}{5} \times 0.918 + 0 = 0.551
\end{matrix}$$

$$\begin{matrix}
Gain(D2,脐部) & = entropy(D2) - entropy(D2,脐部) \
& = 0.722 - 0.551 = 0.171
\end{matrix}$$

{触感}

$$\begin{matrix}
& entropy(D2,触感) \
& = \frac{4}{5}\left( - \frac{4}{4}\log_{2}\frac{4}{4} - \frac{0}{4}\log_{2}\frac{0}{4} \right) + \frac{1}{5}\left( - \frac{0}{1}\log_{2}\frac{0}{1} - \frac{1}{1}\log_{2}\frac{1}{1} \right) \
& \
& = 0 + 0 = 0
\end{matrix}$$

$$\begin{matrix}
Gain(D2,触感) & = entropy(D2) - entropy(D2,触感) \
& = 0.722 - 0 = 0.722
\end{matrix}$$

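In summary:

$Gain(D2,色泽) = 0.322$

$Gain(D2,根蒂) = 0.073$

$Gain(D2,敲声) = 0.322$

$Gain(D2,脐部) = 0.171$

$Gain(D2,触感) = 0.722$

触感 has the largest information gain, so it becomes the child of the root under 纹理 = 稍糊.

The 纹理 = 模糊 branch contains only bad melons, so it is already a leaf. At this point every branch ends in a leaf and the splitting is finished. The decision tree for the watermelon dataset is shown in the figure.

[Image: final ID3 decision tree for the watermelon dataset]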
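The whole recursive procedure can be sketched compactly. This is a simplified illustration, not a full ID3 implementation: it assumes the commonly published 17-sample watermelon dataset 2.0 (the table is not reproduced in the text), it breaks ties by taking the first attribute in column order, and it does not create leaves for attribute values that are absent from a subset. The names `ROWS`, `split`, `gain` and `id3` are hypothetical.

```python
import math
from collections import Counter, defaultdict

ATTRS = ["色泽", "根蒂", "敲声", "纹理", "脐部", "触感"]

# Assumed dataset: samples 1..17, last column is the label (是/否).
RAW = """
青绿 蜷缩 浊响 清晰 凹陷 硬滑 是
乌黑 蜷缩 沉闷 清晰 凹陷 硬滑 是
乌黑 蜷缩 浊响 清晰 凹陷 硬滑 是
青绿 蜷缩 沉闷 清晰 凹陷 硬滑 是
浅白 蜷缩 浊响 清晰 凹陷 硬滑 是
青绿 稍蜷 浊响 清晰 稍凹 软粘 是
乌黑 稍蜷 浊响 稍糊 稍凹 软粘 是
乌黑 稍蜷 浊响 清晰 稍凹 硬滑 是
乌黑 稍蜷 沉闷 稍糊 稍凹 硬滑 否
青绿 硬挺 清脆 清晰 平坦 软粘 否
浅白 硬挺 清脆 模糊 平坦 硬滑 否
浅白 蜷缩 浊响 模糊 平坦 软粘 否
青绿 稍蜷 浊响 稍糊 凹陷 硬滑 否
浅白 稍蜷 沉闷 稍糊 凹陷 硬滑 否
乌黑 稍蜷 浊响 清晰 稍凹 软粘 否
浅白 蜷缩 浊响 模糊 平坦 硬滑 否
青绿 蜷缩 沉闷 稍糊 稍凹 硬滑 否
"""
ROWS = [line.split() for line in RAW.strip().splitlines()]


def entropy(labels):
    """Base-2 entropy of a list of class labels."""
    n = len(labels)
    return sum(c / n * math.log2(n / c) for c in Counter(labels).values())


def split(rows, col):
    """Group rows by the value they take in column col."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[col]].append(row)
    return groups


def gain(rows, col):
    """Information gain of splitting rows on column col."""
    after = sum(len(g) / len(rows) * entropy([r[-1] for r in g])
                for g in split(rows, col).values())
    return entropy([r[-1] for r in rows]) - after


def id3(rows, cols):
    """Return a leaf label or a nested dict {attribute: {value: subtree}}."""
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1 or not cols:
        return Counter(labels).most_common(1)[0][0]
    best = max(cols, key=lambda c: gain(rows, c))  # ties: first column wins
    rest = [c for c in cols if c != best]
    return {ATTRS[best]: {value: id3(group, rest)
                          for value, group in split(rows, best).items()}}


tree = id3(ROWS, list(range(len(ATTRS))))
print(tree)
```

With this tie-breaking rule the sketch happens to make the same choices as the hand calculation (根蒂, then 色泽, then 触感 under 纹理 = 清晰); a different tie-breaking rule would give a different but equally valid tree.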

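【Gain ratio】(C4.5 decision tree algorithm)

Suppose the sample 编号 (ID number) were also a candidate splitting attribute. It would produce 17 branches, one per sample.

Its information gain is:

$$\begin{matrix}
entropy(D,编号) & = \frac{1}{17}\left( - \frac{1}{1}\log_{2}\frac{1}{1} - \frac{0}{1}\log_{2}\frac{0}{1} \right) \\
& + ... + \frac{1}{17}\left( - \frac{1}{1}\log_{2}\frac{1}{1} - \frac{0}{1}\log_{2}\frac{0}{1} \right) \\
& = \frac{1}{17} \times 0 + \frac{1}{17} \times 0 + ... + \frac{1}{17} \times 0 = 0
\end{matrix}$$

$$\begin{matrix}
Gain(D,编号) & = entropy(D) - entropy(D,编号) \\
& = 0.998 - 0 = 0.998
\end{matrix}$$

When every branch holds a single sample, each branch is perfectly pure, so ID3 clearly favors attributes with many distinct values ("multi-valued attributes"). Such a tree has no generalization ability and cannot predict new samples effectively. To reduce the harm of this bias, C4.5 does not use the information gain directly; it uses the "gain ratio" instead.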

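The gain ratio is defined as:

$$Gain\_ratio(D,a) = \frac{Gain(D,a)}{IV(a)}$$

where

$$IV(a) = - \sum_{v = 1}^{V}\frac{|D^{v}|}{|D|}\log_{2}\frac{|D^{v}|}{|D|}$$

The gain ratio has two parts: the information gain computed above, and $IV(a)$, called the intrinsic value of attribute $a$. The more distinct values $a$ can take (the larger V), the larger $IV(a)$ tends to be, so the gain ratio may in turn favor attributes with few distinct values.

Note: the intrinsic value IV(a) is recomputed from the samples at the current node, so it changes as the tree is split recursively.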
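As a small illustration of the definition, the sketch below computes IV(a) from the branch sizes alone. The function names `intrinsic_value` and `gain_ratio` are assumptions for this example, as is the choice to return 0.0 when IV(a) is zero (an attribute with a single value at the current node).

```python
import math


def intrinsic_value(sizes):
    """IV(a) computed from the number of samples in each branch."""
    n = sum(sizes)
    return sum(s / n * math.log2(n / s) for s in sizes if s > 0)


def gain_ratio(gain, sizes):
    """Gain ratio = gain / IV(a); 0.0 if the split has a single branch."""
    iv = intrinsic_value(sizes)
    return gain / iv if iv > 0 else 0.0


# 色泽 splits the 17 samples into branches of size 6, 6 and 5.
iv_color = intrinsic_value([6, 6, 5])
# A per-sample ID attribute would give 17 branches of size 1.
iv_id = intrinsic_value([1] * 17)
print(round(iv_color, 3), round(iv_id, 3))
print(round(gain_ratio(0.998, [1] * 17), 3))
```

The ID-like attribute has the largest possible intrinsic value, log2(17), which is what pulls its gain ratio down despite a gain of 0.998.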

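[Why does the gain ratio favor attributes with few values?]{.underline}

When an attribute has few values (for example only 2), its $IV(a)$ is usually small because the split is coarse, so the gain ratio can be high even when the information gain is modest.

Conversely, an attribute with many values may have a high information gain, but its large $IV(a)$ can pull the gain ratio down.

The gain ratio therefore leans toward attributes that have few values and some information gain.

This tendency can lead to choosing a "less important" attribute too early while overlooking an attribute that has many values but is genuinely useful.

C4.5 addresses this with a two-step strategy:

Step 1: filter, keeping only attributes whose information gain is above average

•Compute the information gain of every candidate attribute;

•Compute the mean of these gains;

•Keep only the attributes whose information gain is above the mean.

✅Purpose:

•Discards attributes that have a high gain ratio but contribute little information (for example an attribute with few values that barely helps classification);

•Guarantees that the selected attribute carries at least some information.

Step 2: pick the highest gain ratio within the filtered set

•Among the "high information gain" attributes from step 1, rank by gain ratio;

•Choose the one with the highest gain ratio as the splitting attribute.

✅Purpose:

•Uses the gain ratio to avoid over-favoring multi-valued attributes;

•Still ensures the chosen attribute has enough information gain (not an inflated gain ratio).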
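The two steps above can be sketched as a small selection function. The name `c45_select` and the dictionary-based interface are hypothetical, and the fallback used when no attribute is strictly above the mean is an assumption added to keep the function total. The sample values are the D1 (纹理 = 清晰) gains and intrinsic values computed in this walkthrough.

```python
def c45_select(gains, ivs):
    """Pick a splitting attribute with the C4.5 two-step heuristic.

    gains, ivs: dicts mapping attribute name -> information gain / IV(a).
    """
    mean_gain = sum(gains.values()) / len(gains)
    # Step 1: keep attributes whose gain is strictly above the mean.
    candidates = [a for a in gains if gains[a] > mean_gain and ivs[a] > 0]
    # Assumed fallback: if nothing passes the filter, consider every
    # attribute that actually splits the data (IV > 0).
    if not candidates:
        candidates = [a for a in gains if ivs[a] > 0]
    # Step 2: among the candidates, take the highest gain ratio.
    return max(candidates, key=lambda a: gains[a] / ivs[a])


gains_d1 = {"色泽": 0.043, "根蒂": 0.458, "敲声": 0.331,
            "脐部": 0.458, "触感": 0.458}
ivs_d1 = {"色泽": 1.392, "根蒂": 1.351, "敲声": 1.224,
          "脐部": 1.351, "触感": 0.918}
print(c45_select(gains_d1, ivs_d1))
```

Here 根蒂, 脐部 and 触感 pass the filter with equal gains, and 触感 wins because its intrinsic value is the smallest of the three.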

Let us try this on the watermelon dataset:

{色泽}

$$\begin{matrix}
entropy(D,色泽) & = \frac{6}{17}\left( - \frac{3}{6}\log_{2}\frac{3}{6} - \frac{3}{6}\log_{2}\frac{3}{6} \right) \\
& + \frac{6}{17}\left( - \frac{4}{6}\log_{2}\frac{4}{6} - \frac{2}{6}\log_{2}\frac{2}{6} \right) + \frac{5}{17}\left( - \frac{1}{5}\log_{2}\frac{1}{5} - \frac{4}{5}\log_{2}\frac{4}{5} \right) \\
& = 0.889
\end{matrix}$$

$$Gain(D,色泽) = entropy(D) - entropy(D,色泽) = 0.998 - 0.889 = 0.109$$

$$IV(色泽) = - \frac{6}{17}\log_{2}\frac{6}{17} - \frac{6}{17}\log_{2}\frac{6}{17} - \frac{5}{17}\log_{2}\frac{5}{17} = 0.530 + 0.530 + 0.519 = 1.579$$

{根蒂}

$$\begin{matrix}
& entropy(D,根蒂) \
& = \frac{8}{17}\left( - \frac{5}{8}\log_{2}\frac{5}{8} - \frac{3}{8}\log_{2}\frac{3}{8} \right) + \frac{7}{17}\left( - \frac{3}{7}\log_{2}\frac{3}{7} - \frac{4}{7}\log_{2}\frac{4}{7} \right) \
& \quad + \frac{2}{17}\left( - \frac{0}{2}\log_{2}\frac{0}{2} - \frac{2}{2}\log_{2}\frac{2}{2} \right) \
& = \frac{8}{17} \times 0.954 + \frac{7}{17} \times 0.985 + 0 = 0.855
\end{matrix}$$

$$\begin{matrix}
Gain(D,根蒂) & = entropy(D) - entropy(D,根蒂) \
& = 0.998 - 0.855 = 0.143
\end{matrix}$$

$$IV(根蒂) = - \frac{8}{17}\log_{2}\frac{8}{17} - \frac{7}{17}\log_{2}\frac{7}{17} - \frac{2}{17}\log_{2}\frac{2}{17} = 0.512 + 0.527 + 0.363 = 1.402$$

{敲声}

$$\begin{matrix}
& entropy(D,敲声) \
& = \frac{10}{17}\left( - \frac{6}{10}\log_{2}\frac{6}{10} - \frac{4}{10}\log_{2}\frac{4}{10} \right) + \frac{5}{17}\left( - \frac{3}{5}\log_{2}\frac{3}{5} - \frac{2}{5}\log_{2}\frac{2}{5} \right) \
& \quad + \frac{2}{17}\left( - \frac{0}{2}\log_{2}\frac{0}{2} - \frac{2}{2}\log_{2}\frac{2}{2} \right) \
& = \frac{10}{17} \times 0.971 + \frac{5}{17} \times 0.971 + 0 = 0.857
\end{matrix}$$

$$\begin{matrix}
Gain(D,敲声) & = entropy(D) - entropy(D,敲声) \
& = 0.998 - 0.857 = 0.141
\end{matrix}$$

$$IV(敲声) = - \frac{10}{17}\log_{2}\frac{10}{17} - \frac{5}{17}\log_{2}\frac{5}{17} - \frac{2}{17}\log_{2}\frac{2}{17} = 0.450 + 0.519 + 0.363 = 1.332$$

{纹理}

$$\begin{matrix}
& entropy(D,纹理) \\
& = \frac{9}{17}\left( - \frac{7}{9}\log_{2}\frac{7}{9} - \frac{2}{9}\log_{2}\frac{2}{9} \right) + \frac{3}{17}\left( - \frac{0}{3}\log_{2}\frac{0}{3} - \frac{3}{3}\log_{2}\frac{3}{3} \right) \\
& \quad + \frac{5}{17}\left( - \frac{1}{5}\log_{2}\frac{1}{5} - \frac{4}{5}\log_{2}\frac{4}{5} \right) \\
& = \frac{9}{17} \times 0.764 + 0 + \frac{5}{17} \times 0.722 = 0.617
\end{matrix}$$

$$\begin{matrix}
Gain(D,纹理) & = entropy(D) - entropy(D,纹理) \\
& = 0.998 - 0.617 = 0.381
\end{matrix}$$

$$IV(纹理) = - \frac{9}{17}\log_{2}\frac{9}{17} - \frac{3}{17}\log_{2}\frac{3}{17} - \frac{5}{17}\log_{2}\frac{5}{17} = 0.486 + 0.442 + 0.519 = 1.447$$

{脐部}

$$\begin{matrix}
& entropy(D,脐部) \
& = \frac{7}{17}\left( - \frac{5}{7}\log_{2}\frac{5}{7} - \frac{2}{7}\log_{2}\frac{2}{7} \right) + \frac{6}{17}\left( - \frac{3}{6}\log_{2}\frac{3}{6} - \frac{3}{6}\log_{2}\frac{3}{6} \right) \
& \quad + \frac{4}{17}\left( - \frac{0}{4}\log_{2}\frac{0}{4} - \frac{4}{4}\log_{2}\frac{4}{4} \right) \
& = \frac{7}{17} \times 0.863 + \frac{6}{17} \times 1 + 0 = 0.708
\end{matrix}$$

$$\begin{matrix}
Gain(D,脐部) & = entropy(D) - entropy(D,脐部) \
& = 0.998 - 0.708 = 0.290
\end{matrix}$$

$$IV(脐部) = - {\frac{7}{17}\log_{2}}{\frac{7}{17} -}{\frac{6}{17}\log_{2}}{\frac{6}{17} -}{\frac{4}{17}\log_{2}}{\frac{4}{17} =}0.527 + 0.530 + 0.491 = 1.548$$

{触感}

$$\begin{matrix}
& entropy(D,触感) \
& = \frac{12}{17}\left( - \frac{6}{12}\log_{2}\frac{6}{12} - \frac{6}{12}\log_{2}\frac{6}{12} \right) + \frac{5}{17}\left( - \frac{3}{5}\log_{2}\frac{3}{5} - \frac{2}{5}\log_{2}\frac{2}{5} \right) \
& \
& = \frac{12}{17} \times 1 + \frac{5}{17} \times 0.971 = 0.991
\end{matrix}$$

$$\begin{matrix}
Gain(D,触感) & = entropy(D) - entropy(D,触感) \
& = 0.998 - 0.991 = 0.007
\end{matrix}$$

$$IV(触感) = - \frac{12}{17}\log_{2}\frac{12}{17} - \frac{5}{17}\log_{2}\frac{5}{17} = 0.355 + 0.519 = 0.874$$

In summary:

$Gain(D,色泽) = 0.109$, $IV(色泽) = 1.579$

$Gain(D,根蒂) = 0.143$, $IV(根蒂) = 1.402$

$Gain(D,敲声) = 0.141$, $IV(敲声) = 1.332$

$Gain(D,纹理) = 0.381$, $IV(纹理) = 1.447$

$Gain(D,脐部) = 0.290$, $IV(脐部) = 1.548$

$Gain(D,触感) = 0.007$, $IV(触感) = 0.874$

First, keep the attributes whose information gain is above the mean.

The mean information gain of the six attributes is:

(0.109+0.143+0.141+0.381+0.290+0.007)/6=0.179

Two attributes are above the mean: {纹理, 脐部}. (根蒂 at 0.143 and 敲声 at 0.141 fall below it.)

Among these, choose the one with the higher gain ratio:

$Gain\_ratio(D,纹理) = 0.381/1.447 = 0.263$

$Gain\_ratio(D,脐部) = 0.290/1.548 = 0.187$

纹理 has the largest gain ratio, so it becomes the root node of the decision tree.


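The subset with 纹理 = 清晰 is D1 = [1,2,3,4,5,6,8,10,15], nine samples.

Since 纹理 has already been used, the remaining attributes are {色泽, 根蒂, 敲声, 脐部, 触感}.

Note: the intrinsic value IV(a) is recomputed from the samples at the current node, so the values below differ from those at the root.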


$$\begin{matrix}
entropy(D1) = - \frac{7}{9}\log_{2}\frac{7}{9} - \frac{2}{9}\log_{2}\frac{2}{9} = 0.764
\end{matrix}$$

{色泽}

$$\begin{matrix}
& entropy(D1,色泽) \
& = \frac{4}{9}\left( - \frac{3}{4}\log_{2}\frac{3}{4} - \frac{1}{4}\log_{2}\frac{1}{4} \right) + \frac{4}{9}\left( - \frac{3}{4}\log_{2}\frac{3}{4} - \frac{1}{4}\log_{2}\frac{1}{4} \right) \
& \quad + \frac{1}{9}\left( - \frac{0}{1}\log_{2}\frac{0}{1} - \frac{1}{1}\log_{2}\frac{1}{1} \right) \
& = \frac{4}{9} \times 0.811 + \frac{4}{9} \times 0.811 + 0 = 0.721
\end{matrix}$$

$$\begin{matrix}
Gain(D1,色泽) & = entropy(D1) - entropy(D1,色泽) \
& = 0.764 - 0.721 = 0.043
\end{matrix}$$

$$IV(D1_ 色泽) = - {\frac{4}{9}\log_{2}}{\frac{4}{9} -}{\frac{4}{9}\log_{2}}{\frac{4}{9} - {\frac{1}{9}\log_{2}}\frac{1}{9} =}0.520 + 0.520 + 0.352 = 1.392$$

{根蒂}

$$\begin{matrix}
& entropy(D1,根蒂) \\
& = \frac{5}{9}\left( - \frac{5}{5}\log_{2}\frac{5}{5} - \frac{0}{5}\log_{2}\frac{0}{5} \right) + \frac{3}{9}\left( - \frac{2}{3}\log_{2}\frac{2}{3} - \frac{1}{3}\log_{2}\frac{1}{3} \right) \\
& \quad + \frac{1}{9}\left( - \frac{0}{1}\log_{2}\frac{0}{1} - \frac{1}{1}\log_{2}\frac{1}{1} \right) \\
& = 0 + \frac{3}{9} \times 0.918 + 0 = 0.306
\end{matrix}$$

$$\begin{matrix}
Gain(D1,根蒂) & = entropy(D1) - entropy(D1,根蒂) \
& = 0.764 - 0.306 = 0.458
\end{matrix}$$

$$IV(D1_ 根蒂) = - {\frac{5}{9}\log_{2}}{\frac{5}{9} -}{\frac{3}{9}\log_{2}}{\frac{3}{9} - {\frac{1}{9}\log_{2}}\frac{1}{9} =}0.471 + 0.528 + 0.352 = 1.351$$

{敲声}

$$\begin{matrix}
& entropy(D1,敲声) \
& = \frac{6}{9}\left( - \frac{5}{6}\log_{2}\frac{5}{6} - \frac{1}{6}\log_{2}\frac{1}{6} \right) + \frac{2}{9}\left( - \frac{2}{2}\log_{2}\frac{2}{2} - \frac{0}{2}\log_{2}\frac{0}{2} \right) \
& \quad + \frac{1}{9}\left( - \frac{0}{1}\log_{2}\frac{0}{1} - \frac{1}{1}\log_{2}\frac{1}{1} \right) \
& = \frac{6}{9} \times 0.65 + 0 + 0 = 0.433
\end{matrix}$$

$$\begin{matrix}
Gain(D1,敲声) & = entropy(D1) - entropy(D1,敲声) \
& = 0.764 - 0.433 = 0.331
\end{matrix}$$

$$IV(D1_ 敲声) = - {\frac{6}{9}\log_{2}}{\frac{6}{9} -}{\frac{2}{9}\log_{2}}{\frac{2}{9} - {\frac{1}{9}\log_{2}}\frac{1}{9} =}0.390 + 0.482 + 0.352 = 1.224$$

{脐部}

$$\begin{matrix}
& entropy(D1,脐部) \
& = \frac{5}{9}\left( - \frac{5}{5}\log_{2}\frac{5}{5} - \frac{0}{5}\log_{2}\frac{0}{5} \right) + \frac{3}{9}\left( - \frac{2}{3}\log_{2}\frac{2}{3} - \frac{1}{3}\log_{2}\frac{1}{3} \right) \
& \quad + \frac{1}{9}\left( - \frac{0}{1}\log_{2}\frac{0}{1} - \frac{1}{1}\log_{2}\frac{1}{1} \right) \
& = 0 + \frac{3}{9} \times 0.918 + 0 = 0.306
\end{matrix}$$

$$\begin{matrix}
Gain(D1,脐部) & = entropy(D1) - entropy(D1,脐部) \
& = 0.764 - 0.306 = 0.458
\end{matrix}$$

$$IV(D1_ 脐部) = - {\frac{5}{9}\log_{2}}{\frac{5}{9} -}{\frac{3}{9}\log_{2}}{\frac{3}{9} - {\frac{1}{9}\log_{2}}\frac{1}{9} =}0.471 + 0.528 + 0.352 = 1.351$$

{触感}

$$\begin{matrix}
& entropy(D1,触感) \
& = \frac{6}{9}\left( - \frac{6}{6}\log_{2}\frac{6}{6} - \frac{0}{6}\log_{2}\frac{0}{6} \right) + \frac{3}{9}\left( - \frac{2}{3}\log_{2}\frac{2}{3} - \frac{1}{3}\log_{2}\frac{1}{3} \right) \
& \
& = 0 + \frac{3}{9} \times 0.918 = 0.306
\end{matrix}$$

$$\begin{matrix}
Gain(D1,触感) & = entropy(D1) - entropy(D1,触感) \
& = 0.764 - 0.306 = 0.458
\end{matrix}$$

$$IV(D1_{触感}) = - \frac{6}{9}\log_{2}\frac{6}{9} - \frac{3}{9}\log_{2}\frac{3}{9} = 0.390 + 0.528 = 0.918$$

In summary (all values computed on D1):

$Gain(D1,色泽) = 0.043$, $IV(色泽) = 1.392$

$Gain(D1,根蒂) = 0.458$, $IV(根蒂) = 1.351$

$Gain(D1,敲声) = 0.331$, $IV(敲声) = 1.224$

$Gain(D1,脐部) = 0.458$, $IV(脐部) = 1.351$

$Gain(D1,触感) = 0.458$, $IV(触感) = 0.918$

The mean information gain of the five attributes is:

(0.043+0.458+0.331+0.458+0.458)/5=0.3496

Three attributes are above the mean: {根蒂, 脐部, 触感}.

Among these, choose the one with the highest gain ratio:

$Gain\_ratio(D1,根蒂) = 0.458/1.351 = 0.339$

$Gain\_ratio(D1,脐部) = 0.458/1.351 = 0.339$

$Gain\_ratio(D1,触感) = 0.458/0.918 = 0.499$

触感 has the largest gain ratio, so it becomes the node under the 纹理 = 清晰 branch of the root.

The 触感 = 硬滑 branch contains only good melons (entropy 0), so it is a leaf. The subset with 触感 = 软粘 is D1_1 = [6,10,15], three samples.

Since 纹理 and 触感 have already been used, the remaining attributes are {色泽, 根蒂, 敲声, 脐部}.


$$\begin{matrix}
entropy(D1_ 1) = - \frac{2}{3}\log_{2}\frac{2}{3} - \frac{1}{3}\log_{2}\frac{1}{3} = 0.918
\end{matrix}$$

{色泽}

$$\begin{matrix}
& entropy(D1_ 1,色泽) \
& = \frac{2}{3}\left( - \frac{1}{2}\log_{2}\frac{1}{2} - \frac{1}{2}\log_{2}\frac{1}{2} \right) + \frac{1}{3}\left( - \frac{0}{1}\log_{2}\frac{0}{1} - \frac{1}{1}\log_{2}\frac{1}{1} \right) \
& \
& = \frac{2}{3} \times 1 + 0 = 0.667
\end{matrix}$$

$$\begin{matrix}
Gain(D1_ 1,色泽) & = entropy(D1_ 1) - entropy(D1_ 1,色泽) \
& = 0.918 - 0.667 = 0.251
\end{matrix}$$

$$IV(D1\_1,色泽) = - \frac{2}{3}\log_{2}\frac{2}{3} - \frac{1}{3}\log_{2}\frac{1}{3} = 0.918$$

{根蒂}

$$\begin{matrix}
& entropy(D1_ 1,根蒂) \
& = \frac{2}{3}\left( - \frac{1}{2}\log_{2}\frac{1}{2} - \frac{1}{2}\log_{2}\frac{1}{2} \right) + \frac{1}{3}\left( - \frac{0}{1}\log_{2}\frac{0}{1} - \frac{1}{1}\log_{2}\frac{1}{1} \right) \
& \
& = \frac{2}{3} \times 1 + 0 = 0.667
\end{matrix}$$

$$\begin{matrix}
Gain(D1_ 1,根蒂) & = entropy(D1_ 1) - entropy(D1_ 1,根蒂) \
& = 0.918 - 0.667 = 0.251
\end{matrix}$$

$$IV(D1_ 1根蒂) = - {\frac{2}{3}\log_{2}}{\frac{2}{3} -}{\frac{1}{3}\log_{2}}{\frac{1}{3} =}0.918$$

{敲声}

$$\begin{matrix}
& entropy(D1_ 1,敲声) \
& = \frac{2}{3}\left( - \frac{1}{2}\log_{2}\frac{1}{2} - \frac{1}{2}\log_{2}\frac{1}{2} \right) + \frac{1}{3}\left( - \frac{0}{1}\log_{2}\frac{0}{1} - \frac{1}{1}\log_{2}\frac{1}{1} \right) \
& \
& = \frac{2}{3} \times 1 + 0 = 0.667
\end{matrix}$$

$$\begin{matrix}
Gain(D1_ 1,敲声) & = entropy(D1_ 1) - entropy(D1_ 1,敲声) \
& = 0.918 - 0.667 = 0.251
\end{matrix}$$

$$IV(D1_ 1敲声) = - {\frac{2}{3}\log_{2}}{\frac{2}{3} -}{\frac{1}{3}\log_{2}}{\frac{1}{3} =}0.918$$

{脐部}

$$\begin{matrix}
& entropy(D1_ 1,脐部) \
& = \frac{2}{3}\left( - \frac{1}{2}\log_{2}\frac{1}{2} - \frac{1}{2}\log_{2}\frac{1}{2} \right) + \frac{1}{3}\left( - \frac{0}{1}\log_{2}\frac{0}{1} - \frac{1}{1}\log_{2}\frac{1}{1} \right) \
& \
& = \frac{2}{3} \times 1 + 0 = 0.667
\end{matrix}$$

$$\begin{matrix}
Gain(D1_ 1,脐部) & = entropy(D1_ 1) - entropy(D1_ 1,脐部) \
& = 0.918 - 0.667 = 0.251
\end{matrix}$$

$$IV(D1\_1,脐部) = - \frac{2}{3}\log_{2}\frac{2}{3} - \frac{1}{3}\log_{2}\frac{1}{3} = 0.918$$

In summary:

All four attributes have the same information gain (0.251) and the same gain ratio (0.251/0.918 = 0.273).

One is picked at random; here we choose 色泽. Its 乌黑 branch (sample 15) is pure, which leaves the two-sample subset D1_1_1 = [6,10] with 色泽 = 青绿.

Repeating the calculation on this subset for {根蒂, 敲声, 脐部}, all three give the same information gain and gain ratio:

$$\begin{matrix}
entropy(D1\_1\_1) = - \frac{1}{2}\log_{2}\frac{1}{2} - \frac{1}{2}\log_{2}\frac{1}{2} = 1
\end{matrix}$$

$$\begin{matrix}
& entropy(D1\_1\_1,根蒂/敲声/脐部) \\
& = \frac{1}{2}\left( - \frac{1}{1}\log_{2}\frac{1}{1} - \frac{0}{1}\log_{2}\frac{0}{1} \right) + \frac{1}{2}\left( - \frac{0}{1}\log_{2}\frac{0}{1} - \frac{1}{1}\log_{2}\frac{1}{1} \right) \\
& = 0 + 0 = 0
\end{matrix}$$

$$\begin{matrix}
Gain(D1\_1\_1,根蒂/敲声/脐部) & = entropy(D1\_1\_1) - entropy(D1\_1\_1,根蒂/敲声/脐部) \\
& = 1 - 0 = 1
\end{matrix}$$

$$IV(D1\_1\_1,根蒂/敲声/脐部) = - \frac{1}{2}\log_{2}\frac{1}{2} - \frac{1}{2}\log_{2}\frac{1}{2} = 1$$

so the gain ratio is 1/1 = 1 for each of them. Any of the three can serve as the child node, which gives the decision tree below:

[Image: C4.5 decision tree for the 纹理 = 清晰 branch]

In the same way, the subset with 纹理 = 稍糊 is D2 = [7,9,13,14,17], five samples.

Since 纹理 has already been used, the remaining attributes are {色泽, 根蒂, 敲声, 脐部, 触感}.

$$\begin{matrix}
entropy(D2) = - \frac{1}{5}\log_{2}\frac{1}{5} - \frac{4}{5}\log_{2}\frac{4}{5} = 0.722
\end{matrix}$$

{色泽}

$$\begin{matrix}
& entropy(D2,色泽) \\
& = \frac{2}{5}\left( - \frac{1}{2}\log_{2}\frac{1}{2} - \frac{1}{2}\log_{2}\frac{1}{2} \right) + \frac{2}{5}\left( - \frac{2}{2}\log_{2}\frac{2}{2} - \frac{0}{2}\log_{2}\frac{0}{2} \right) \\
& \quad + \frac{1}{5}\left( - \frac{0}{1}\log_{2}\frac{0}{1} - \frac{1}{1}\log_{2}\frac{1}{1} \right) \\
& = \frac{2}{5} \times 1 + 0 + 0 = 0.4
\end{matrix}$$

$$\begin{matrix}
Gain(D2,色泽) & = entropy(D2) - entropy(D2,色泽) \
& = 0.722 - 0.4 = 0.322
\end{matrix}$$

$$IV(D2_{色泽}) = - \frac{2}{5}\log_{2}\frac{2}{5} - \frac{2}{5}\log_{2}\frac{2}{5} - \frac{1}{5}\log_{2}\frac{1}{5} = 0.529 + 0.529 + 0.464 = 1.522$$

{根蒂}

$$\begin{matrix}
& entropy(D2,根蒂) \
& = \frac{4}{5}\left( - \frac{3}{4}\log_{2}\frac{3}{4} - \frac{1}{4}\log_{2}\frac{1}{4} \right) + \frac{1}{5}\left( - \frac{0}{1}\log_{2}\frac{0}{1} - \frac{1}{1}\log_{2}\frac{1}{1} \right) \
& \
& = \frac{4}{5} \times 0.811 + 0 = 0.649
\end{matrix}$$

$$\begin{matrix}
Gain(D2,根蒂) & = entropy(D2) - entropy(D2,根蒂) \
& = 0.722 - 0.649 = 0.073
\end{matrix}$$

$$IV(D2_{根蒂}) = - \frac{4}{5}\log_{2}\frac{4}{5} - \frac{1}{5}\log_{2}\frac{1}{5} = 0.258 + 0.464 = 0.722$$

{敲声}

$$\begin{matrix}
& entropy(D2,敲声) \\
& = \frac{2}{5}\left( - \frac{1}{2}\log_{2}\frac{1}{2} - \frac{1}{2}\log_{2}\frac{1}{2} \right) + \frac{3}{5}\left( - \frac{3}{3}\log_{2}\frac{3}{3} - \frac{0}{3}\log_{2}\frac{0}{3} \right) \\
& = \frac{2}{5} \times 1 + 0 = 0.4
\end{matrix}$$

$$\begin{matrix}
Gain(D2,敲声) & = entropy(D2) - entropy(D2,敲声) \
& = 0.722 - 0.4 = 0.322
\end{matrix}$$

$$IV(D2_{敲声}) = - \frac{2}{5}\log_{2}\frac{2}{5} - \frac{3}{5}\log_{2}\frac{3}{5} = 0.529 + 0.442 = 0.971$$

{脐部} (navel)

$$\begin{matrix}
& entropy(D2,脐部) \\
& = \frac{3}{5}\left( - \frac{1}{3}\log_{2}\frac{1}{3} - \frac{2}{3}\log_{2}\frac{2}{3} \right) + \frac{2}{5}\left( - \frac{2}{2}\log_{2}\frac{2}{2} - \frac{0}{2}\log_{2}\frac{0}{2} \right) \\
& = \frac{3}{5} \times 0.918 + 0 = 0.551
\end{matrix}$$

$$\begin{matrix}
Gain(D2,脐部) & = entropy(D2) - entropy(D2,脐部) \\
& = 0.722 - 0.551 = 0.171
\end{matrix}$$

$$IV(D2,脐部) = - \frac{3}{5}\log_{2}\frac{3}{5} - \frac{2}{5}\log_{2}\frac{2}{5} = 0.442 + 0.529 = 0.971$$

{触感} (touch)

$$\begin{matrix}
& entropy(D2,触感) \\
& = \frac{4}{5}\left( - \frac{4}{4}\log_{2}\frac{4}{4} - \frac{0}{4}\log_{2}\frac{0}{4} \right) + \frac{1}{5}\left( - \frac{0}{1}\log_{2}\frac{0}{1} - \frac{1}{1}\log_{2}\frac{1}{1} \right) \\
& = 0 + 0 = 0
\end{matrix}$$

$$\begin{matrix}
Gain(D2,触感) & = entropy(D2) - entropy(D2,触感) \\
& = 0.722 - 0 = 0.722
\end{matrix}$$

$$IV(D2,触感) = - \frac{4}{5}\log_{2}\frac{4}{5} - \frac{1}{5}\log_{2}\frac{1}{5} = 0.258 + 0.464 = 0.722$$
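The per-attribute calculations for D2 can be checked with a few lines of Python. This is only a sketch: the `(good, bad)` counts are read off the fractions in the formulas above (the order inside a pair does not change the entropy), and the names `entropy`, `split_stats` and `d2_counts` are made up for this illustration.

```python
from math import log2

def entropy(counts):
    """Base-2 entropy of a list of class counts; 0*log2(0) is treated as 0."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# (good, bad) counts for each attribute value inside D2 (assumed from the text).
d2_counts = {
    "色泽": [(1, 1), (0, 2), (0, 1)],
    "根蒂": [(1, 3), (0, 1)],
    "敲声": [(1, 1), (0, 3)],
    "脐部": [(1, 2), (0, 2)],
    "触感": [(0, 4), (1, 0)],
}

def split_stats(groups):
    """Return (entropy after the split, information gain, IV) for one attribute."""
    n = sum(sum(g) for g in groups)
    class_totals = [sum(col) for col in zip(*groups)]    # good/bad totals in D2
    cond = sum(sum(g) / n * entropy(g) for g in groups)  # entropy(D2, A)
    gain = entropy(class_totals) - cond                  # Gain(D2, A)
    iv = entropy([sum(g) for g in groups])               # IV uses the subset sizes
    return cond, gain, iv

stats = {attr: split_stats(groups) for attr, groups in d2_counts.items()}
for attr, (cond, gain, iv) in stats.items():
    print(f"{attr}: entropy={cond:.3f} gain={gain:.3f} IV={iv:.3f}")
```

Note that IV is just the entropy of the subset sizes, which is why the same `entropy` helper can be reused for it.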

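In summary:

$Gain(D2,色泽) = 0.322$

$Gain(D2,根蒂) = 0.073$

$Gain(D2,敲声) = 0.322$

$Gain(D2,脐部) = 0.171$

$Gain(D2,触感) = 0.722$

The average information gain of these five attributes is:

(0.322+0.073+0.322+0.171+0.722)/5=0.322

Only {触感} (touch) has an information gain above the average, so it is the only candidate and is selected directly.

That is, under the root node {纹理} (texture), the branch 纹理 = 稍糊 takes {触感} as its child node.

At this point every branch ends in a leaf node, so the splitting is complete. The decision tree for the watermelon dataset is shown in the figure.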
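The selection step for the 纹理 = 稍糊 branch (keep the attributes whose gain is above the average, then take the best gain ratio) can be sketched as follows. The gain and IV numbers are the ones computed for D2 in the per-attribute calculations; the function name `c45_choose` and the small tolerance `eps` for ties with the average are assumptions of this sketch, not part of the original method description.

```python
# Gain and IV values for D2 (纹理 = 稍糊), taken from the calculations in the text.
gain = {"色泽": 0.322, "根蒂": 0.073, "敲声": 0.322, "脐部": 0.171, "触感": 0.722}
iv = {"色泽": 1.522, "根蒂": 0.722, "敲声": 0.971, "脐部": 0.971, "触感": 0.722}

def c45_choose(gain, iv, eps=1e-9):
    """Among attributes with above-average gain, pick the one with the best gain ratio."""
    avg = sum(gain.values()) / len(gain)
    # eps keeps float round-off from promoting a gain that merely ties the average
    candidates = [a for a, g in gain.items() if g > avg + eps]
    ratios = {a: gain[a] / iv[a] for a in candidates}
    best = max(ratios, key=ratios.get)
    return best, avg, ratios

best, avg, ratios = c45_choose(gain, iv)
print(best, round(avg, 3), ratios)
```

With these inputs only 触感 survives the filter, so the gain ratio comparison has a single candidate.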

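【Gini index (基尼指数)】(CART decision tree algorithm)

The Gini index of a dataset D:

$$Gini(D) = \sum_{k = 1}^{|\mathcal{Y}|}{\sum_{k' \neq k}p_{k}p_{k'}} = 1 - \sum_{k = 1}^{|\mathcal{Y}|}p_{k}^{2}\quad(4.5)$$

where:

- $D$: the current dataset;
- $|\mathcal{Y}|$: the number of classes;
- $p_{k}$: the proportion of samples in $D$ that belong to class $k$, i.e. $p_{k} = \frac{|D_{k}|}{|D|}$;
- $Gini(D)$: the impurity of dataset $D$.

Intuitively, $Gini(D)$ is the probability that two samples drawn at random from $D$ have different class labels.

So the smaller $Gini(D)$ is, the purer the dataset $D$ is. Using the same notation as in Eq. (4.2), the Gini index of attribute $a$ is defined as:

$$Gini(D,a) = \sum_{v = 1}^{V}\frac{|D^{v}|}{|D|}Gini(D^{v})$$

We then choose, from the candidate attribute set $A$, the attribute whose split gives the smallest Gini index as the optimal splitting attribute.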
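As a quick sanity check of the definition, the two forms of $Gini(D)$ (the double sum over pairs of different classes and the closed form with squared proportions) can be compared numerically. This is a minimal sketch; the function names are illustrative, and the counts 8 and 9 are the good and bad melons of the full dataset.

```python
def gini(counts):
    """Gini(D) = 1 - sum_k p_k^2, computed from a list of class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_pairwise(counts):
    """The same quantity as the double sum of p_k * p_k' over all k != k'."""
    total = sum(counts)
    p = [c / total for c in counts]
    return sum(p[i] * p[j]
               for i in range(len(p))
               for j in range(len(p))
               if i != j)

root = [8, 9]  # 8 good and 9 bad melons in the whole dataset
print(round(gini(root), 4), round(gini_pairwise(root), 4))
```

Both calls give 144/289, about 0.4983, which is close to the maximum of 0.5 for two classes because the dataset is almost evenly split.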

For the watermelon dataset:

Gini index of {色泽} (color):

$${Gini}_{浅白} = 1 - (0.2)^{2} - (0.8)^{2} = 1 - 0.04 - 0.64 = 0.32$$

$${Gini}_{乌黑} = 1 - \left( \frac{4}{6} \right)^{2} - \left( \frac{2}{6} \right)^{2} = 1 - \frac{16}{36} - \frac{4}{36} = 1 - \frac{20}{36} = \frac{16}{36} \approx 0.4444$$

$${Gini}_{青绿} = 1 - (0.5)^{2} - (0.5)^{2} = 1 - 0.25 - 0.25 = 0.5$$

$$Gini(D,色泽) = \frac{|D_{1}|}{|D|}Gini(D_{1}) + \frac{|D_{2}|}{|D|}Gini(D_{2}) + \frac{|D_{3}|}{|D|}Gini(D_{3})$$

$$= \frac{6}{17} \times 0.5 + \frac{6}{17} \times 0.4444 + \frac{5}{17} \times 0.32 \approx 0.4274$$

Gini index of {根蒂} (root):

$${Gini}_{蜷缩} = 1 - \left( \frac{5}{8} \right)^{2} - \left( \frac{3}{8} \right)^{2} = 1 - \frac{25}{64} - \frac{9}{64} = \frac{30}{64} = 0.46875$$

$${Gini}_{稍蜷} = 1 - \left( \frac{3}{7} \right)^{2} - \left( \frac{4}{7} \right)^{2} = 1 - \frac{9}{49} - \frac{16}{49} = \frac{24}{49} \approx 0.4898$$

$${Gini}_{硬挺} = 1 - \left( \frac{0}{2} \right)^{2} - \left( \frac{2}{2} \right)^{2} = 1 - 0 - 1 = 0$$

$$Gini(D,根蒂) = \frac{8}{17} \times 0.46875 + \frac{7}{17} \times 0.4898 + \frac{2}{17} \times 0 \approx 0.4223$$

Gini index of {敲声} (knock sound):

$${Gini}_{沉闷} = 1 - \left( \frac{2}{5} \right)^{2} - \left( \frac{3}{5} \right)^{2} = 1 - \frac{4}{25} - \frac{9}{25} = \frac{12}{25} = 0.48$$

$${Gini}_{浊响} = 1 - \left( \frac{4}{10} \right)^{2} - \left( \frac{6}{10} \right)^{2} = 1 - \frac{4}{25} - \frac{9}{25} = \frac{12}{25} = 0.48$$

$${Gini}_{清脆} = 1 - \left( \frac{0}{2} \right)^{2} - \left( \frac{2}{2} \right)^{2} = 1 - 0 - 1 = 0$$

$$Gini(D,敲声) = \frac{5}{17} \times 0.48 + \frac{10}{17} \times 0.48 + \frac{2}{17} \times 0 \approx 0.4235$$

Gini index of {纹理} (texture):

$${Gini}_{清晰} = 1 - \left( \frac{7}{9} \right)^{2} - \left( \frac{2}{9} \right)^{2} = 1 - \frac{49}{81} - \frac{4}{81} = \frac{28}{81} \approx 0.3457$$

$${Gini}_{稍糊} = 1 - \left( \frac{1}{5} \right)^{2} - \left( \frac{4}{5} \right)^{2} = 1 - \frac{1}{25} - \frac{16}{25} = \frac{8}{25} = 0.32$$

$${Gini}_{模糊} = 1 - \left( \frac{0}{3} \right)^{2} - \left( \frac{3}{3} \right)^{2} = 1 - 0 - 1 = 0$$

$$Gini(D,纹理) = \frac{9}{17} \times 0.3457 + \frac{5}{17} \times 0.32 + \frac{3}{17} \times 0 \approx 0.2771$$

Gini index of {脐部} (navel):

$${Gini}_{凹陷} = 1 - \left( \frac{5}{7} \right)^{2} - \left( \frac{2}{7} \right)^{2} = 1 - \frac{25}{49} - \frac{4}{49} = \frac{20}{49} \approx 0.4082$$

$${Gini}_{稍凹} = 1 - \left( \frac{3}{6} \right)^{2} - \left( \frac{3}{6} \right)^{2} = 1 - 0.25 - 0.25 = 0.5$$

$${Gini}_{平坦} = 1 - \left( \frac{0}{4} \right)^{2} - \left( \frac{4}{4} \right)^{2} = 1 - 0 - 1 = 0$$

$$Gini(D,脐部) = \frac{7}{17} \times 0.4082 + \frac{6}{17} \times 0.5 + \frac{4}{17} \times 0 \approx 0.3446$$

Gini index of {触感} (touch):

$${Gini}_{硬滑} = 1 - \left( \frac{6}{12} \right)^{2} - \left( \frac{6}{12} \right)^{2} = 1 - 0.25 - 0.25 = 0.5$$

$${Gini}_{软粘} = 1 - \left( \frac{2}{5} \right)^{2} - \left( \frac{3}{5} \right)^{2} = 1 - \frac{4}{25} - \frac{9}{25} = \frac{12}{25} = 0.48$$

$$Gini(D,触感) = \frac{12}{17} \times 0.5 + \frac{5}{17} \times 0.48 \approx 0.4941$$

Of the six attributes, {纹理} (texture) has the smallest Gini index, so the root node is {纹理}.
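The comparison across all six attributes can be reproduced with a short script. It is a sketch under stated assumptions: the `(good, bad)` counts per attribute value are read off the fractions in the formulas (the order inside a pair does not affect the Gini value), and the names `gini`, `gini_index` and `counts` are illustrative only.

```python
def gini(counts):
    """Gini(D) = 1 - sum_k p_k^2 for a list of class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_index(groups):
    """Weighted Gini of a split: sum over v of |D^v|/|D| * Gini(D^v)."""
    n = sum(sum(g) for g in groups)
    return sum(sum(g) / n * gini(g) for g in groups)

# (good, bad) counts for every value of every attribute (assumed from the text).
counts = {
    "色泽": [(3, 3), (4, 2), (1, 4)],
    "根蒂": [(5, 3), (3, 4), (0, 2)],
    "敲声": [(2, 3), (6, 4), (0, 2)],
    "纹理": [(7, 2), (1, 4), (0, 3)],
    "脐部": [(5, 2), (3, 3), (0, 4)],
    "触感": [(6, 6), (2, 3)],
}

scores = {attr: gini_index(groups) for attr, groups in counts.items()}
best = min(scores, key=scores.get)  # CART picks the smallest Gini index
for attr, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{attr}: {score:.4f}")
print("root:", best)
```

Because the script works with exact fractions rather than rounded intermediate values, its last digit can differ slightly from a hand calculation, but the ranking and the choice of 纹理 are unaffected.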


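The remaining nodes are computed in the same way as for the ID3 and C4.5 decision tree algorithms above, so the steps are not repeated here.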

posted @ 2025-11-16 22:51  落下的秋叶林