
The concept of a decision tree is actually not hard to understand; a classic example is the decision tree a girl uses when deciding whether to meet someone on a blind date.

Basically it works like this: you have a pile of data records, each with several attributes, and every record ends with a class label (meet or don't meet). Each attribute can be used to split the data (for example, age > 30 versus age <= 30), and the tree built this way is what we call a decision tree. The decision rules sit on the nodes, so the model is easy to read and classifies well.

So why does the root node test age rather than looks? The implementation here uses the ID3 algorithm: the attribute for each node is chosen with information theory, specifically information gain. Information gain is the entropy of the original dataset minus the (weighted) entropy of the dataset after splitting on a given attribute. The larger the information gain, the better (it means the subsets produced by that attribute are purer), so we pick the attribute with the largest information gain as the label of the current node, and then build the tree recursively.
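
Written out (these are the standard ID3 definitions, and they match the calcShannonEnt and findBestSplit functions below):

$$
H(D) = -\sum_{k} p_k \log_2 p_k ,\qquad
\operatorname{Gain}(D, A) = H(D) - \sum_{v \in \operatorname{values}(A)} \frac{|D_v|}{|D|}\, H(D_v)
$$

where p_k is the fraction of records in D belonging to class k, and D_v is the subset of D whose attribute A takes the value v.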

First we construct the dataset (the operator and log imports needed by the later functions are included here as well):

import operator
from math import log

def createDataSet():
	# each record is [no surfacing, flippers, class label]
	dataSet = [[1,1,'yes'],[1,1,'yes'],[1,0,'no'],[0,1,'no'],[0,1,'no']]
	features = ['no surfacing','flippers']
	return dataSet,features


Building the decision tree (the tree is built recursively as nested Python dicts; the code mostly speaks for itself):

def treeGrowth(dataSet,features):
	classList = [example[-1] for example in dataSet]
	if classList.count(classList[0])==len(classList):
		return classList[0]
	if len(dataSet[0])==1:# no more features
		return classify(classList)

	bestFeat = findBestSplit(dataSet)#bestFeat is the index of best feature
	bestFeatLabel = features[bestFeat]
	myTree = {bestFeatLabel:{}}
	featValues = [example[bestFeat] for example in dataSet]
	uniqueFeatValues = set(featValues)
	del(features[bestFeat])
	for value in uniqueFeatValues:
		subDataSet = splitDataSet(dataSet,bestFeat,value)
		# pass a copy of the remaining feature names so sibling branches don't corrupt each other's list
		myTree[bestFeatLabel][value] = treeGrowth(subDataSet,features[:])
	return myTree

 

When there are no features left but the remaining samples do not all belong to the same class, take the class that occurs most often:

def classify(classList):
	'''
	return the class label that occurs most often in classList (majority vote)
	'''
	classCount = {}
	for vote in classList:
		if vote not in classCount.keys():
			classCount[vote] = 0
		classCount[vote] += 1
	sortedClassCount = sorted(classCount.iteritems(),key = operator.itemgetter(1),reverse = True)
	return sortedClassCount[0][0]
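
As a side note, the same majority vote can be written more compactly with the standard library's collections.Counter (just an equivalent sketch; the rest of the post uses the version above):

from collections import Counter

def classify(classList):
	# majority vote: the most common label in classList
	return Counter(classList).most_common(1)[0][0]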


 

Finding the best attribute to split on (loop over all attributes and compute the information gain of each):

def findBestSplit(dataset):
	numFeatures = len(dataset[0])-1
	baseEntropy = calcShannonEnt(dataset)
	bestInfoGain = 0.0
	bestFeat = -1
	for i in range(numFeatures):
		featValues = [example[i] for example in dataset]
		uniqueFeatValues = set(featValues)
		newEntropy = 0.0
		# weighted entropy of the subsets produced by splitting on feature i
		for val in uniqueFeatValues:
			subDataSet = splitDataSet(dataset,i,val)
			prob = len(subDataSet)/float(len(dataset))
			newEntropy += prob*calcShannonEnt(subDataSet)
		infoGain = baseEntropy - newEntropy
		if infoGain > bestInfoGain:
			bestInfoGain = infoGain
			bestFeat = i
	return bestFeat
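
On the toy dataset the information gains come out to roughly 0.420 for 'no surfacing' and 0.171 for 'flippers', so index 0 is returned. A quick sanity check, assuming all of the functions from this post live in the same file:

dataset, features = createDataSet()
print(findBestSplit(dataset))   # prints 0, i.e. 'no surfacing'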

 

Once the split attribute has been chosen, split the dataset on it:

def splitDataSet(dataset,feat,value):
	# keep the records whose column `feat` equals `value`, with that column removed
	retDataSet = []
	for featVec in dataset:
		if featVec[feat] == value:
			reducedFeatVec = featVec[:feat]
			reducedFeatVec.extend(featVec[feat+1:])
			retDataSet.append(reducedFeatVec)
	return retDataSet
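
For example, splitting the toy dataset on column 0 ('no surfacing') with value 1 keeps three records and drops that column (a quick check, using the createDataSet from above):

dataset, features = createDataSet()
print(splitDataSet(dataset, 0, 1))   # [[1, 'yes'], [1, 'yes'], [0, 'no']]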


Computing the entropy of a dataset:

def calcShannonEnt(dataset):
	# Shannon entropy of the class labels (the last column of each record)
	numEntries = len(dataset)
	labelCounts = {}
	for featVec in dataset:
		currentLabel = featVec[-1]
		if currentLabel not in labelCounts.keys():
			labelCounts[currentLabel] = 0
		labelCounts[currentLabel] += 1
	shannonEnt = 0.0

	for key in labelCounts:
		prob = float(labelCounts[key])/numEntries
		if prob != 0:
			shannonEnt -= prob*log(prob,2)
	return shannonEnt
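
On the toy dataset (2 'yes' versus 3 'no') this gives -(2/5)·log2(2/5) - (3/5)·log2(3/5) ≈ 0.971, which can be checked directly:

dataset, features = createDataSet()
print(calcShannonEnt(dataset))   # ~0.9710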


Now use the tree built above to classify data:

def predict(tree,newObject):
	# walk down the tree until we reach a leaf (a plain class label)
	while isinstance(tree,dict):
		key = tree.keys()[0]
		tree = tree[key][newObject[key]]
	return tree
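
Note that predict assumes every attribute value in the query already appears in the tree; an unseen value raises a KeyError. A more defensive variant might look like this (a sketch only — predictSafe and its default argument are not part of the original code):

def predictSafe(tree, newObject, default='no'):
	# hypothetical helper: fall back to a default class
	# when the query contains a value the tree never saw
	while isinstance(tree, dict):
		feat = list(tree.keys())[0]
		branches = tree[feat]
		if newObject.get(feat) not in branches:
			return default
		tree = branches[newObject[feat]]
	return tree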

if __name__ == '__main__':
	dataset,features = createDataSet()
	tree = treeGrowth(dataset,features)
	print tree
	print predict(tree,{'no surfacing':1,'flippers':1})
	print predict(tree,{'no surfacing':1,'flippers':0})
	print predict(tree,{'no surfacing':0,'flippers':1})
	print predict(tree,{'no surfacing':0,'flippers':0})


The results are as follows:

The decision tree looks like this:

{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

And the four predictions:

yes
no
no
no

These match the labels in the given dataset (the query records were taken straight from that dataset; with more data, you would normally use part of it for training and keep the rest as test data).

 

To summarize the pros and cons of ID3:

Pros: it is fairly simple to implement, and the resulting rules are clear and easy to understand when drawn as a tree; classification performance is good.

Cons: it can only handle discrete attribute values (C4.5 is used for continuous ones), and when picking the best split attribute it tends to favor attributes with many distinct values.
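
For reference, C4.5 counters the second problem by normalizing the gain with the attribute's intrinsic value, which penalizes attributes with many values (background only; not implemented in the code above):

$$
\operatorname{GainRatio}(D, A) = \frac{\operatorname{Gain}(D, A)}{\operatorname{IV}(A)},\qquad
\operatorname{IV}(A) = -\sum_{v} \frac{|D_v|}{|D|} \log_2 \frac{|D_v|}{|D|}
$$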



 

 
