Classification week 3: decision tree notes

Notes on the University of Washington Machine Learning: Classification course, week 3.

Step 2: select the feature to split on, i.e. the feature whose split gives the lowest classification error.

   

  Note:

    

    How the mistakes are counted:

      Given the data set M at a node and a candidate feature h_i(x), split M according to the value of h_i(x).

      In each resulting branch, the class with the most data points is the majority class.

      Every data point outside its branch's majority class counts as a mistake, so

     classification error = (left_mistakes + right_mistakes) / num_data_points
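The tree-building code below calls two helpers that implement exactly this computation but never defines them. Here is a minimal sketch of what they might look like, assuming ±1 labels and binary 0/1 features in a pandas-style data frame; these definitions are reconstructed from how the code calls them, not taken from the notes:

def intermediate_node_num_mistakes(labels_in_node):
    # Mistakes made by predicting the majority class:
    # the count of whichever label (+1 or -1) is in the minority.
    if len(labels_in_node) == 0:
        return 0
    num_positive = (labels_in_node == +1).sum()
    num_negative = (labels_in_node == -1).sum()
    return min(num_positive, num_negative)

def best_splitting_feature(data, features, target):
    # Try every remaining feature and keep the one whose split
    # gives the lowest classification error, as defined above.
    best_feature, best_error = None, float('inf')
    num_data_points = float(len(data))
    for feature in features:
        left_split = data[data[feature] == 0]
        right_split = data[data[feature] == 1]
        left_mistakes = intermediate_node_num_mistakes(left_split[target])
        right_mistakes = intermediate_node_num_mistakes(right_split[target])
        error = (left_mistakes + right_mistakes) / num_data_points
        if error < best_error:
            best_feature, best_error = feature, error
    return best_feature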

 

Step 3: build the tree

  To prevent overfitting, two strategies are used: early stopping and pruning.

  

    1. Early stopping:

    Stopping conditions: besides the two basic cases (the node is pure, or no features remain), stop early when the maximum depth is reached, the node falls below the minimum size, or the best split reduces the classification error by less than a given threshold.

    

    Tree-building procedure:

def decision_tree_create(data, features, target, current_depth=0,
                         max_depth=10, min_node_size=1,
                         min_error_reduction=0.0):

    remaining_features = features[:]
    target_values = data[target]

    # Stopping condition 1: all data points in this node have the same label.
    if intermediate_node_num_mistakes(target_values) == 0:
        return create_leaf(target_values)

    # Stopping condition 2: no more features to split on.
    if not remaining_features:
        return create_leaf(target_values)

    # Early stopping condition 1: reached the maximum depth.
    if current_depth >= max_depth:
        return create_leaf(target_values)

    # Early stopping condition 2: reached the minimum node size.
    if reached_minimum_node_size(data, min_node_size):
        return create_leaf(target_values)

    # Find the best splitting feature and split on it.
    splitting_feature = best_splitting_feature(data, remaining_features, target)
    left_split = data[data[splitting_feature] == 0]
    right_split = data[data[splitting_feature] == 1]

    # Classification error before and after the split.
    error_before_split = intermediate_node_num_mistakes(target_values) / float(len(data))
    left_mistakes = intermediate_node_num_mistakes(left_split[target])
    right_mistakes = intermediate_node_num_mistakes(right_split[target])
    error_after_split = (left_mistakes + right_mistakes) / float(len(data))

    # Early stopping condition 3: the split does not reduce the error enough.
    if error_before_split - error_after_split < min_error_reduction:
        return create_leaf(target_values)

    remaining_features.remove(splitting_feature)

    # Repeat (recurse) on the left and right subtrees.
    left_tree = decision_tree_create(left_split, remaining_features, target,
                                     current_depth + 1, max_depth, min_node_size, min_error_reduction)
    right_tree = decision_tree_create(right_split, remaining_features, target,
                                      current_depth + 1, max_depth, min_node_size, min_error_reduction)

    return create_node(splitting_feature, left_tree, right_tree)
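
The remaining helpers are also assumed rather than shown in the notes. A sketch consistent with how decision_tree_create and classify use them; the dictionary layout of leaves and internal nodes is what the classify function below expects:

def reached_minimum_node_size(data, min_node_size):
    # Early-stopping check: stop if the node holds too few data points.
    return len(data) <= min_node_size

def create_leaf(target_values):
    # A leaf predicts the majority class of the labels that reach it.
    num_positive = (target_values == +1).sum()
    num_negative = (target_values == -1).sum()
    prediction = +1 if num_positive > num_negative else -1
    return {'splitting_feature': None, 'left': None, 'right': None,
            'is_leaf': True, 'prediction': prediction}

def create_node(splitting_feature, left_tree, right_tree):
    # An internal node records the feature it splits on and its two subtrees.
    return {'splitting_feature': splitting_feature,
            'left': left_tree, 'right': right_tree,
            'is_leaf': False, 'prediction': None}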

 

    2. Pruning: grow the tree fully, then cut back subtrees whose complexity is not justified, as measured by the total cost

      Total cost C(T) = Error(T) + λ L(T)

      where Error(T) is the classification error of tree T on the training data, L(T) is the number of leaves, and λ ≥ 0 controls the trade-off between fit and tree size.
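
As a sketch only (these helper names are mine, not from the course code), the total cost could be computed like this, reusing the classify function from the next section and assuming a pandas data frame:

def count_leaves(tree):
    # L(T): the number of leaves in the tree.
    if tree['is_leaf']:
        return 1
    return count_leaves(tree['left']) + count_leaves(tree['right'])

def tree_error(tree, data, target):
    # Error(T): the fraction of data points the tree misclassifies.
    predictions = data.apply(lambda row: classify(tree, row), axis=1)
    return (predictions != data[target]).sum() / float(len(data))

def total_cost(tree, data, target, lam):
    # C(T) = Error(T) + lambda * L(T); a subtree is pruned when replacing
    # it with a leaf does not increase this cost.
    return tree_error(tree, data, target) + lam * count_leaves(tree)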

      

 

 

Predicting with the built tree:

  

def classify(tree, x):
    # If the node is a leaf, return its stored prediction.
    if tree['is_leaf']:
        return tree['prediction']
    else:
        # Otherwise follow the branch matching this feature's value.
        split_feature_value = x[tree['splitting_feature']]
        if split_feature_value == 0:
            return classify(tree['left'], x)
        else:
            return classify(tree['right'], x)
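
A made-up end-to-end example (the toy data is mine, and it assumes pandas plus the helper sketches above): here grade_A perfectly predicts safe_loans, so the tree splits on it once and classifies correctly.

import pandas as pd

toy_data = pd.DataFrame({
    'grade_A':    [1, 1, 0, 0, 1, 0],
    'term_36':    [1, 0, 1, 0, 0, 1],
    'safe_loans': [+1, +1, -1, -1, +1, -1],
})

tree = decision_tree_create(toy_data, ['grade_A', 'term_36'], 'safe_loans',
                            max_depth=6, min_node_size=1,
                            min_error_reduction=0.0)

print(classify(tree, {'grade_A': 1, 'term_36': 0}))  # prints 1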

 

   

    

 

posted @ 2018-03-16 11:06 smartweed