CS229: decision tree

Decision tree

A model for handling non-linear classification problems (and, as shown below, regression).

Greedy Top-down Recursive Partitioning

  • ask a sequence of questions that partition the input space into disjoint regions

  • split function

    • start from a parent region R_{p}

    • look for a split S_{p}(j,t), where j is the feature index and t is the threshold value

    • the split divides R_{p} into R_{1}=\{x\in R_{p}: x_{j}<t\} and R_{2}=\{x\in R_{p}: x_{j}\geq t\}

  • how to choose splits

    • define L(R): loss on R

    • define \hat{p}_{c} to be the proportion of examples in R that belong to class c (the accuracy we would get by predicting class c everywhere in R)

    • L_{misclass} = 1 - \max_{c} \hat{p}_{c}: the misclassification loss, i.e. the fraction of examples in R outside the majority class

    • choose \max_{j,t}\; L(R_{p})-\left(L(R_{1})+L(R_{2})\right), the split with the largest decrease in loss

  • Misclassification loss has issues

    • it cannot reward a split that makes a child region purer (captures more of one class) without changing which class is the majority, so the loss does not move

    • define the cross-entropy loss L_{cross} = -\sum_{c} \hat{p}_{c}\log_{2}\hat{p}_{c} (from information theory)

      • we can also use the Gini loss \sum_{c}\hat{p}_{c}\left(1-\hat{p}_{c}\right)

  • geometric intuition

    • plot the loss L(R) as a function of \hat{p}_{c}

      • \hat{p}_{parent} = \dfrac{\hat{p}_{1}+\hat{p}_{2}}{2} (assuming an even split; the argument does not depend on this)

      • for a strictly concave loss (cross-entropy, Gini), L(R_{parent}) \geq \dfrac{L(R_{1})+L(R_{2})}{2}, so every useful split gives a decline in loss

    • for the misclassification loss, which is linear between its corner points, the two sides can be equal, so we may see no decline at all
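
A minimal sketch of the greedy split search above, assuming the cross-entropy loss; the function names (`cross_entropy_loss`, `best_split`) are mine, and I weight each child's loss by its size, a common refinement of the unweighted criterion written above:

```python
import numpy as np

def cross_entropy_loss(y):
    """L_cross(R) = -sum_c p_c * log2(p_c) over the integer class labels y in a region."""
    if len(y) == 0:
        return 0.0
    p = np.bincount(y) / len(y)
    p = p[p > 0]                              # skip empty classes to avoid log(0)
    return -np.sum(p * np.log2(p))

def best_split(X, y):
    """Search all (feature j, threshold t) pairs and return the one with the
    largest decrease: L(R_p) minus the size-weighted loss of (R_1, R_2)."""
    n, f = X.shape
    parent_loss = cross_entropy_loss(y)
    best_j, best_t, best_gain = None, None, 0.0
    for j in range(f):
        for t in np.unique(X[:, j]):
            left = X[:, j] < t                # R_1 = {x in R_p : x_j < t}
            right = ~left                     # R_2 = {x in R_p : x_j >= t}
            if left.sum() == 0 or right.sum() == 0:
                continue
            child_loss = (left.sum() * cross_entropy_loss(y[left]) +
                          right.sum() * cross_entropy_loss(y[right])) / n
            gain = parent_loss - child_loss
            if gain > best_gain:
                best_j, best_t, best_gain = j, t, gain
    return best_j, best_t, best_gain
```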

regression tree

  • instead of predicting the majority class, we should predict the mean.

  • Prediction

    • region: R_{m}

    • \hat{y_{m}}=\sum _{i\in R_{m}}\dfrac{y_{i}}{|R_{m}|}

    • L_{square} = \dfrac{\sum _{i\in R_{m}}\left( y_{i}-\hat{y}_{m}\right) ^{2}}{|R_{m}|}
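
A tiny numpy version of the two formulas above (the helper names are mine): the leaf prediction is the mean target in the region, and the region's loss is the mean squared deviation from that prediction.

```python
import numpy as np

def leaf_prediction(y_region):
    # \hat{y}_m = sum_{i in R_m} y_i / |R_m|
    return np.mean(y_region)

def squared_loss(y_region):
    # L_square = sum_{i in R_m} (y_i - \hat{y}_m)^2 / |R_m|
    return np.mean((y_region - np.mean(y_region)) ** 2)

# e.g. targets [1.0, 2.0, 3.0] give prediction 2.0 and loss 2/3
```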

Categorical variable

  • remember that the number of possible splits grows exponentially with the number of category values: q values admit 2^{q-1}-1 binary partitions (e.g. q = 10 already gives 511)

  • decision trees are high-variance models

    • which motivates regularization

regularization

  1. min leaf size

  2. max tree depth

  3. max node number

  4. min decrease in loss

  5. pruning (grow the tree out, then prune nodes back using misclassification error on a validation set)
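
If one were using scikit-learn (my assumption about tooling, not something from the lecture), the five knobs above map roughly onto DecisionTreeClassifier parameters; note that ccp_alpha performs cost-complexity pruning, a different scheme from the validation-set pruning in point 5:

```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    min_samples_leaf=5,           # 1. minimum leaf size
    max_depth=10,                 # 2. maximum tree depth
    max_leaf_nodes=64,            # 3. maximum number of leaf nodes
    min_impurity_decrease=1e-3,   # 4. minimum decrease in loss to keep a split
    ccp_alpha=0.01,               # 5. pruning (cost-complexity, not validation-set)
)
```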

Runtime

  • define

    • n examples

    • f features

    • d depth

  • test time: O(d), where typically d \leq \log_{2}n (each leaf holds at least one example and the tree is roughly balanced)

  • train time: each example appears in O(d) nodes, and at each node we scan O(f) features to find the best split

    • total cost O(nfd)

    • equivalently, about d passes over the n \times f data matrix
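
    • as a rough worked example (illustrative numbers, not from the lecture): n = 10^{4}, f = 100, d \approx \log_{2}10^{4} \approx 13 gives nfd \approx 1.3\times 10^{7} example-feature evaluations, i.e. about 13 passes over the 10^{6}-entry data matrix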

Downside

  • No additive structure: trees struggle when the target depends additively on several features (e.g. something close to f_{1}(x_{1})+f_{2}(x_{2})), because axis-aligned splits have to approximate that structure with many small regions.

Recap

  • advantages

    • easy to explain

    • interpretable

    • categorical var

    • fast

  • disadvantages

    • high variance

    • struggles with additive structure

    • lower predictive accuracy (on its own)

Ensembling

  • let x_{1},\cdots,x_{n} be random variables, independent and identically distributed

    • (often not independent)

  • Var(x) = \sigma^{2}

  • Var(\overline{x}) = Var(\Sigma_{i}x_{i}/n) = \sigma^{2}/n

  • if not independent:

    • correlated by \rho

    • Var(\overline{x}) = \rho\sigma^{2}+\dfrac{1-\rho}{n}\sigma^{2}
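
  • filling in the step behind the correlated case (a standard derivation; \rho is the pairwise correlation of identically distributed x_{i}):

    • Var(\overline{x}) = \dfrac{1}{n^{2}}\left(\sum_{i}Var(x_{i})+\sum_{i\neq j}Cov(x_{i},x_{j})\right) = \dfrac{1}{n^{2}}\left(n\sigma^{2}+n(n-1)\rho\sigma^{2}\right) = \rho\sigma^{2}+\dfrac{1-\rho}{n}\sigma^{2}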

ways to ensemble

  1. different algorithms

  2. different training sets (not very useful on its own)

  3. bagging (random forests)

  4. boosting (AdaBoost, XGBoost)

Bagging: Bootstrap Aggregation

  • true population: P

  • training set: S, sampled from P

  • bootstrapping assumes S = P, i.e. treats the empirical distribution of S as the population

  • draw a bootstrap sample Z from S by sampling |S| points with replacement, and repeat; each Z contains roughly 1-\dfrac{1}{e} \approx 63\% of the distinct examples in S

  • bootstrap samples Z_{1},\cdots,Z_{M}: train model G_{m} on Z_{m}

  • aggregate: G(x) = \dfrac{\sum^{M}_{m=1}G_{m}(x)}{M}
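
A minimal sketch of the bagging loop above, assuming a scikit-learn-style base learner whose fit returns the fitted model; the helper names are mine:

```python
import numpy as np

def bagging(X, y, base_learner, M, seed=0):
    """Train G_1..G_M, each on a bootstrap sample Z_m of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(M):
        idx = rng.integers(0, n, size=n)      # sample n indices with replacement
        models.append(base_learner().fit(X[idx], y[idx]))
    return models

def bagged_predict(models, X):
    """G(x) = (1/M) * sum_m G_m(x): average the individual predictions."""
    return np.mean([m.predict(X) for m in models], axis=0)
```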

  • why it works: bias-variance analysis

    • Var(\overline{x}) = \rho\sigma^{2}+\dfrac{1-\rho}{n}\sigma^{2}

    • bootstrapping drives down \rho: each model is trained on a different resampled dataset, so the models' errors are less correlated

    • increasing the number of models M (the n in the formula) shrinks the \dfrac{1-\rho}{M}\sigma^{2} term, so variance keeps dropping without overfitting

    • the bias increases slightly, because each model sees a random subsample rather than the full training set

DTs and Bagging

  • DT: high variance, low bias

  • a natural fit for bagging

random forest

  • at each split, only consider a random subset of the features

    • this decreases \rho

    • i.e. the individual trees are less correlated with one another
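
In scikit-learn terms (again my assumption about tooling, not from the lecture), the per-split feature subsampling is the max_features parameter; a smaller value gives less correlated trees at the cost of slightly weaker individual ones:

```python
from sklearn.ensemble import RandomForestClassifier

# each split considers only ~sqrt(f) randomly chosen features,
# which lowers rho between trees; averaging then shrinks the variance
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt")
```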

boosting

  • decrease bias

  • builds the ensemble additively (each new model is added on top of the previous ones), which helps capture the additive structure single trees miss

  • reweight the training examples so that the ones the previous classifier got wrong receive more weight

  • each classifier's weight in the ensemble is determined by its error rate

    • AdaBoost: \alpha_{m} \propto \log\left(\dfrac{1-err_{m}}{err_{m}}\right)

    • G(x) = \sum^{M}_{m=1}\alpha_{m}G_{m}(x) (a weighted sum rather than an average)
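
A sketch of the AdaBoost loop the bullets above describe, for binary labels y in \{-1, +1\}; it assumes a weak learner that accepts sample_weight (scikit-learn style), and the helper names are mine:

```python
import numpy as np

def adaboost(X, y, weak_learner, M):
    """y in {-1, +1}. Returns the classifiers G_m and their weights alpha_m."""
    n = len(X)
    w = np.full(n, 1.0 / n)                    # example weights, start uniform
    models, alphas = [], []
    for _ in range(M):
        G = weak_learner().fit(X, y, sample_weight=w)
        miss = (G.predict(X) != y)
        err = np.clip(np.sum(w * miss) / np.sum(w), 1e-10, 1 - 1e-10)
        alpha = np.log((1 - err) / err)        # classifier weight ~ log((1-err)/err)
        w *= np.exp(alpha * miss)              # upweight the examples it got wrong
        w /= w.sum()
        models.append(G)
        alphas.append(alpha)
    return models, alphas

def boosted_predict(models, alphas, X):
    # G(x) = sign( sum_m alpha_m * G_m(x) )
    return np.sign(sum(a * m.predict(X) for a, m in zip(alphas, models)))
```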
