CS229: Decision Trees
Decision trees give one way to deal with nonlinear classification.
Greedy Top-down Recursive Partitioning
- Ask a sequence of questions that divide the whole input space into disjoint regions.
- Each question sorts the examples into one side or the other: compare a single feature against a threshold.
- For a parent region R_{p}:
  - look for a split S_{p}(j, t), where j indexes a feature and t is a threshold
  - the split divides R_{p} into R_{1} = \{x \in R_{p}: x_{j} < t\} and R_{2} = \{x \in R_{p}: x_{j} \geq t\}
How to choose splits
- Define L(R): the loss on region R.
- Define \hat{p}_{c} to be the proportion of examples in R that are of class c.
- Misclassification loss: L_{misclass}(R) = 1 - \max_{c}\hat{p}_{c}, i.e. the fraction of examples in R that do not belong to the majority class.
- Greedily pick the split that maximizes the decrease in loss: \max_{j,t}\; L(R_{p}) - (L(R_{1}) + L(R_{2})) (more carefully, the children's losses should be weighted by |R_{1}| and |R_{2}|).
Misclassification loss has issues
- It cannot reward a split that separates out part of the majority class: if both children still predict the majority class, the number of misclassified examples does not change, so the measured decrease in loss is zero even for a genuinely informative split.
- Cross-entropy loss: L_{cross}(R) = -\sum_{c}\hat{p}_{c}\log\hat{p}_{c} (from information theory).
- We can also use the Gini loss: L_{Gini}(R) = \sum_{c}\hat{p}_{c}(1-\hat{p}_{c}). A small code sketch of these losses and the split criterion follows below.
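The sketch below is my own illustration, not from the lecture: it computes the three per-region losses and does the greedy search over (feature, threshold), using the size-weighted version of the decrease in loss. All function names are assumptions made up here.

```python
import numpy as np

def class_proportions(y):
    # \hat{p}_c for each class c present in the region
    _, counts = np.unique(y, return_counts=True)
    return counts / counts.sum()

def misclass_loss(y):
    return 1.0 - class_proportions(y).max()      # L_misclass = 1 - max_c p_c

def cross_entropy_loss(y):
    p = class_proportions(y)
    return -np.sum(p * np.log2(p))               # L_cross = -sum_c p_c log p_c

def gini_loss(y):
    p = class_proportions(y)
    return np.sum(p * (1.0 - p))                 # L_gini = sum_c p_c (1 - p_c)

def best_split(X, y, loss=cross_entropy_loss):
    """Greedy search over (feature j, threshold t), maximizing the weighted decrease
    L(R_p) - (|R_1| L(R_1) + |R_2| L(R_2)) / |R_p|."""
    n, f = X.shape
    parent = loss(y)
    best = (None, None, 0.0)                     # (j, t, decrease)
    for j in range(f):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] < t], y[X[:, j] >= t]
            if len(left) == 0 or len(right) == 0:
                continue
            children = (len(left) * loss(left) + len(right) * loss(right)) / n
            decrease = parent - children
            if decrease > best[2]:
                best = (j, t, decrease)
    return best
```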
Geometric view
- Plot L(R) as a function of \hat{p} (two classes) and compare the losses.
- \hat{p}_{parent} = \dfrac{\hat{p}_{1}+\hat{p}_{2}}{2} (suppose an even split; the argument is the same with a weighted average).
- If L is strictly concave in \hat{p} (cross-entropy, Gini), then L(R_{parent}) \geq \dfrac{L(R_{1})+L(R_{2})}{2}, with strict inequality whenever \hat{p}_{1} \neq \hat{p}_{2}, so any informative split yields a decline in loss.
- For the misclassification loss, which is piecewise linear in \hat{p}, many splits show no decline at all. A numeric example follows.
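A small numeric illustration (my own numbers, not from the lecture): let the parent hold 90 examples of class A and 10 of class B, and split it evenly into R_{1} with 50 A / 0 B and R_{2} with 40 A / 10 B.
- Misclassification: L(R_{p}) = 0.1, L(R_{1}) = 0, L(R_{2}) = 0.2, so the average child loss is 0.1 and the measured decrease is zero.
- Cross-entropy (base 2): L(R_{p}) \approx 0.469, L(R_{1}) = 0, L(R_{2}) \approx 0.722, so the average child loss is \approx 0.361 and the split is rewarded with a decrease of \approx 0.108.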
Regression trees
- Instead of predicting the majority class, predict the mean of the targets in the region.
- Prediction in region R_{m}: \hat{y}_{m}=\sum _{i\in R_{m}}\dfrac{y_{i}}{|R_{m}|}
- Squared loss: L_{square}=\dfrac{\sum _{i\in R_{m}}\left( y_{i}-\hat{y}_{m}\right) ^{2}}{|R_{m}|} (a short sketch follows below)
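A short sketch of these two quantities, assuming y_region is a 1-D numpy array holding the targets of the examples in R_{m} (names are mine). The same greedy search sketched earlier applies unchanged with loss=squared_loss.

```python
import numpy as np

def region_prediction(y_region):
    # \hat{y}_m: the mean target over the region
    return y_region.mean()

def squared_loss(y_region):
    # L_square: mean squared error of the region against its own mean
    return np.mean((y_region - y_region.mean()) ** 2)
```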
Categorical variables
- Trees can split on categorical variables directly, but remember that the number of possible splits grows exponentially: with q categories there are 2^{q-1}-1 candidate binary splits.
- A fully grown decision tree is a high-variance model, which motivates regularization.
Regularization
- minimum leaf size
- maximum tree depth
- maximum number of nodes
- minimum decrease in loss
- pruning (grow the full tree, then prune back nodes using misclassification error on a validation set); a library example of these knobs follows below
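For reference, these heuristics map onto standard library parameters; a hedged example using scikit-learn's DecisionTreeClassifier (the particular values are arbitrary):

```python
from sklearn.tree import DecisionTreeClassifier

# Each keyword mirrors one of the heuristics above.
tree = DecisionTreeClassifier(
    min_samples_leaf=5,          # min leaf size
    max_depth=10,                # max tree depth
    max_leaf_nodes=64,           # caps the number of leaves (a proxy for max node number)
    min_impurity_decrease=1e-3,  # min decrease in loss required to make a split
    ccp_alpha=0.01,              # cost-complexity pruning (related to, but not the same as, pruning with a validation set)
)
```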
Runtime
- Define: n examples, f features, d depth.
- Test time: O(d), where d \leq \log_{2} n for a reasonably balanced tree.
- Train time: each example appears in O(d) nodes, and at each node we consider O(f) features; the work per example per feature at a node is roughly constant when feature values are pre-sorted.
- Total cost: O(nfd). Compare with the size of the data matrix, n \cdot f: training costs only a factor of d more than reading the data.
Downside
- No additive structure: trees struggle when features interact additively with one another, because an additive (e.g. linear) boundary has to be approximated by many axis-aligned splits.
Recap
- Advantages
  - easy to explain
  - interpretable
  - handle categorical variables
  - fast
- Disadvantages
  - high variance
  - bad at additive structure
  - low predictive accuracy
Ensembling
- Let X_{i} be random variables that are independent and identically distributed (in practice the models we ensemble are often not independent).
- Var(X_{i}) = \sigma^{2}
- Var(\overline{X}) = Var\left(\tfrac{1}{n}\sum_{i}X_{i}\right) = \dfrac{\sigma^{2}}{n}
- If the X_{i} are not independent but pairwise correlated with correlation \rho:
  - Var(\overline{X}) = \rho\sigma^{2}+\dfrac{1-\rho}{n}\sigma^{2} (a short derivation follows below)
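As a quick check of the correlated case (assuming every pair X_{i}, X_{j} with i \neq j has covariance \rho\sigma^{2}):
Var(\overline{X}) = \dfrac{1}{n^{2}}\left(n\sigma^{2}+n(n-1)\rho\sigma^{2}\right) = \dfrac{\sigma^{2}}{n}+\dfrac{n-1}{n}\rho\sigma^{2} = \rho\sigma^{2}+\dfrac{1-\rho}{n}\sigma^{2}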
Ways to ensemble
- different algorithms
- different training sets (not very useful in practice, since we usually have only one)
- bagging (random forests)
- boosting (AdaBoost, XGBoost)
Bagging: Bootstrap Aggregation
- True population P; training set S drawn from P.
- Assume S = P, i.e. treat the training set as the population.
- Draw a bootstrap sample Z from S by sampling |S| points with replacement, and repeat (each bootstrap sample contains roughly 1-1/e \approx 63\% of the distinct examples in S).
- Take bootstrap samples Z_{1},\cdots,Z_{M} and train model G_{m} on Z_{m}.
- Aggregate: G(x) = \dfrac{1}{M}\sum^{M}_{m=1}G_{m}(x) (a minimal sketch follows below)
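A minimal bagging sketch under the assumptions above, using numpy for the bootstrap and scikit-learn regression trees as the base models G_{m}; the function names are illustrative, and X, y are assumed to be numpy arrays.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_trees_fit(X, y, M=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    models = []
    for _ in range(M):
        idx = rng.integers(0, n, size=n)   # bootstrap sample Z_m: n indices drawn with replacement
        model = DecisionTreeRegressor()
        model.fit(X[idx], y[idx])
        models.append(model)
    return models

def bagged_trees_predict(models, X):
    # G(x) = (1/M) sum_m G_m(x)
    return np.mean([m.predict(X) for m in models], axis=0)
```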
Why it works: bias-variance analysis
- Var(\overline{X}) = \rho\sigma^{2}+\dfrac{1-\rho}{n}\sigma^{2}
- Bootstrapping drives down \rho: each model is trained on a different resampled dataset, so the models' errors are less correlated with one another.
- Increasing the number of models M (the n in the formula) shrinks the second term: less variance, and adding more models does not cause overfitting.
- The bias increases slightly, because each model is trained on a random subsample rather than the full data.
DTs and bagging
- Decision trees are high-variance, low-bias models.
- That makes them a natural fit for bagging.
Random forests
- At each split, consider only a random fraction of the features (see the example below).
- This further decreases \rho: it decorrelates the individual trees.
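In scikit-learn this feature subsampling is the max_features parameter; a small hedged example (values arbitrary):

```python
from sklearn.ensemble import RandomForestClassifier

# n_estimators bagged trees; max_features="sqrt" makes each split consider only
# about sqrt(f) randomly chosen features, which decorrelates the trees.
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt")
```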
Boosting
- Decreases bias (whereas bagging decreases variance).
- Makes the model more additive: each new weak learner is added on top of the previous ones.
- Reweight the training examples, putting more weight on the examples the previous classifier got wrong.
- Determine the weight of each classifier by how many mistakes it made.
- AdaBoost: classifier weight \alpha_{m} \propto \log\left(\dfrac{1-err_{m}}{err_{m}}\right)
- Final model: G(x) = \sum^{M}_{m=1}\alpha_{m}G_{m}(x), taking the sign for classification (a minimal sketch follows below)
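A minimal sketch of discrete AdaBoost with decision stumps, assuming labels in \{-1, +1\}; the function names are mine, not a library API, and scikit-learn stumps are used as the weak learners.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=50):
    n = len(y)
    w = np.full(n, 1.0 / n)                    # example weights, start uniform
    stumps, alphas = [], []
    for _ in range(M):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        miss = (stump.predict(X) != y).astype(float)
        err = np.dot(w, miss) / w.sum()
        err = np.clip(err, 1e-10, 1 - 1e-10)   # avoid division by zero / log of zero
        alpha = np.log((1 - err) / err)        # classifier weight from its error
        w = w * np.exp(alpha * miss)           # up-weight the examples this stump got wrong
        w = w / w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # G(x) = sign(sum_m alpha_m G_m(x))
    scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(scores)
```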
