CS229: Decision Trees
Decision trees give one way to deal with nonlinear classification.
Greedy Top-down Recursive Partitioning
- Ask a sequence of questions that divide the whole input space into disjoint regions.
- Each question sorts the examples into one side or the other: compare a single feature against a threshold.
- For a parent region R_{p}:
  - look for a split S_{p}(j, t), where j indexes a feature and t is a threshold
  - the split divides R_{p} into R_{1} = \{x \in R_{p}: x_{j} < t\} and R_{2} = \{x \in R_{p}: x_{j} \geq t\}
How to choose splits
- Define L(R): the loss on region R.
- Define \hat{p}_{c} to be the proportion of examples in R that are of class c.
- Misclassification loss: L_{misclass}(R) = 1 - \max_{c}\hat{p}_{c}, i.e. the fraction of examples in R that do not belong to the majority class.
- Greedily pick the split that maximizes the decrease in loss: \max_{j,t}\; L(R_{p}) - (L(R_{1}) + L(R_{2})) (more carefully, the children's losses should be weighted by |R_{1}| and |R_{2}|).
Misclassification loss has issues
- It cannot reward a split that separates out part of the majority class: if both children still predict the majority class, the number of misclassified examples does not change, so the measured decrease in loss is zero even for a genuinely informative split.
- Cross-entropy loss: L_{cross}(R) = -\sum_{c}\hat{p}_{c}\log\hat{p}_{c} (from information theory).
- We can also use the Gini loss: L_{Gini}(R) = \sum_{c}\hat{p}_{c}(1-\hat{p}_{c}). A small code sketch of these losses and the split criterion follows below.
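The sketch below is my own illustration, not from the lecture: it computes the three per-region losses and does the greedy search over (feature, threshold), using the size-weighted version of the decrease in loss. All function names are assumptions made up here.

```python
import numpy as np

def class_proportions(y):
    # \hat{p}_c for each class c present in the region
    _, counts = np.unique(y, return_counts=True)
    return counts / counts.sum()

def misclass_loss(y):
    return 1.0 - class_proportions(y).max()      # L_misclass = 1 - max_c p_c

def cross_entropy_loss(y):
    p = class_proportions(y)
    return -np.sum(p * np.log2(p))               # L_cross = -sum_c p_c log p_c

def gini_loss(y):
    p = class_proportions(y)
    return np.sum(p * (1.0 - p))                 # L_gini = sum_c p_c (1 - p_c)

def best_split(X, y, loss=cross_entropy_loss):
    """Greedy search over (feature j, threshold t), maximizing the weighted decrease
    L(R_p) - (|R_1| L(R_1) + |R_2| L(R_2)) / |R_p|."""
    n, f = X.shape
    parent = loss(y)
    best = (None, None, 0.0)                     # (j, t, decrease)
    for j in range(f):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] < t], y[X[:, j] >= t]
            if len(left) == 0 or len(right) == 0:
                continue
            children = (len(left) * loss(left) + len(right) * loss(right)) / n
            decrease = parent - children
            if decrease > best[2]:
                best = (j, t, decrease)
    return best
```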
Geometric view
- Plot L(R) as a function of \hat{p} (two classes) and compare the losses.
- \hat{p}_{parent} = \dfrac{\hat{p}_{1}+\hat{p}_{2}}{2} (suppose an even split; the argument is the same with a weighted average).
- If L is strictly concave in \hat{p} (cross-entropy, Gini), then L(R_{parent}) \geq \dfrac{L(R_{1})+L(R_{2})}{2}, with strict inequality whenever \hat{p}_{1} \neq \hat{p}_{2}, so any informative split yields a decline in loss.
- For the misclassification loss, which is piecewise linear in \hat{p}, many splits show no decline at all. A numeric example follows.
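A small numeric illustration (my own numbers, not from the lecture): let the parent hold 90 examples of class A and 10 of class B, and split it evenly into R_{1} with 50 A / 0 B and R_{2} with 40 A / 10 B.
- Misclassification: L(R_{p}) = 0.1, L(R_{1}) = 0, L(R_{2}) = 0.2, so the average child loss is 0.1 and the measured decrease is zero.
- Cross-entropy (base 2): L(R_{p}) \approx 0.469, L(R_{1}) = 0, L(R_{2}) \approx 0.722, so the average child loss is \approx 0.361 and the split is rewarded with a decrease of \approx 0.108.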
Regression trees
- Instead of predicting the majority class, predict the mean of the targets in the region.
- Prediction in region R_{m}: \hat{y}_{m}=\sum _{i\in R_{m}}\dfrac{y_{i}}{|R_{m}|}
- Squared loss: L_{square}=\dfrac{\sum _{i\in R_{m}}\left( y_{i}-\hat{y}_{m}\right) ^{2}}{|R_{m}|} (a short sketch follows below)
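A short sketch of these two quantities, assuming y_region is a 1-D numpy array holding the targets of the examples in R_{m} (names are mine). The same greedy search sketched earlier applies unchanged with loss=squared_loss.

```python
import numpy as np

def region_prediction(y_region):
    # \hat{y}_m: the mean target over the region
    return y_region.mean()

def squared_loss(y_region):
    # L_square: mean squared error of the region against its own mean
    return np.mean((y_region - y_region.mean()) ** 2)
```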
Categorical variables
- Trees can split on categorical variables directly, but remember that the number of possible splits grows exponentially: with q categories there are 2^{q-1}-1 candidate binary splits.
- A fully grown decision tree is a high-variance model, which motivates regularization.
Regularization
- minimum leaf size
- maximum tree depth
- maximum number of nodes
- minimum decrease in loss
- pruning (grow the full tree, then prune back nodes using misclassification error on a validation set); a library example of these knobs follows below
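For reference, these heuristics map onto standard library parameters; a hedged example using scikit-learn's DecisionTreeClassifier (the particular values are arbitrary):

```python
from sklearn.tree import DecisionTreeClassifier

# Each keyword mirrors one of the heuristics above.
tree = DecisionTreeClassifier(
    min_samples_leaf=5,          # min leaf size
    max_depth=10,                # max tree depth
    max_leaf_nodes=64,           # caps the number of leaves (a proxy for max node number)
    min_impurity_decrease=1e-3,  # min decrease in loss required to make a split
    ccp_alpha=0.01,              # cost-complexity pruning (related to, but not the same as, pruning with a validation set)
)
```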
Runtime
- Define: n examples, f features, d depth.
- Test time: O(d), where d \leq \log_{2} n for a reasonably balanced tree.
- Train time: each example appears in O(d) nodes, and at each node we consider O(f) features; the work per example per feature at a node is roughly constant when feature values are pre-sorted.
- Total cost: O(nfd). Compare with the size of the data matrix, n \cdot f: training costs only a factor of d more than reading the data.
Downside
- No additive structure: trees struggle when features interact additively with one another, because an additive (e.g. linear) boundary has to be approximated by many axis-aligned splits.
Recap
- Advantages
  - easy to explain
  - interpretable
  - handle categorical variables
  - fast
- Disadvantages
  - high variance
  - bad at additive structure
  - low predictive accuracy
Ensembling
- Let X_{i} be random variables that are independent and identically distributed (in practice the models we ensemble are often not independent).
- Var(X_{i}) = \sigma^{2}
- Var(\overline{X}) = Var\left(\tfrac{1}{n}\sum_{i}X_{i}\right) = \dfrac{\sigma^{2}}{n}
- If the X_{i} are not independent but pairwise correlated with correlation \rho:
  - Var(\overline{X}) = \rho\sigma^{2}+\dfrac{1-\rho}{n}\sigma^{2} (a short derivation follows below)
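As a quick check of the correlated case (assuming every pair X_{i}, X_{j} with i \neq j has covariance \rho\sigma^{2}):
Var(\overline{X}) = \dfrac{1}{n^{2}}\left(n\sigma^{2}+n(n-1)\rho\sigma^{2}\right) = \dfrac{\sigma^{2}}{n}+\dfrac{n-1}{n}\rho\sigma^{2} = \rho\sigma^{2}+\dfrac{1-\rho}{n}\sigma^{2}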
Ways to ensemble
- different algorithms
- different training sets (not very useful in practice, since we usually have only one)
- bagging (random forests)
- boosting (AdaBoost, XGBoost)
Bagging: Bootstrap Aggregation
- True population P; training set S drawn from P.
- Assume S = P, i.e. treat the training set as the population.
- Draw a bootstrap sample Z from S by sampling |S| points with replacement, and repeat (each bootstrap sample contains roughly 1-1/e \approx 63\% of the distinct examples in S).
- Take bootstrap samples Z_{1},\cdots,Z_{M} and train model G_{m} on Z_{m}.
- Aggregate: G(x) = \dfrac{1}{M}\sum^{M}_{m=1}G_{m}(x) (a minimal sketch follows below)
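A minimal bagging sketch under the assumptions above, using numpy for the bootstrap and scikit-learn regression trees as the base models G_{m}; the function names are illustrative, and X, y are assumed to be numpy arrays.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_trees_fit(X, y, M=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    models = []
    for _ in range(M):
        idx = rng.integers(0, n, size=n)   # bootstrap sample Z_m: n indices drawn with replacement
        model = DecisionTreeRegressor()
        model.fit(X[idx], y[idx])
        models.append(model)
    return models

def bagged_trees_predict(models, X):
    # G(x) = (1/M) sum_m G_m(x)
    return np.mean([m.predict(X) for m in models], axis=0)
```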
Why it works: bias-variance analysis
- Var(\overline{X}) = \rho\sigma^{2}+\dfrac{1-\rho}{n}\sigma^{2}
- Bootstrapping drives down \rho: each model is trained on a different resampled dataset, so the models' errors are less correlated with one another.
- Increasing the number of models M (the n in the formula) shrinks the second term: less variance, and adding more models does not cause overfitting.
- The bias increases slightly, because each model is trained on a random subsample rather than the full data.
DTs and bagging
- Decision trees are high-variance, low-bias models.
- That makes them a natural fit for bagging.
Random forests
- At each split, consider only a random fraction of the features (see the example below).
- This further decreases \rho: it decorrelates the individual trees.
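In scikit-learn this feature subsampling is the max_features parameter; a small hedged example (values arbitrary):

```python
from sklearn.ensemble import RandomForestClassifier

# n_estimators bagged trees; max_features="sqrt" makes each split consider only
# about sqrt(f) randomly chosen features, which decorrelates the trees.
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt")
```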
Boosting
- Decreases bias (whereas bagging decreases variance).
- Makes the model more additive: each new weak learner is added on top of the previous ones.
- Reweight the training examples, putting more weight on the examples the previous classifier got wrong.
- Determine the weight of each classifier by how many mistakes it made.
- AdaBoost: classifier weight \alpha_{m} \propto \log\left(\dfrac{1-err_{m}}{err_{m}}\right)
- Final model: G(x) = \sum^{M}_{m=1}\alpha_{m}G_{m}(x), taking the sign for classification (a minimal sketch follows below)
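A minimal sketch of discrete AdaBoost with decision stumps, assuming labels in \{-1, +1\}; the function names are mine, not a library API, and scikit-learn stumps are used as the weak learners.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=50):
    n = len(y)
    w = np.full(n, 1.0 / n)                    # example weights, start uniform
    stumps, alphas = [], []
    for _ in range(M):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        miss = (stump.predict(X) != y).astype(float)
        err = np.dot(w, miss) / w.sum()
        err = np.clip(err, 1e-10, 1 - 1e-10)   # avoid division by zero / log of zero
        alpha = np.log((1 - err) / err)        # classifier weight from its error
        w = w * np.exp(alpha * miss)           # up-weight the examples this stump got wrong
        w = w / w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # G(x) = sign(sum_m alpha_m G_m(x))
    scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(scores)
```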
