CS229: Data Splits
Data splits, Models and Cross Validation
Bias and Variance
- Bias: error that comes from wrong assumptions about how to fit the data
- Variance: error that comes from sensitivity to the training set; a small change in the original dataset causes a great change in the fitted result
- High bias leads to underfitting; high variance leads to overfitting (a small sketch follows this list)
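To make this concrete, here is a small sketch (my own illustration, not from the lecture) that fits polynomials of different degrees to noisy synthetic data with scikit-learn; the dataset and the chosen degrees are made up.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Noisy samples from a sine curve; the true relationship is nonlinear.
rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0, 1, 20)).reshape(-1, 1)
y_train = np.sin(2 * np.pi * x_train).ravel() + rng.normal(0, 0.2, 20)
x_test = np.linspace(0, 1, 200).reshape(-1, 1)
y_test = np.sin(2 * np.pi * x_test).ravel()

for degree in [1, 4, 15]:  # too simple, about right, too flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(x_train))
    test_err = mean_squared_error(y_test, model.predict(x_test))
    print(f"degree {degree:>2}: train MSE = {train_err:.3f}, test MSE = {test_err:.3f}")

# degree 1 underfits (high bias: both errors are large);
# degree 15 overfits (high variance: tiny train error, much larger test error).
```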
Regularization
- used to prevent overfitting
- add a penalty term that gives the algorithm an incentive to keep the parameters small
- this penalty is called the regularization term
- its strength is controlled by a coefficient \(\lambda\)
- \(\lambda = 0\): not using any regularization
- the goal is to find a proper value of \(\lambda\)
- too big: forces all the parameters too close to 0 (underfitting)
- the SVM objective already minimizes \(\|w\|\) itself, so it comes with a built-in form of regularization
- when there is not enough data (a common rule of thumb asks for over 10x as many examples as parameters), it is easy to overfit
- find the value of \(\lambda\) in the middle that gives the best performance on the dataset (a small sketch follows this list)
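As a rough sketch of how \(\lambda\) behaves, a common regularized objective is \(J(\theta) = \sum_i \big(y^{(i)} - \theta^\top x^{(i)}\big)^2 + \lambda \|\theta\|^2\) (ridge regression). The scikit-learn example below is my own illustration on made-up data; the particular \(\lambda\) values are arbitrary, and scikit-learn calls the coefficient `alpha`.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge

# A handful of noisy points, so an unregularized degree-9 polynomial will overfit.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 15)).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.2, 15)

for lam in [0.0, 1e-3, 1.0, 100.0]:  # lambda = 0 means no regularization
    model = make_pipeline(PolynomialFeatures(degree=9), Ridge(alpha=lam))
    model.fit(x, y)
    theta = model.named_steps["ridge"].coef_
    print(f"lambda = {lam:>7}: largest |theta| = {np.abs(theta).max():.2f}")

# Larger lambda shrinks the parameters toward 0; if it is too large, the model underfits.
```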
How to choose a proper parameter/model
- divide the dataset into training, development (dev), and test sets
- train each kind of model on the training set to get its hypothesis
- measure the error on the dev set (not the training set, or the selection will overfit)
- to publish an unbiased result, evaluate the model you chose on the test set
Ratios for dividing the dataset
- model already fixed (no selection needed): 70% train / 30% test
- model not fixed: 60% train / 20% dev / 20% test
- when the dataset is big enough, the fraction of data in the dev and test sets shrinks (you don't need too much data to tell whether one model is much better than another, unless they are quite close)
The procedure above is called simple hold-out cross-validation (a sketch follows).
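A minimal sketch of simple hold-out validation with a 60/20/20 split, using scikit-learn on synthetic data; the two candidate models (logistic regression and an SVM) are arbitrary stand-ins.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Synthetic data just for illustration.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# 60/20/20 train/dev/test split, done in two steps.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Train each candidate model on the training set only.
candidates = {"logreg": LogisticRegression(max_iter=1000), "svm": SVC()}
dev_scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    dev_scores[name] = accuracy_score(y_dev, model.predict(X_dev))

# Pick the model by dev-set performance, then report the test-set score once.
best = max(dev_scores, key=dev_scores.get)
print("dev accuracies:", dev_scores)
print("test accuracy of chosen model:", accuracy_score(y_test, candidates[best].predict(X_test)))
```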
Don't make any decision based on the test set, or the reported result will no longer be unbiased.
When your dataset is small, apply k-fold CV (typically k = 10); see the sketch after this list.
- divide your dataset into k parts (folds)
- train on k-1 folds and test on the remaining one
- repeat k times, holding out a different fold each time
- compute the average error and choose the best model
- (optional) retrain the chosen model on all the data
- leave-one-out CV
- used when the number of examples is less than about 100
- the same idea as k-fold CV, with k equal to the number of examples
- expensive in time cost
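A minimal sketch of k-fold and leave-one-out CV, again with scikit-learn on a small synthetic dataset; the candidate models are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Small synthetic dataset, where a plain hold-out split would waste too much data.
X, y = make_classification(n_samples=80, n_features=10, random_state=0)

candidates = {"logreg": LogisticRegression(max_iter=1000), "svm": SVC()}

# 10-fold CV: each model is trained on 9 folds and scored on the held-out fold,
# repeated 10 times; the mean score is used to compare models.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=cv)
    print(f"{name}: mean 10-fold accuracy = {scores.mean():.3f}")

# Leave-one-out CV is the k = n special case (here n = 80 folds).
loo_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print(f"logreg: LOOCV accuracy = {loo_scores.mean():.3f}")
```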
Feature selection
- find the most important features: forward search (a sketch follows this list)
- begin with no features
- try adding each remaining feature to the model separately
- see which feature most improves the dev-set performance
- keep that feature, then repeat the search on top of it
- stop when adding features hurts the performance
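A minimal sketch of forward search, assuming a scikit-learn classifier, a synthetic dataset, and a simple stopping rule (stop when the dev accuracy no longer improves); all of these choices are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic data for illustration; only a few of the 15 features are informative.
X, y = make_classification(n_samples=500, n_features=15, n_informative=4, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(X, y, test_size=0.3, random_state=0)

def dev_score(feature_idx):
    """Train on the training set using only the given features, score on the dev set."""
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[:, feature_idx], y_train)
    return accuracy_score(y_dev, model.predict(X_dev[:, feature_idx]))

selected, best_score = [], 0.0
remaining = list(range(X.shape[1]))
while remaining:
    # Try each remaining feature on top of the ones already selected.
    trial_scores = {f: dev_score(selected + [f]) for f in remaining}
    f_best = max(trial_scores, key=trial_scores.get)
    if trial_scores[f_best] <= best_score:  # stop once adding a feature no longer helps
        break
    selected.append(f_best)
    remaining.remove(f_best)
    best_score = trial_scores[f_best]

print("selected features:", selected, "dev accuracy:", best_score)
```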
