Machine Learning Notes
Machine Learning
Classification
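sensitivity = recall (true positive rate): of the people who are truly positive, the proportion that the test detects, TP / (TP + FN)
specificity = true negative rate (not precision): of the people who are truly negative, the proportion that the test correctly reports as negative, TN / (TN + FP)
positive predictive value = precision: of the cases the test reports as positive, the proportion that are truly positive, TP / (TP + FP)
negative predictive value: of the cases the test reports as negative, the proportion that are truly negative, TN / (TN + FN)
When screening for a rare disease, we care about sensitivity: we do not want to miss anyone who actually has the disease (e.g. cancer).
When deciding whether to perform surgery, we care about specificity: we do not want people who do not need surgery to undergo it.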
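A minimal sketch of the four measures, computed from confusion-matrix counts. The function name diagnostic_metrics and the screening numbers are made up for illustration; they are not from any real study.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Compute the four rates from true/false positive/negative counts."""
    return {
        "sensitivity": tp / (tp + fn),  # recall, true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value, precision
        "npv": tn / (tn + fn),          # negative predictive value
    }


# Hypothetical screening of 1000 people, 10 of whom have the disease.
metrics = diagnostic_metrics(tp=9, fp=99, tn=891, fn=1)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these assumed counts the test has high sensitivity and specificity, yet the positive predictive value is low because the disease is rare: most positive results are false alarms.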
How to test a classifier
- leave-one-out cross-validation
  - suited to a small number of samples
- repeated random subsampling
  - suited to a larger set of data
  - e.g. 80% for training, 20% for testing
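Both testing schemes can be sketched as generators of (train, test) index lists. This is only an illustration in plain Python; the function names, the seed, and the sample counts are assumptions.

```python
import random


def leave_one_out(n_samples):
    """Yield n_samples splits; each holds out exactly one sample for testing."""
    for i in range(n_samples):
        train = [j for j in range(n_samples) if j != i]
        yield train, [i]


def repeated_random_subsampling(n_samples, n_trials, test_fraction=0.2, seed=0):
    """Yield n_trials independent random train/test splits."""
    rng = random.Random(seed)
    n_test = max(1, round(n_samples * test_fraction))
    for _ in range(n_trials):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        yield indices[n_test:], indices[:n_test]


loo_splits = list(leave_one_out(5))
random_splits = list(repeated_random_subsampling(10, n_trials=3))
print(loo_splits[0])
print(random_splits[0])
```

In both cases the classifier would be trained on each train list, scored on the matching test list, and the scores averaged over all splits.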
logistic regression
- designed explicitly for probability prediction
- dependent variable can only take on a finite set of values
- usually 0 or 1
- Finds weights for each feature
- positive implies variable positively correlated with outcome
- e.g. having scales is positively correlated with being a reptile
- negative implies variable negatively correlated with outcome
- e.g. the more legs an animal has, the less likely it is to be a reptile
- absolute magnitude related to strength of the correlation
The optimal weights are found by solving an optimization problem.
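A rough sketch of that optimization: batch gradient descent on the log loss, in plain Python. The toy data (features has_scales and number_of_legs, label 1 = reptile) is made up, and fit_logistic is a hypothetical helper, not a library function.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, learning_rate=0.1, n_steps=5000):
    """Fit weights and bias by batch gradient descent on the mean log loss."""
    n_features = len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(n_steps):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, label in zip(X, y):
            p = sigmoid(bias + sum(w * x for w, x in zip(weights, features)))
            error = p - label  # derivative of the log loss w.r.t. the logit
            for k, x in enumerate(features):
                grad_w[k] += error * x
            grad_b += error
        weights = [w - learning_rate * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(X)
    return weights, bias


# Made-up animals: [has_scales, number_of_legs], label 1 = reptile.
X = [[1, 0]] * 5 + [[1, 4]] * 2 + [[0, 0]] * 2 + [[0, 4]] * 5
y = [1, 1, 1, 1, 0] + [1, 0] + [1, 0] + [1, 0, 0, 0, 0]

weights, bias = fit_logistic(X, y)
print("weight for has_scales:", weights[0])
print("weight for number_of_legs:", weights[1])
```

On this toy data the weight for has_scales comes out positive and the weight for number_of_legs negative, matching the interpretation of the signs above.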
Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow
? Batch and Online Learning
Most important neural net architectures:
- feedforward neural nets
- convolutional nets
- recurrent nets
- long short-term memory (LSTM) nets
- autoencoders
- generative adversarial networks (GANs)
Jargon
the examples that the system uses to learn are called the training set
each training example is called a training instance (or sample)
e.g. flagging spam
- task T: flag spam among new emails
- experience E: the training data
- performance measure P: needs to be defined
- e.g. the ratio of correctly classified emails, i.e. accuracy
Supervised learning
- k Nearest Neighbor
- Linear Regression
- Logistic Regression
- Support Vector Machines
- Decision Trees and Random Forests
- Neural networks
Unsupervised learning
- Clustering
  - K-Means
  - DBSCAN
  - Hierarchical Cluster Analysis (HCA)
- Anomaly detection and novelty detection
  - One-class SVM
  - Isolation Forest
- Visualization and dimensionality reduction
  - PCA
  - kernel PCA
  - LLE
  - t-SNE
- Association rule learning
Semi-supervised learning
Reinforcement Learning
RL is a very different beast
- AlphaGo
define a policy
- rewards
- penalty
Batch and Online Learning
Batch learning
offline learning
Online learning
Can also be used to train systems on huge datasets that cannot fit in one machine's main memory
-> out-of-core learning (usually done offline)
important parameter:
- how fast the system should adapt to changing data -> learning rate
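A small sketch of the idea: the model sees the data one mini-batch at a time and never holds the whole dataset in memory, and the learning rate controls how quickly it adapts. The data source stream_batches is hypothetical (it generates points on the line y = 3x), and the two learning rates are arbitrary choices.

```python
def stream_batches(n_batches, batch_size):
    """Hypothetical data source that yields small chunks one at a time."""
    for b in range(n_batches):
        xs = [((b * batch_size + i) % 10) / 10 for i in range(batch_size)]
        ys = [3.0 * x for x in xs]  # the true relationship is y = 3x
        yield xs, ys


def online_fit(batches, learning_rate):
    """Update a single weight w after every mini-batch (gradient of the MSE)."""
    w = 0.0
    for xs, ys in batches:
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= learning_rate * grad
    return w


w_fast = online_fit(stream_batches(200, 5), learning_rate=0.5)
w_slow = online_fit(stream_batches(200, 5), learning_rate=0.001)
print(w_fast, w_slow)
```

After the same 200 batches the high learning rate has essentially reached the true slope of 3, while the low one is still far from it. The trade-off is that a high learning rate also makes the system quicker to forget old data and more sensitive to noisy batches.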
Instance-Based vs Model-Based Learning
summary:
- study the data
- select a model
- train it on the training data
- the learning algorithm searches for the model parameter values that minimize a cost function
- apply the model to make predictions on out-of-sample data
- this is called inference
Main Challenges of Machine Learning
Constraining a model to make it simpler and reduce the risk of overfitting is called regularization.
The amount of regularization to apply during learning can be controlled by a hyperparameter. A hyperparameter is a parameter of a learning algorithm (not of the model)
Insufficient Quantity of Training data
The Unreasonable Effectiveness of Data
- Data matters more than algorithms
Nonrepresentative Training Data
- Sampling bias
Poor-Quality Data
- some outliers
- missing features
Irrelevant Features
- Feature selection
- Feature extraction
- Creating new features
Overfitting the Training Data
- Simplify the model by selecting one with fewer parameters
- e.g. choose a linear model rather than a high-degree polynomial model
- or by reducing the number of attributes in the training data, or by constraining the model
- Gather more training data
- reduce the noise in the training data
Underfitting the Training Data
- Selecting a more powerful model, with more parameters
- Feeding better features to the learning algorithm (feature engineering)
- Reducing the constraints on the model
  - e.g. reducing the regularization hyperparameter
Testing and Validating
common to use 80% of the data for training and hold out 20% for testing
Hyperparameter Tuning and Model Selection
Holdout validation (?)
cross validation
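As a sketch of how cross-validation partitions the training data: the samples are cut into k folds, and each fold serves once as the validation set while the remaining folds are used for training. The helper k_fold_splits is hypothetical and written in plain Python; in practice the indices would usually be shuffled first.

```python
def k_fold_splits(n_samples, k):
    """Yield k (train, validation) index splits covering every sample once."""
    # The first (n_samples % k) folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


splits = list(k_fold_splits(10, k=3))
for train, validation in splits:
    print("train:", train, "validation:", validation)
```

Each candidate model or hyperparameter value would be trained k times, once per split, and compared by its average validation score. The test set stays untouched until the final evaluation.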
No Free Lunch Theorem
End-to-End Machine Learning Project
Frame the Problem
- Clarify the objective: How does the company expect to use and benefit from this model?
Select a Performance Measure
- RMSE
- MAE
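The two measures can be written in a few lines of plain Python. The values in y_true and y_pred are made up so that the arithmetic is easy to check by hand.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared difference."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [10.0, 20.0, 30.0, 60.0]  # three perfect predictions, one error of 20

print("RMSE:", rmse(y_true, y_pred))
print("MAE:", mae(y_true, y_pred))
```

Because RMSE squares the errors before averaging, the single large error weighs more heavily in it than in MAE, so MAE may be preferable when the data contains many outliers.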
Check the Assumptions
Create the Workspace
Download the Data
Take a quick Look at the Data Structure
housing.info()
housing.describe()
Create a Test Set
set aside a part of the data
avoid data snooping bias
How to create -> choose 20% of the dataset randomly (less if the dataset is very large)
Stratified sampling is important.
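A minimal sketch of stratified sampling in plain Python: the 20% test sample is drawn separately inside each stratum, so the test set keeps the same category proportions as the full dataset. The function stratified_split, the "low"/"high" labels, and the seed are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Split indices so each stratum contributes test_fraction of its members."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for index, label in enumerate(labels):
        by_stratum[label].append(index)

    train, test = [], []
    for indices in by_stratum.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test


# Hypothetical income categories: 70% "low", 30% "high".
labels = ["low"] * 70 + ["high"] * 30
train_idx, test_idx = stratified_split(labels)
n_high_in_test = sum(1 for i in test_idx if labels[i] == "high")
print(len(test_idx), n_high_in_test)
```

A purely random 20% sample could by chance contain far more or far fewer "high" instances than 30%, which would bias the evaluation. Drawing per stratum avoids that.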
Discover and Visualize the Data to Gain Insights
Looking for Correlations
The correlation coefficient only measures linear correlations. It may completely miss nonlinear relationships.
The magnitude of the correlation coefficient has nothing to do with the slope.
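Both points can be checked numerically with a hand-written Pearson coefficient. The helper pearson_r and the toy data are assumptions for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


xs = [-3, -2, -1, 0, 1, 2, 3]
r_steep = pearson_r(xs, [10 * x for x in xs])     # steep line
r_shallow = pearson_r(xs, [0.1 * x for x in xs])  # shallow line
r_parabola = pearson_r(xs, [x ** 2 for x in xs])  # y is fully determined by x

print(r_steep, r_shallow, r_parabola)
```

The steep and the shallow line both give a coefficient of 1 (up to rounding), so the slope does not matter, while the parabola gives 0 even though y depends entirely on x.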
Experimenting with Attribute Combinations
Prepare the Data for Machine Learning Algorithms
Write functions to do this, so that you can
- reproduce the transformations easily
- build a library of transformation functions that you can reuse in future projects
- use these functions in your live system
- easily try various kinds of transformations
The first step is to clean the dataset.
Data Cleaning
Here some attributes have missing values; you can
- Get rid of the corresponding districts
- Get rid of the whole attribute
- Set the missing value to some value
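The three options can be sketched with pandas. The tiny DataFrame below is a made-up stand-in for the housing data, and the column name total_bedrooms is an assumption about which attribute has the gaps.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the housing data; one district lacks total_bedrooms.
housing = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

# Option 1: get rid of the corresponding districts (rows).
option1 = housing.dropna(subset=["total_bedrooms"])

# Option 2: get rid of the whole attribute (column).
option2 = housing.drop("total_bedrooms", axis=1)

# Option 3: set the missing values to some value, here the median.
median = housing["total_bedrooms"].median()
option3 = housing.assign(total_bedrooms=housing["total_bedrooms"].fillna(median))

print(option3)
```

If option 3 is used, the median should be computed on the training set and saved, so the same value can later fill gaps in the test set and in new data.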
Scikit-Learn Design
- Consistency
- Estimators: e.g. imputer
- Transformers: e.g. imputer; fit_transform() is equivalent to fit() followed by transform(), but may be optimized and faster
- Predictors: e.g. LinearRegression; score() measures the quality of the predictions
- Inspection
- hyperparameters are accessible via public variables
- estimator’s learned parameters are also accessible via public instance variables with an underscore suffix
imputer.statistics_
- Nonproliferation of classes
- datasets are represented as NumPy arrays or SciPy sparse matrices, instead of homemade classes.
- Composition
- easy to create a Pipeline estimator from an arbitrary sequence of transformers
- Sensible defaults
- reasonable default values for most parameters
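These design principles can be seen in a short sketch, assuming scikit-learn and NumPy are installed. The data is made up (y = 2x with one missing x), and the step names "imputer" and "regressor" are arbitrary.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Made-up data: y = 2x, with one missing feature value.
X = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0]])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Estimator and transformer: fit() learns the median, transform() applies it.
imputer = SimpleImputer(strategy="median")
X_filled = imputer.fit_transform(X)

# Inspection: the hyperparameter is a public attribute, the learned
# parameter carries an underscore suffix.
print(imputer.strategy, imputer.statistics_)

# Composition: a Pipeline chains transformers and ends with a predictor.
model = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("regressor", LinearRegression()),
])
model.fit(X, y)
r2 = model.score(X, y)  # for regressors, score() returns R^2
print(r2)
```

The inputs and outputs are plain NumPy arrays throughout, and every step relies on default values except the imputation strategy.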
Handling Text and Categorical Attributes
convert categories (strings) to numbers
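Two common conversions, sketched in plain Python: an ordinal encoding (one integer per category) and a one-hot encoding (one binary column per category). The category values are made up. scikit-learn offers OrdinalEncoder and OneHotEncoder for the same purpose.

```python
# Made-up categorical attribute.
categories = ["INLAND", "NEAR BAY", "INLAND", "ISLAND"]

# Ordinal encoding: map each distinct category to an integer.
vocabulary = sorted(set(categories))
to_index = {name: i for i, name in enumerate(vocabulary)}
ordinal = [to_index[c] for c in categories]

# One-hot encoding: one binary column per category, exactly one 1 per row.
one_hot = [
    [1 if i == to_index[c] else 0 for i in range(len(vocabulary))]
    for c in categories
]

print(vocabulary)
print(ordinal)
print(one_hot)
```

An ordinal encoding implies an order and a distance between categories, which may mislead some algorithms when the categories have no natural order. One-hot encoding avoids that at the cost of more columns.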
This article is from cnblogs (博客园), author: miyasaka. When reposting, please cite the original link: https://www.cnblogs.com/kion/p/16844958.html