Machine Learning Notes

Machine Learning

Classification

\[\begin{aligned}
\text{sensitivity} &= \frac{\text{true positive}}{\text{true positive} + \text{false negative}} \\
\text{specificity} &= \frac{\text{true negative}}{\text{true negative} + \text{false positive}} \\
\text{positive predictive value} &= \frac{\text{true positive}}{\text{true positive} + \text{false positive}} \\
\text{negative predictive value} &= \frac{\text{true negative}}{\text{true negative} + \text{false negative}}
\end{aligned}\]

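sensitivity = recall: of the cases that are truly positive, the fraction that the test detects

specificity = true negative rate (not precision): of the cases that are truly negative, the fraction that the test correctly reports as negative

positive predictive value = precision: of the cases the test reports as positive, the fraction that are truly positive

negative predictive value: of the cases the test reports as negative, the fraction that are truly negative

When screening for a rare disease, we care about sensitivity: we do not want to miss anyone who actually has the disease (for example, cancer)

When deciding whether to operate, we care about specificity: we do not want people who do not need surgery to undergo it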
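The four ratios can be computed directly from the confusion-matrix counts. The sketch below is a minimal illustration; the function name and the screening numbers are made up for the example.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return the four ratios defined above from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # recall / true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # precision
        "npv": tn / (tn + fn),
    }


# Hypothetical screening of 1000 people, 10 of whom have the disease.
metrics = classification_metrics(tp=9, fp=99, tn=891, fn=1)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts, sensitivity and specificity are both 0.9, yet the positive predictive value is only about 0.08, because the disease is rare and false positives outnumber true positives.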

How to test a classifier

leave-one-out
- small number of samples
repeated random subsampling
- larger set of data
- e.g. 80% for training, 20% for testing
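Both testing schemes only differ in how the sample indices are split. A minimal sketch, assuming samples are addressed by index; the function names and the default `test_fraction` are illustrative choices, not from a library.

```python
import random


def leave_one_out(n):
    """Yield (train, test) index lists; each sample is the test set exactly once."""
    for i in range(n):
        yield [j for j in range(n) if j != i], [i]


def repeated_random_subsampling(n, n_trials, test_fraction=0.2, seed=0):
    """Yield n_trials random (train, test) splits, e.g. 80% / 20%."""
    rng = random.Random(seed)
    n_test = max(1, round(n * test_fraction))
    for _ in range(n_trials):
        indices = list(range(n))
        rng.shuffle(indices)
        yield indices[n_test:], indices[:n_test]


loo_splits = list(leave_one_out(5))
rrs_splits = list(repeated_random_subsampling(10, n_trials=3))
print(len(loo_splits), "leave-one-out splits")
print(rrs_splits[0])
```

In either case the classifier is trained on the train indices, scored on the test indices, and the scores are averaged over all splits.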

logistic regression

designed explicitly for probability prediction
- dependent variable can only take on a finite set of values
  - usually 0 or 1
Finds weights for each feature
- positive implies variable positively correlated with outcome
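  - e.g. having scales is positively correlated with being a reptile
- negative implies variable negatively correlated with outcome
  - e.g. the more legs an animal has, the less likely it is to be a reptile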
- absolute magnitude related to strength of the correlation

The optimal weights are found by solving an optimization problem
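As a rough illustration of that optimization, the sketch below fits the weights by plain batch gradient descent on the log-loss. The animal counts are invented so that scales help and legs hurt; the learning rate and epoch count are arbitrary choices, and a real project would use a library solver instead.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.2, epochs=5000):
    """Batch gradient descent on the mean log-loss; returns (weights, bias)."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Hypothetical counts: (has_scales, n_legs, number of reptiles, number of others).
cells = [(0, 0, 5, 5), (1, 0, 8, 2), (0, 4, 2, 8), (1, 4, 5, 5)]
X, y = [], []
for scales, legs, n_reptile, n_other in cells:
    X += [(scales, legs)] * (n_reptile + n_other)
    y += [1] * n_reptile + [0] * n_other

w, b = fit_logistic(X, y)
print("weight for has_scales:", round(w[0], 3))  # positive
print("weight for n_legs:", round(w[1], 3))      # negative
```

The sign of each weight matches the direction of the correlation, and `sigmoid(b + w . x)` is the predicted probability of the positive class.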

Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow

? Batch and Online Learning

Most important neural net architectures:

feedforward neural nets,
convolutional nets
recurrent nets,
long short-term memory (LSTM) nets,
autoencoders
generative adversarial networks (GANs).

Jargon

the examples that the system uses to learn are called the training set

each training example is called a training instance (or sample)

e.g. flagging spam

task T: flag spam for new emails
experience E is the training data
performance measure P needs to be defined
- ratio of correctly classified emails ~ accuracy

Supervised learning

k Nearest Neighbor
Linear Regression
Logistic Regression
Support Vector Machines
Decision Trees and Random Forests
Neural networks

Unsupervised learning

Clustering
- K Means
- DBSCAN
- Hierarchical Cluster Analysis (HCA)
Anomaly detection and novelty detection
- One-class SVM
- Isolation Forest
Visualization and dimensionality reduction
- PCA
- kernel PCA
- LLE
- t-SNE
Association rule learning

Semi-supervised learning

Reinforcement Learning

RL is a very different beast

AlphaGo

define a policy

rewards
penalty

Batch and Online Learning

Batch learning

offline learning

Online learning

Can be used to train systems on huge datasets that cannot fit in one machine's main memory

-> out-of-core learning (usually done offline)

important parameter:

how fast the system should adapt to changing data
- learning rate
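The effect of the learning rate can be seen with the simplest possible online learner: a single number updated one instance at a time (stochastic gradient descent on squared error for a constant model). The stream and both rates below are made up for illustration.

```python
def online_mean(stream, learning_rate):
    """Update a running estimate one instance at a time, never storing the data."""
    estimate = 0.0
    for x in stream:
        # Gradient step on 0.5 * (estimate - x) ** 2
        estimate += learning_rate * (x - estimate)
    return estimate


def drifting_stream():
    """Hypothetical data source whose true value jumps from 0 to 10 halfway through."""
    yield from [0.0] * 50
    yield from [10.0] * 50


fast = online_mean(drifting_stream(), learning_rate=0.5)
slow = online_mean(drifting_stream(), learning_rate=0.01)
print(round(fast, 2), round(slow, 2))
```

With the high rate the estimate has essentially reached the new value by the end of the stream, while with the low rate it has covered less than half of the jump. The trade-off is that a high rate also forgets old data quickly and reacts more strongly to noise.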

Instance-Based vs Model-Based Learning

summary:

study the data
select a model
train it on the training data
- the learning algorithm searches for the model parameter values that minimize a cost function
apply the model to make predictions on out-of-sample data
- inference
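The same workflow in miniature, assuming a one-feature linear model and a squared-error cost function; the data points are invented and the helper names are illustrative.

```python
def fit_line(xs, ys):
    """Return the (slope, intercept) that minimize the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def predict(params, x):
    slope, intercept = params
    return slope * x + intercept


# Training: the learning algorithm finds the parameter values.
params = fit_line([1, 2, 3, 4], [3, 5, 7, 9])

# Inference: apply the trained model to an instance it has not seen.
print(params, predict(params, 5))
```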

Main Challenges of Machine Learning

Constraining a model to make it simpler and reduce the risk of overfitting is called regularization.

The amount of regularization to apply during learning can be controlled by a hyperparameter. A hyperparameter is a parameter of a learning algorithm (not of the model)
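A small sketch of how such a hyperparameter constrains a model, assuming ridge-style (squared weight) regularization on a one-weight linear model without intercept. `alpha` is the hyperparameter: it is set before training and is not learned. The data are made up.

```python
def ridge_slope(xs, ys, alpha):
    """Minimize sum((y - w * x) ** 2) + alpha * w ** 2 over the single weight w."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)


xs = [1, 2, 3, 4]
ys = [2.1, 3.9, 6.2, 7.8]

w_none = ridge_slope(xs, ys, alpha=0)      # ordinary least squares
w_some = ridge_slope(xs, ys, alpha=3)
w_heavy = ridge_slope(xs, ys, alpha=30)
print(w_none, w_some, w_heavy)
```

The larger `alpha` is, the more the learned weight is pulled toward zero, which makes the model simpler at the cost of fitting the training data less closely.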

Insufficient Quantity of Training data

The Unreasonable Effectiveness of Data

Data matters more than algorithms

Nonrepresentative Training Data

Sampling bias

Poor-Quality Data

some outliers
missing features

Irrelevant Features

Feature selection
Feature extraction
Creating new features

Overfitting the Training Data

Simplify the model by selecting one with fewer parameters
- choose a linear model rather than a polynomial model
Reduce the number of attributes in the training data, or constrain the model
Gather more training data
Reduce the noise in the training data

Underfitting the Training Data

Selecting a more powerful model, with more parameters
Feeding better features to the learning algorithm (feature engineering)
Reducing the constraints on the model
- e.g. reducing the regularization hyperparameter

Testing and Validating

common to use 80% of the data for training and hold out 20% for testing

Hyperparameter Tuning and Model Selection

Holdout validation (?)

cross validation
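A minimal sketch of how k-fold cross-validation partitions the training data, assuming the samples have already been shuffled; the function name is illustrative. Each candidate model or hyperparameter value is trained k times, each time validated on a different fold, and the scores are averaged.

```python
def k_fold_indices(n, k):
    """Yield (train, validation) index lists for k folds of nearly equal size."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, validation
        start += size


folds = list(k_fold_indices(10, 3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

The test set stays out of this loop entirely; it is used once, at the end, to estimate the generalization error of the selected model.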

No Free Lunch Theorem

End-to-End Machine Learning Project

Frame the Problem

Clarify the objective: how does the company expect to use and benefit from this model?

Select a Performance Measure

RMSE
MAE
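Both measures are short enough to write out. The sketch below uses invented target and prediction values, with one deliberately large error.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [3, 5, 7, 9]
y_pred = [2, 5, 8, 13]  # the last prediction is off by 4
print(rmse(y_true, y_pred), mae(y_true, y_pred))
```

Because RMSE squares each error before averaging, the single large error pushes it (about 2.12) well above the MAE (1.5); RMSE is the more outlier-sensitive of the two.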

Check the Assumptions

Create the Workspace

Download the Data

Take a Quick Look at the Data Structure

housing.info()

housing.describe()

Create a Test Set

set aside a part of the data

avoid data snooping bias

How to create -> choose 20% of the dataset randomly (less if the dataset is very large)

Stratified sampling is important.
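A sketch of a stratified split in plain Python, assuming each instance already carries a stratum label (for example an income bracket); the labels, proportions, and function name are hypothetical. The same fraction is sampled from every stratum, so the test set keeps the proportions of the full dataset.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train, test) index lists that preserve the stratum proportions."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for i, label in enumerate(labels):
        by_stratum[label].append(i)
    train, test = [], []
    for indices in by_stratum.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test += indices[:n_test]
        train += indices[n_test:]
    return sorted(train), sorted(test)


labels = ["low"] * 50 + ["mid"] * 30 + ["high"] * 20
train, test = stratified_split(labels)
print({s: sum(labels[i] == s for i in test) for s in ("low", "mid", "high")})
```

A purely random 20% sample of a small dataset could by chance contain very few "high" instances; the stratified version always contains the expected share.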

Discover and Visualize the Data to Gain Insights

Looking for Correlations

The correlation coefficient only measures linear correlations. It may completely miss nonlinear relationships.

The magnitude of the correlation coefficient has nothing to do with the slope.
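Both points can be checked numerically with a hand-written Pearson coefficient. The three toy relationships below are invented for the demonstration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


xs = [-3, -2, -1, 0, 1, 2, 3]
steep = [10 * x for x in xs]     # slope 10
shallow = [0.1 * x for x in xs]  # slope 0.1
parabola = [x * x for x in xs]   # perfectly determined by x, but not linear

print(pearson(xs, steep), pearson(xs, shallow), pearson(xs, parabola))
```

The steep and shallow lines both have a coefficient of 1, so the slope plays no role, while the parabola has a coefficient of 0 even though y is a deterministic function of x.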

Experimenting with Attribute Combinations

Prepare the Data for Machine Learning Algorithms

Write functions to do this, so that you can:

reproduce the transformations easily
build a library of transformation functions that you can reuse in future projects
use these functions in your live system
try various kinds of transformations easily

The first step is to clean the data set.

Data Cleaning

Here some attributes have missing values. You can:

Get rid of the corresponding districts
Get rid of the whole attribute
Set the missing value to some value
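The three options, sketched on a tiny made-up table in plain Python. The column names are hypothetical, `None` marks a missing value, and the median is used as the fill value by assumption (zero or the mean would work the same way).

```python
from statistics import median

rows = [
    {"rooms": 6, "bedrooms": 2},
    {"rooms": 8, "bedrooms": None},  # missing value
    {"rooms": 5, "bedrooms": 1},
    {"rooms": 9, "bedrooms": 3},
]

# Option 1: get rid of the rows (districts) with a missing value.
dropped_rows = [r for r in rows if r["bedrooms"] is not None]

# Option 2: get rid of the whole attribute.
dropped_attr = [{k: v for k, v in r.items() if k != "bedrooms"} for r in rows]

# Option 3: set the missing values to some value, here the median.
fill = median(r["bedrooms"] for r in rows if r["bedrooms"] is not None)
filled = [{**r, "bedrooms": fill if r["bedrooms"] is None else r["bedrooms"]} for r in rows]

print(len(dropped_rows), dropped_attr[0], filled[1])
```

For option 3, the fill value should be computed on the training set and saved, so that the same value can be applied to the test set and to new data.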

Scikit-Learn Design

consistency
- Estimators
  - imputer
- Transformers
  - imputer
  - fit_transform() (equivalent to fit() followed by transform(), but sometimes optimized and faster)
- Predictors
  - LinearRegression
  - score() to measure the quality of the predictions
- Inspection
  - hyperparameters are accessible via public variables
  - estimator’s learned parameters are also accessible via public instance variables with an underscore suffix
    - imputer.statistics_
- Nonproliferation of classes
  - datasets are represented as NumPy arrays or SciPy sparse matrices, instead of homemade classes.
- Composition
  - easy to create a Pipeline estimator from an arbitrary sequence of transformers
- Sensible defaults
  - reasonable default values for most parameters
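To make the conventions concrete, here is a toy transformer written in plain Python that follows them. It is not Scikit-Learn's own imputer; the class name `MedianImputer` and its `missing_value` parameter are hypothetical, and it works on nested lists rather than NumPy arrays to stay dependency-free.

```python
from statistics import median


class MedianImputer:
    """Toy transformer that mimics the estimator / transformer conventions."""

    def __init__(self, missing_value=None):
        # Hyperparameter: set in the constructor, exposed as a public attribute.
        self.missing_value = missing_value

    def fit(self, X):
        # Learned parameters get a trailing underscore.
        self.statistics_ = [
            median(row[j] for row in X if row[j] != self.missing_value)
            for j in range(len(X[0]))
        ]
        return self  # returning self allows chaining

    def transform(self, X):
        return [
            [self.statistics_[j] if v == self.missing_value else v
             for j, v in enumerate(row)]
            for row in X
        ]

    def fit_transform(self, X):
        return self.fit(X).transform(X)


X = [[1, 10], [None, 20], [3, None], [5, 40]]
imputer = MedianImputer()
X_filled = imputer.fit_transform(X)
print(imputer.statistics_, X_filled)
```

Because every transformer exposes the same `fit()` and `transform()` methods, a pipeline can call them in sequence without knowing what each step does.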

Handling Text and Categorical Attributes

convert categories (strings) to numbers
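Two common ways to do the conversion, sketched in plain Python. The category strings and function names are hypothetical; libraries provide equivalent encoders.

```python
def ordinal_encode(values):
    """Map each category to an integer code; returns (codes, categories)."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    return [index[v] for v in values], categories


def one_hot_encode(values):
    """Map each category to a 0/1 vector with a single 1; returns (rows, categories)."""
    codes, categories = ordinal_encode(values)
    rows = [[1 if i == code else 0 for i in range(len(categories))] for code in codes]
    return rows, categories


values = ["INLAND", "NEAR BAY", "INLAND", "ISLAND"]
codes, categories = ordinal_encode(values)
one_hot, _ = one_hot_encode(values)
print(categories, codes, one_hot)
```

Integer codes imply an order and a distance between categories (code 0 looks closer to 1 than to 2), which is only appropriate for ordered categories. One-hot vectors avoid that assumption at the cost of one column per category.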

posted @ 2022-10-31 16:58 miyasaka