Machine Learning Notes

Machine Learning

Classification

\[\begin{aligned} &\text { sensitivity }=\frac{\text { true positive }}{\text { true positive }+\text { false negative }} \\ &\text { specificity }=\frac{\text { true negative }}{\text { true negative }+\text { false positive }} \\ &\text { positive predictive value }=\frac{\text { true positive }}{\text { true positive }+\text { false positive }} \\ &\text { negative predictive value }=\frac{\text { true negative }}{\text { true negative }+\text { false negative }} \end{aligned} \]

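sensitivity = recall (true positive rate): of the cases that are truly positive, the fraction the test detects

specificity = true negative rate: of the cases that are truly negative, the fraction the test correctly reports as negative

positive predictive value = precision: of the cases the test reports as positive, the fraction that are truly positive

negative predictive value: of the cases the test reports as negative, the fraction that are truly negative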

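When screening for a rare disease, we care about sensitivity: we do not want to miss anyone who has cancer.

When deciding whether to operate, we care about specificity: we do not want people who do not need surgery to undergo it.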
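The four rates can be computed directly from the counts in a confusion matrix. The sketch below is only an illustration: the function name `classification_rates` and the screening numbers are made up.

```python
def classification_rates(tp, fp, tn, fn):
    """Return the four rates defined above from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # recall, true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value, precision
        "npv": tn / (tn + fn),          # negative predictive value
    }

# Hypothetical screening of 1,000 people, 10 of whom have the disease.
rates = classification_rates(tp=9, fp=99, tn=891, fn=1)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts, sensitivity and specificity are both 0.9, yet the positive predictive value is only about 0.083, because the disease is rare and false positives outnumber true positives.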

How to test a classifier

  • leave-one-out cross-validation

    • small number of samples
  • repeated random subsampling

    • larger set of data
    • e.g., 80% for training, 20% for testing
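Both testing schemes come down to how the sample indices are split. A minimal sketch using only the standard library follows; the function names, the seed, and the dataset sizes are assumptions for illustration.

```python
import random

def leave_one_out(n):
    """Yield (train_indices, test_index) once for each of the n samples."""
    for i in range(n):
        train = [j for j in range(n) if j != i]
        yield train, i

def repeated_random_subsampling(n, n_trials, test_fraction=0.2, seed=0):
    """Yield n_trials independent random (train_indices, test_indices) splits."""
    rng = random.Random(seed)
    n_test = max(1, round(n * test_fraction))
    for _ in range(n_trials):
        shuffled = rng.sample(range(n), n)  # a random permutation of the indices
        yield shuffled[n_test:], shuffled[:n_test]

loo_splits = list(leave_one_out(5))
random_splits = list(repeated_random_subsampling(100, n_trials=3))
print(loo_splits[2])
print([(len(train), len(test)) for train, test in random_splits])
```

In either scheme the classifier would be trained on the training indices, evaluated on the held-out ones, and the results averaged over all splits.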

logistic regression

  • designed explicitly for probability prediction
    • dependent variable can only take on a finite set of values
      • usually 0 or 1
  • Finds weights for each feature
    • positive implies variable positively correlated with outcome
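      • e.g., having scales is positively correlated with being a reptile
    • negative implies variable negatively correlated with outcome
      • e.g., the more legs an animal has, the less likely it is to be a reptile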
    • absolute magnitude related to strength of the correlation

The optimal weights are found by solving an optimization problem.
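As a rough illustration of weights found by optimization, the sketch below fits a logistic regression by plain batch gradient descent on the mean log loss. Everything in it is an assumption for the example: the animals are made up, and real libraries use more sophisticated solvers.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, learning_rate=0.1, n_steps=20000):
    """Minimize the mean log loss by batch gradient descent.

    Returns (bias, weights) with one weight per feature.
    """
    n, d = len(X), len(X[0])
    bias, weights = 0.0, [0.0] * d
    for _ in range(n_steps):
        grad_b, grad_w = 0.0, [0.0] * d
        for x, target in zip(X, y):
            p = sigmoid(bias + sum(w * v for w, v in zip(weights, x)))
            err = p - target  # derivative of the log loss w.r.t. the logit
            grad_b += err
            for j in range(d):
                grad_w[j] += err * x[j]
        bias -= learning_rate * grad_b / n
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
    return bias, weights

# Made-up animals. Features: (has_scales, number_of_legs); label 1 = reptile.
# Each entry is (features, number of reptiles among 4 animals like this).
cells = [((1, 0), 3), ((1, 4), 2), ((0, 0), 2), ((0, 4), 1)]
X, y = [], []
for features, n_reptiles in cells:
    for k in range(4):
        X.append(features)
        y.append(1 if k < n_reptiles else 0)

bias, (w_scales, w_legs) = fit_logistic(X, y)
print(f"scales: {w_scales:+.2f}, legs: {w_legs:+.2f}")
```

On this toy data the weight for scales comes out positive and the weight for legs negative, matching the interpretation of the signs given above.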


Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow

? Batch and Online Learning

Most important neural net architectures:

  • feedforward neural nets,
  • convolutional nets
  • recurrent nets,
  • long short-term memory (LSTM) nets,
  • autoencoders
  • generative adversarial networks (GANs).

Jargon

The examples that the system uses to learn are called the training set.

Each training example is called a training instance (or sample).

e.g. flagging spam

  • task T: flag spam among new emails
  • experience E: the training data
  • performance measure P: needs to be defined
    • e.g., the ratio of correctly classified emails, called accuracy

Supervised learning

  • k Nearest Neighbor
  • Linear Regression
  • Logistic Regression
  • Support Vector Machines
  • Decision Trees and Random Forests
  • Neural networks

Unsupervised learning

  • Clustering

    • K Means
    • DBSCAN
    • Hierarchical Cluster Analysis (HCA)
  • Anomaly detection and novelty detection

    • One-class SVM
    • Isolation Forest
  • Visualization and dimensionality reduction

    • PCA
    • kernel PCA
    • LLE
    • t-SNE
  • Association rule learning

Semi-supervised learning

Reinforcement Learning

RL is a very different beast

  • AlphaGo

define a policy

  • rewards
  • penalty

Batch and Online Learning

Batch learning

offline learning

Online learning

Can be used to train systems on huge datasets that cannot fit in one machine's main memory

-> out-of-core learning (usually done offline)

important parameter:

  • how fast they should adapt to changing data
    • -> learning rate
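The effect of the learning rate can be seen with a deliberately tiny online learner. In the sketch below the "model" is a single constant, updated by one gradient step per incoming instance; the stream, the function names, and the two rates are assumptions chosen for illustration.

```python
def online_mean(stream, learning_rate):
    """Track a stream with a one-parameter model, one gradient step per instance.

    The model predicts a constant. Each step follows the gradient of half the
    squared error, which moves the estimate toward the new value by a
    fraction `learning_rate` of the gap.
    """
    estimate = 0.0
    for value in stream:
        estimate -= learning_rate * (estimate - value)
    return estimate

def drifting_stream():
    """Hypothetical data source whose target shifts from 10 to 20."""
    for _ in range(50):
        yield 10.0
    for _ in range(50):
        yield 20.0

fast = online_mean(drifting_stream(), learning_rate=0.5)
slow = online_mean(drifting_stream(), learning_rate=0.01)
print(f"high learning rate: {fast:.2f}, low learning rate: {slow:.2f}")
```

The high learning rate ends up very close to the new target of 20, while the low one is still far behind. The flip side, not shown here, is that a high learning rate also reacts strongly to noisy or bad instances.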

Instance-Based vs Model-Based Learning

summary:

  • Studied the data
  • Selected a model
  • Trained it on the training data
    • the learning algorithm searched for the model parameter values that minimize a cost function
  • Applied the model to make predictions on out-of-sample data
    • this is called inference

Main Challenges of Machine Learning

Constraining a model to make it simpler and reduce the risk of overfitting is called regularization.

The amount of regularization to apply during learning can be controlled by a hyperparameter. A hyperparameter is a parameter of the learning algorithm (not of the model): it is set before training and stays constant during training.
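A one-dimensional ridge regression makes the distinction concrete. In this sketch, which assumes a model y = w * x with no intercept and made-up data, `alpha` is the hyperparameter chosen before training and `w` is the model parameter that training finds.

```python
def ridge_slope(xs, ys, alpha):
    """Slope w minimizing sum((y - w*x)^2) + alpha * w^2 (no intercept).

    Setting the derivative to zero gives w = sum(x*y) / (sum(x*x) + alpha).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # exactly y = 2x

for alpha in (0.0, 14.0, 1000.0):
    print(f"alpha={alpha:7.1f} -> w={ridge_slope(xs, ys, alpha):.3f}")
```

With `alpha = 0` the fit recovers the slope 2 exactly; larger values shrink the slope toward zero, which gives a simpler, more constrained model.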

Insufficient Quantity of Training data

The Unreasonable Effectiveness of Data

  • Data matters more than algorithms

Nonrepresentative Training Data

  • Sampling bias

Poor-Quality Data

  • some outliers
  • missing features

Irrelevant Features

  • Feature selection
  • Feature extraction
  • Creating new features

Overfitting the Training Data

  • Simplify the model by selecting one with fewer parameters
    • e.g., choose a linear model rather than a high-degree polynomial model
  • Reduce the number of attributes in the training data, or constrain the model
  • Gather more training data
  • Reduce the noise in the training data

Underfitting the Training Data

  • Selecting a more powerful model, with more parameters

  • Feeding better features to the learning algorithm (feature engineering)

  • Reducing the constraints on the model

    • e.g., reducing the regularization hyperparameter

Testing and Validating

It is common to use 80% of the data for training and hold out 20% for testing.

Hyperparameter Tuning and Model Selection

Holdout validation (?)

cross validation

No Free Lunch Theorem

End-to-End Machine Learning Project

Frame the Problem

  • Clarify the objective: How does the company expect to use and benefit from this model?

Select a Performance Measure

  • RMSE
  • MAE
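Both measures are short enough to write out by hand. The sketch below uses made-up values purely to show how the two react to a single large error.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error: square the errors, average, take the root."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error: average of the absolute errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical predictions: three small errors and one outlier.
y_true = [0.0, 0.0, 0.0, 0.0]
y_pred = [1.0, 1.0, 1.0, 5.0]
print(f"RMSE={rmse(y_true, y_pred):.3f}  MAE={mae(y_true, y_pred):.3f}")
```

Here the MAE is 2 while the RMSE is the square root of 7, about 2.646: squaring gives large errors more weight, so RMSE is more sensitive to outliers than MAE.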

Check the Assumptions

Create the Workspace

Download the Data

Take a Quick Look at the Data Structure

housing.info()

housing.describe()

Create a Test Set

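Set aside a part of the data as the test set.

This avoids data snooping bias.

How to create it -> choose 20% of the dataset randomly (less if the dataset is very large)

Stratified sampling is important: each stratum should appear in the test set in the same proportion as in the full dataset.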
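One way to picture stratified sampling is to split each stratum separately and then merge the pieces. The sketch below does this with the standard library only; the function name, the seed, and the "low"/"high" categories are assumptions for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_indices, test_indices), sampling each stratum separately."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for index, label in enumerate(labels):
        by_stratum[label].append(index)
    train, test = [], []
    for indices in by_stratum.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test

# Hypothetical income categories: 80 "low" districts and 20 "high" ones.
labels = ["low"] * 80 + ["high"] * 20
train_idx, test_idx = stratified_split(labels)
n_high_in_test = sum(labels[i] == "high" for i in test_idx)
print(len(test_idx), n_high_in_test)
```

The test set keeps the 80/20 mix of the full dataset (16 "low" and 4 "high" districts), which a purely random 20% draw would only match on average.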

Discover and Visualize the Data to Gain Insights

Looking for Correlations

The correlation coefficient only measures linear correlations. It may completely miss nonlinear relationships.

The magnitude of the correlation coefficient has nothing to do with the slope.
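Both points can be checked numerically. The sketch below computes Pearson's r by hand on made-up data: two straight lines with very different slopes, and a parabola.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

xs = [-3, -2, -1, 0, 1, 2, 3]
steep = pearson(xs, [10 * x for x in xs])     # y = 10x
shallow = pearson(xs, [0.1 * x for x in xs])  # y = 0.1x
parabola = pearson(xs, [x * x for x in xs])   # y = x^2
print(steep, shallow, parabola)
```

The two lines both give a coefficient of 1 (up to rounding) despite a hundredfold difference in slope, while the parabola gives 0 even though y is completely determined by x.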

Experimenting with Attribute Combinations

Prepare the Data for Machine Learning Algorithms

Write functions to do this, so that you can

  • reproduce the transformations easily on any dataset
  • build a library of transformation functions that you can reuse in future projects
  • use these functions in your live system
  • try various kinds of transformations easily

The first step is to clean the dataset.

Data Cleaning

Here some attributes have missing values. You can

  • Get rid of the corresponding districts
  • Get rid of the whole attribute
  • Set the missing value to some value
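The three options can be sketched on a handful of records. The data below is made up, `None` stands in for a missing value, and the median is just one possible fill value.

```python
from statistics import median

# Hypothetical districts; None marks a missing total_bedrooms value.
districts = [
    {"total_rooms": 880, "total_bedrooms": 129},
    {"total_rooms": 7099, "total_bedrooms": None},
    {"total_rooms": 1467, "total_bedrooms": 190},
    {"total_rooms": 1274, "total_bedrooms": 235},
]
attr = "total_bedrooms"

# Option 1: get rid of the districts with a missing value.
dropped_rows = [d for d in districts if d[attr] is not None]

# Option 2: get rid of the whole attribute.
dropped_attr = [{k: v for k, v in d.items() if k != attr} for d in districts]

# Option 3: set missing values to some value, here the median of the known ones.
fill = median(d[attr] for d in districts if d[attr] is not None)
imputed = [{**d, attr: fill if d[attr] is None else d[attr]} for d in districts]

print(len(dropped_rows), dropped_attr[0], imputed[1])
```

With option 3, the fill value should be computed on the training set and kept, so that the same value can later be applied to the test set and to new data.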

Scikit-Learn Design

  • consistency
    • Estimators
      • imputer
    • Transformers
      • imputer
      • fit_transform() (equivalent to fit() followed by transform(), but sometimes optimized and faster)
    • Predictors
      • LinearRegression
      • score() to measure the quality of the predictions
    • Inspection
      • hyperparameters are accessible via public variables
      • estimator’s learned parameters are also accessible via public instance variables with an underscore suffix
        • imputer.statistics_
    • Nonproliferation of classes
      • datasets are represented as NumPy arrays or SciPy sparse matrices, instead of homemade classes
    • Composition
      • easy to create a Pipeline estimator from an arbitrary sequence of transformers
    • Sensible defaults
      • reasonable default values for most parameters
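The design principles above can be seen in a few lines of Scikit-Learn. This is a small sketch on a made-up dataset, assuming NumPy and Scikit-Learn are installed; the step names "imputer" and "regressor" are arbitrary labels.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Tiny made-up dataset following y = 2 * x, with one feature value missing.
X = np.array([[1.0], [2.0], [np.nan], [4.0], [3.0]])
y = np.array([2.0, 4.0, 5.0, 8.0, 6.0])

# Estimator and transformer: fit() learns the median, transform() applies it.
imputer = SimpleImputer(strategy="median")
X_filled = imputer.fit_transform(X)

# Inspection: hyperparameters are public attributes,
# learned parameters carry an underscore suffix.
print(imputer.strategy)
print(imputer.statistics_)

# Composition: a Pipeline chains transformers and ends with a predictor.
model = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("regressor", LinearRegression()),
])
model.fit(X, y)
r2 = model.score(X, y)  # predictors expose score(); R^2 for regressors
print(r2)
```

Note that `X` stays a plain NumPy array throughout, and that `SimpleImputer` and `LinearRegression` are used with their defaults apart from the imputation strategy.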

Handling Text and Categorical Attributes

convert categories (strings) to numbers
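Two common conversions are ordinal encoding and one-hot encoding. The sketch below writes both out by hand; the category names are hypothetical examples.

```python
# Hypothetical categorical attribute with three distinct values.
categories = ["INLAND", "NEAR BAY", "INLAND", "ISLAND"]

# Ordinal encoding: map each distinct category to an integer.
vocabulary = sorted(set(categories))
index_of = {name: i for i, name in enumerate(vocabulary)}
ordinal = [index_of[name] for name in categories]

# One-hot encoding: one binary attribute per category.
one_hot = [
    [1 if i == index_of[name] else 0 for i in range(len(vocabulary))]
    for name in categories
]

print(vocabulary)
print(ordinal)
print(one_hot)
```

Ordinal codes suggest that nearby numbers mean similar categories, which is rarely true for unordered categories; one-hot vectors avoid implying any order.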

posted @ 2022-10-31 16:58  miyasaka