6001 Cheat Sheet

Below is a comprehensive Machine Learning Cheat Sheet
covering every knowledge point from the earlier 8-chapter lecture summaries and the 150 practice prediction questions.


1. Machine Learning Taxonomy

Supervised Learning

  • Uses labeled data (X, y)
  • Tasks: classification, regression
  • Examples: spam detection, churn prediction, house price prediction

Unsupervised Learning

  • Uses only X (no labels)
  • Tasks: clustering, dimensionality reduction
  • Examples: customer segmentation, PCA, anomaly detection

2. Classification vs Regression

Classification

  • Output is discrete (class labels)
  • Examples: spam / not spam, fraud detection

Regression

  • Output is continuous
  • Examples: price, temperature

3. Important Algorithms

Logistic Regression

  • Binary classification
  • Uses sigmoid → probability (0–1)
  • Linear decision boundary
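
A minimal pure-Python sketch of the sigmoid at the heart of logistic regression (no library assumed):

```python
import math

def sigmoid(z):
    """Map any real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# The linear decision boundary sits where sigmoid(z) = 0.5, i.e. z = 0.
print(sigmoid(0))                          # 0.5
print(sigmoid(4) > 0.5, sigmoid(-4) < 0.5) # True True
```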

KNN

  • Non-parametric, lazy learner
  • Sensitive to feature scaling
  • Predicts by nearest neighbors (majority vote / average)
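
The vote-by-nearest-neighbors idea can be sketched in a few lines of plain Python (a toy illustration, not a production KNN):

```python
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among the k nearest training points."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(xi, x)), yi)  # squared Euclidean distance
        for xi, yi in zip(X_train, y_train)
    )
    top_k = [label for _, label in dists[:k]]
    return Counter(top_k).most_common(1)[0][0]

X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(X, y, (0.5, 0.5)))  # "A": the 3 nearest points are all class A
```

Because raw distances drive the vote, an unscaled feature with a large range dominates — which is exactly why KNN needs feature scaling.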

Decision Tree

  • Splits using Gini impurity / entropy
  • Easy to interpret
  • Overfits if not pruned
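
Gini impurity, the split criterion named above, is simple to compute directly:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2). 0 = pure node, 0.5 = worst binary split."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini([1, 1, 1, 1]))  # 0.0  (pure node)
print(gini([1, 1, 0, 0]))  # 0.5  (maximally impure binary node)
```

A tree chooses the split whose children have the lowest weighted impurity.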

Random Forest

  • Bagging ensemble of decision trees
  • Reduces variance
  • More stable than a single tree

K-Means

  • Requires K to be specified in advance
  • Minimizes within-cluster distance
  • Sensitive to feature scaling
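
One Lloyd iteration of K-Means (assign, then recompute means) on 1-D data can be sketched as follows — a toy version that assumes every centroid attracts at least one point:

```python
def kmeans_step(points, centroids):
    """One K-Means iteration: assign each point to its nearest centroid,
    then return the recomputed cluster means."""
    clusters = {c: [] for c in centroids}
    for p in points:
        nearest = min(centroids, key=lambda c: abs(p - c))
        clusters[nearest].append(p)
    return [sum(ps) / len(ps) for ps in clusters.values()]

points = [1.0, 2.0, 10.0, 11.0]
print(kmeans_step(points, [1.0, 10.0]))  # [1.5, 10.5]
```

Repeating this step until the centroids stop moving is the whole algorithm.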

PCA

  • Unsupervised dimensionality reduction
  • Keeps the directions of highest variance
  • Decorrelates features
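
A small demonstration of "keeps the directions of highest variance", assuming numpy is available. Points lying exactly on the line y = x have all their variance along [1, 1]/√2, and PCA recovers that direction:

```python
import numpy as np

# Perfectly correlated 2-D data: all variance lies along the line y = x.
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]])
Xc = X - X.mean(axis=0)                 # center the data
cov = np.cov(Xc, rowvar=False)          # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues ascending
pc1 = eigvecs[:, -1]                    # principal component = top eigenvector
print(np.round(np.abs(pc1), 4))         # [0.7071 0.7071], i.e. [1, 1]/sqrt(2)
```

Projecting onto `pc1` reduces the data to one dimension with no loss of variance here, and the second eigenvalue being 0 shows the features were fully correlated.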

4. Key Concepts & Terminology

Sigmoid

  • Output range: 0–1

Decision Boundary

  • The line/surface separating classes

Centroid (K-Means)

  • Cluster center

False Positive / False Negative

FP = predicted positive but actually negative (a false alarm)
FN = predicted negative but actually positive (a miss)


5. Model Evaluation

Confusion Matrix

              Pred +   Pred –
  Actual +    TP       FN
  Actual –    FP       TN

Metrics

Accuracy

Not reliable under class imbalance

Precision

TP / (TP + FP)
How many predicted positives are truly positive

Recall

TP / (TP + FN)
How many actual positives are caught

F1-score

Harmonic mean of precision and recall

AUC-ROC

Probability that the model ranks a random positive above a random negative

RMSE / MSE (Regression)

Error metrics for regression
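
The classification formulas above can be computed directly from the four confusion-matrix counts:

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard classification metrics from confusion-matrix counts."""
    accuracy  = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)          # of predicted positives, how many correct
    recall    = tp / (tp + fn)          # of actual positives, how many caught
    f1        = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# 8 true positives, 2 false alarms, 2 misses, 8 true negatives:
print(classification_metrics(tp=8, fp=2, fn=2, tn=8))
```

With these counts all four metrics come out to 0.8; on imbalanced data they diverge, which is why accuracy alone is unreliable.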


6. Overfitting vs Underfitting

Overfitting

  • High training accuracy, low test accuracy
  • Model too complex

Underfitting

  • Low training and low test accuracy
  • Model too simple

7. How to Reduce Overfitting 如何减少过拟合

✓ Regularization (L1 / L2) 正则化
✓ Early stopping 提前停止
✓ More training data 更多数据
✓ Cross-validation 交叉验证
✓ Simpler model 简单模型
✗ Increasing model complexity 会更严重


8. Regularization

L1 (Lasso)

  • Shrinks some weights exactly to zero → feature selection

L2 (Ridge)

  • Shrinks weights toward zero, but not exactly to zero

Purpose: prevent overfitting
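
The shrinking effect of L2 can be seen with ridge regression's closed-form solution, w = (XᵀX + λI)⁻¹Xᵀy — a sketch assuming numpy is available:

```python
import numpy as np

def ridge_weights(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam * I)^(-1) X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([3.0, 3.0, 7.0, 7.0])          # exactly y = x1 + x2
w_ols   = ridge_weights(X, y, lam=0.0)      # ordinary least squares: [1, 1]
w_ridge = ridge_weights(X, y, lam=10.0)     # heavily regularized
print(np.linalg.norm(w_ridge) < np.linalg.norm(w_ols))  # True: weights shrink
```

Larger λ shrinks the weight vector further, trading a little bias for lower variance — the mechanism behind "prevent overfitting".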


9. Feature Engineering

Scaling / Normalization

  • Needed for KNN, K-Means, logistic regression

One-Hot Encoding

  • For categorical features

Handling Missing Values

  • Mean/median imputation
  • KNN imputer
  • Drop columns/rows (when reasonable)
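
Mean imputation, the first option above, is a one-liner over the observed values:

```python
def impute_mean(values):
    """Replace None entries with the mean of the observed (non-None) values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

print(impute_mean([1.0, None, 3.0]))  # [1.0, 2.0, 3.0]
```

Median imputation is the same pattern with the median, and is more robust to outliers.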

10. Cross-Validation

K-Fold CV

  • Split the data into K folds
  • Train K times, each fold serving once as the validation set
  • More stable evaluation

Train/Validation/Test Split

  • Train: fit the model
  • Validation: tune hyperparameters
  • Test: final evaluation
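
The K-fold splitting logic can be sketched with plain index arithmetic (no shuffling, for clarity):

```python
def kfold_indices(n, k):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start, folds = 0, []
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

splits = list(kfold_indices(n=6, k=3))
print([val for _, val in splits])  # [[0, 1], [2, 3], [4, 5]]
```

Averaging the K validation scores gives the stable estimate the bullets describe.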

11. Data Leakage

Avoid:

  • Scaling before the train/test split
  • Using target-derived (or future-looking) features
  • Fitting preprocessing on the full dataset
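
The leakage-free pattern is: fit the preprocessing on the training split only, then apply it to the test split. A minimal sketch with standardization:

```python
def fit_scaler(train):
    """Learn mean and std from TRAINING data only (avoids leakage)."""
    mean = sum(train) / len(train)
    var = sum((x - mean) ** 2 for x in train) / len(train)
    return mean, var ** 0.5

def transform(values, mean, std):
    """Apply the already-fitted scaler to any split."""
    return [(x - mean) / std for x in values]

train, test = [0.0, 10.0], [5.0]
mean, std = fit_scaler(train)       # fit on the train split only...
print(transform(test, mean, std))   # ...then apply to test: [0.0]
```

Fitting the scaler on train and test together would leak test-set statistics into training, inflating the evaluation.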

12. Imbalanced Data

To handle imbalance:

  • Class weights
  • Oversampling (SMOTE)
  • Undersampling
  • Focus on precision, recall, and the PR curve

Avoid using accuracy as the only metric
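
The class-weight option is usually the "balanced" heuristic (the same formula scikit-learn's `class_weight='balanced'` uses): weight = n_samples / (n_classes × class_count), so rarer classes get larger weights.

```python
from collections import Counter

def balanced_class_weights(labels):
    """'Balanced' weighting: n_samples / (n_classes * class_count)."""
    n, counts = len(labels), Counter(labels)
    return {c: n / (len(counts) * cnt) for c, cnt in counts.items()}

labels = [0] * 9 + [1] * 1              # 9:1 imbalance
print(balanced_class_weights(labels))   # minority class 1 gets weight 5.0
```

The loss then penalizes mistakes on the minority class roughly nine times as heavily here.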


13. Gradient Descent

Learning rate too large → divergence
Learning rate too small → very slow convergence

Used in:

  • Logistic regression
  • Linear regression
  • Neural networks
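
Both learning-rate failure modes can be seen on the toy objective f(w) = (w − 3)², whose gradient is 2(w − 3):

```python
def gradient_descent(lr, steps=50, w0=0.0):
    """Minimize f(w) = (w - 3)^2 by repeated steps w -= lr * grad."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * (w - 3)   # gradient of (w - 3)^2 is 2 * (w - 3)
    return w

print(round(gradient_descent(lr=0.1), 4))   # 3.0: a moderate LR converges
print(abs(gradient_descent(lr=1.1)) > 100)  # True: an oversized LR diverges
```

With lr = 0.1 the error shrinks by a factor of 0.8 per step; with lr = 1.1 each step overshoots the minimum and the error grows by a factor of 1.2.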

14. Common Exam Patterns

✔ Identify supervised vs unsupervised
✔ Classification vs regression
✔ Sigmoid range
✔ Interpretation of FP/FN
✔ When to use precision vs recall
✔ KNN sensitivity to scaling
✔ PCA purpose
✔ K-Means requires K
✔ Overfitting signs
✔ Methods to reduce overfitting
✔ Use of cross-validation
✔ Decision tree impurity metrics
✔ Handling missing values
✔ Imbalanced dataset metrics
✔ Train-test split usage

With this cheat sheet, you can answer all of the prediction questions.


posted @ 2025-11-28 19:36  Stéphane