6001 Cheat Sheet

Below is a comprehensive Machine Learning Cheat Sheet
covering every knowledge point from the earlier 8-chapter lecture summaries and the 150 practice prediction questions.


1. Machine Learning Taxonomy

Supervised Learning

  • Uses labeled data (X, y)
  • Tasks: classification, regression
  • Examples: spam detection, churn prediction, house price prediction

Unsupervised Learning

  • Uses only X (no labels)
  • Tasks: clustering, dimensionality reduction
  • Examples: customer segmentation, PCA, anomaly detection

2. Classification vs Regression

Classification

  • Output is discrete (class labels)
  • Examples: spam / not spam, fraud detection

Regression

  • Output is continuous
  • Examples: price, temperature

3. Important Algorithms

Logistic Regression

  • Binary classification
  • Uses sigmoid → probability (0–1)
  • Linear decision boundary
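
A minimal pure-Python sketch of the sigmoid at the heart of logistic regression (no library assumed):

```python
import math

def sigmoid(z):
    """Map any real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# The linear decision boundary sits where sigmoid(z) = 0.5, i.e. z = 0.
print(sigmoid(0))                          # 0.5
print(sigmoid(4) > 0.5, sigmoid(-4) < 0.5) # True True
```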

KNN

  • Non-parametric, lazy learner
  • Sensitive to feature scaling
  • Predicts by nearest neighbors (majority vote / average)
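
The vote-by-nearest-neighbors idea can be sketched in a few lines of plain Python (a toy illustration, not a production KNN):

```python
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among the k nearest training points."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(xi, x)), yi)  # squared Euclidean distance
        for xi, yi in zip(X_train, y_train)
    )
    top_k = [label for _, label in dists[:k]]
    return Counter(top_k).most_common(1)[0][0]

X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(X, y, (0.5, 0.5)))  # "A": the 3 nearest points are all class A
```

Because raw distances drive the vote, an unscaled feature with a large range dominates — which is exactly why KNN needs feature scaling.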

Decision Tree

  • Splits using Gini impurity / entropy
  • Easy to interpret
  • Overfits if not pruned
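
Gini impurity, the split criterion named above, is simple to compute directly:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2). 0 = pure node, 0.5 = worst binary split."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini([1, 1, 1, 1]))  # 0.0  (pure node)
print(gini([1, 1, 0, 0]))  # 0.5  (maximally impure binary node)
```

A tree chooses the split whose children have the lowest weighted impurity.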

Random Forest

  • Bagging ensemble of decision trees
  • Reduces variance
  • More stable than a single tree

K-Means

  • Requires K to be specified in advance
  • Minimizes within-cluster distance
  • Sensitive to feature scaling
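
One Lloyd iteration of K-Means (assign, then recompute means) on 1-D data can be sketched as follows — a toy version that assumes every centroid attracts at least one point:

```python
def kmeans_step(points, centroids):
    """One K-Means iteration: assign each point to its nearest centroid,
    then return the recomputed cluster means."""
    clusters = {c: [] for c in centroids}
    for p in points:
        nearest = min(centroids, key=lambda c: abs(p - c))
        clusters[nearest].append(p)
    return [sum(ps) / len(ps) for ps in clusters.values()]

points = [1.0, 2.0, 10.0, 11.0]
print(kmeans_step(points, [1.0, 10.0]))  # [1.5, 10.5]
```

Repeating this step until the centroids stop moving is the whole algorithm.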

PCA

  • Unsupervised dimensionality reduction
  • Keeps the directions of highest variance
  • Decorrelates features
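
A small demonstration of "keeps the directions of highest variance", assuming numpy is available. Points lying exactly on the line y = x have all their variance along [1, 1]/√2, and PCA recovers that direction:

```python
import numpy as np

# Perfectly correlated 2-D data: all variance lies along the line y = x.
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]])
Xc = X - X.mean(axis=0)                 # center the data
cov = np.cov(Xc, rowvar=False)          # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues ascending
pc1 = eigvecs[:, -1]                    # principal component = top eigenvector
print(np.round(np.abs(pc1), 4))         # [0.7071 0.7071], i.e. [1, 1]/sqrt(2)
```

Projecting onto `pc1` reduces the data to one dimension with no loss of variance here, and the second eigenvalue being 0 shows the features were fully correlated.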

4. Key Concepts & Terminology

Sigmoid

  • Output range: 0–1

Decision Boundary

  • The line/surface separating classes

Centroid (K-Means)

  • Cluster center

False Positive / False Negative

FP = predicted positive but actually negative (a false alarm)
FN = predicted negative but actually positive (a miss)


5. Model Evaluation

Confusion Matrix

              Pred +   Pred –
  Actual +    TP       FN
  Actual –    FP       TN

Metrics

Accuracy

Not reliable under class imbalance

Precision

TP / (TP + FP)
How many predicted positives are truly positive

Recall

TP / (TP + FN)
How many actual positives are caught

F1-score

Harmonic mean of precision and recall

AUC-ROC

Probability that the model ranks a random positive above a random negative

RMSE / MSE (Regression)

Error metrics for regression
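
The classification formulas above can be computed directly from the four confusion-matrix counts:

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard classification metrics from confusion-matrix counts."""
    accuracy  = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)          # of predicted positives, how many correct
    recall    = tp / (tp + fn)          # of actual positives, how many caught
    f1        = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# 8 true positives, 2 false alarms, 2 misses, 8 true negatives:
print(classification_metrics(tp=8, fp=2, fn=2, tn=8))
```

With these counts all four metrics come out to 0.8; on imbalanced data they diverge, which is why accuracy alone is unreliable.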


6. Overfitting vs Underfitting

Overfitting

  • High training accuracy, low test accuracy
  • Model too complex

Underfitting

  • Low training and low test accuracy
  • Model too simple

7. How to Reduce Overfitting 如何减少过拟合

✓ Regularization (L1 / L2) 正则化
✓ Early stopping 提前停止
✓ More training data 更多数据
✓ Cross-validation 交叉验证
✓ Simpler model 简单模型
✗ Increasing model complexity 会更严重


8. Regularization

L1 (Lasso)

  • Shrinks some weights exactly to zero → feature selection

L2 (Ridge)

  • Shrinks weights toward zero, but not exactly to zero

Purpose: prevent overfitting
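
The shrinking effect of L2 can be seen with ridge regression's closed-form solution, w = (XᵀX + λI)⁻¹Xᵀy — a sketch assuming numpy is available:

```python
import numpy as np

def ridge_weights(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam * I)^(-1) X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([3.0, 3.0, 7.0, 7.0])          # exactly y = x1 + x2
w_ols   = ridge_weights(X, y, lam=0.0)      # ordinary least squares: [1, 1]
w_ridge = ridge_weights(X, y, lam=10.0)     # heavily regularized
print(np.linalg.norm(w_ridge) < np.linalg.norm(w_ols))  # True: weights shrink
```

Larger λ shrinks the weight vector further, trading a little bias for lower variance — the mechanism behind "prevent overfitting".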


9. Feature Engineering

Scaling / Normalization

  • Needed for KNN, K-Means, logistic regression

One-Hot Encoding

  • For categorical features

Handling Missing Values

  • Mean/median imputation
  • KNN imputer
  • Drop columns/rows (when reasonable)
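
Mean imputation, the first option above, is a one-liner over the observed values:

```python
def impute_mean(values):
    """Replace None entries with the mean of the observed (non-None) values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

print(impute_mean([1.0, None, 3.0]))  # [1.0, 2.0, 3.0]
```

Median imputation is the same pattern with the median, and is more robust to outliers.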

10. Cross-Validation

K-Fold CV

  • Split the data into K folds
  • Train K times, each fold serving once as the validation set
  • More stable evaluation

Train/Validation/Test Split

  • Train: fit the model
  • Validation: tune hyperparameters
  • Test: final evaluation
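
The K-fold splitting logic can be sketched with plain index arithmetic (no shuffling, for clarity):

```python
def kfold_indices(n, k):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start, folds = 0, []
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

splits = list(kfold_indices(n=6, k=3))
print([val for _, val in splits])  # [[0, 1], [2, 3], [4, 5]]
```

Averaging the K validation scores gives the stable estimate the bullets describe.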

11. Data Leakage

Avoid:

  • Scaling before the train/test split
  • Using target-derived (or future-looking) features
  • Fitting preprocessing on the full dataset
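
The leakage-free pattern is: fit the preprocessing on the training split only, then apply it to the test split. A minimal sketch with standardization:

```python
def fit_scaler(train):
    """Learn mean and std from TRAINING data only (avoids leakage)."""
    mean = sum(train) / len(train)
    var = sum((x - mean) ** 2 for x in train) / len(train)
    return mean, var ** 0.5

def transform(values, mean, std):
    """Apply the already-fitted scaler to any split."""
    return [(x - mean) / std for x in values]

train, test = [0.0, 10.0], [5.0]
mean, std = fit_scaler(train)       # fit on the train split only...
print(transform(test, mean, std))   # ...then apply to test: [0.0]
```

Fitting the scaler on train and test together would leak test-set statistics into training, inflating the evaluation.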

12. Imbalanced Data

To handle imbalance:

  • Class weights
  • Oversampling (SMOTE)
  • Undersampling
  • Focus on precision, recall, and the PR curve

Avoid using accuracy as the only metric
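
The class-weight option is usually the "balanced" heuristic (the same formula scikit-learn's `class_weight='balanced'` uses): weight = n_samples / (n_classes × class_count), so rarer classes get larger weights.

```python
from collections import Counter

def balanced_class_weights(labels):
    """'Balanced' weighting: n_samples / (n_classes * class_count)."""
    n, counts = len(labels), Counter(labels)
    return {c: n / (len(counts) * cnt) for c, cnt in counts.items()}

labels = [0] * 9 + [1] * 1              # 9:1 imbalance
print(balanced_class_weights(labels))   # minority class 1 gets weight 5.0
```

The loss then penalizes mistakes on the minority class roughly nine times as heavily here.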


13. Gradient Descent

Learning rate too large → divergence
Learning rate too small → very slow convergence

Used in:

  • Logistic regression
  • Linear regression
  • Neural networks
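
Both learning-rate failure modes can be seen on the toy objective f(w) = (w − 3)², whose gradient is 2(w − 3):

```python
def gradient_descent(lr, steps=50, w0=0.0):
    """Minimize f(w) = (w - 3)^2 by repeated steps w -= lr * grad."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * (w - 3)   # gradient of (w - 3)^2 is 2 * (w - 3)
    return w

print(round(gradient_descent(lr=0.1), 4))   # 3.0: a moderate LR converges
print(abs(gradient_descent(lr=1.1)) > 100)  # True: an oversized LR diverges
```

With lr = 0.1 the error shrinks by a factor of 0.8 per step; with lr = 1.1 each step overshoots the minimum and the error grows by a factor of 1.2.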

14. Common Exam Patterns

✔ Identify supervised vs unsupervised
✔ Classification vs regression
✔ Sigmoid range
✔ Interpretation of FP/FN
✔ When to use precision vs recall
✔ KNN sensitivity to scaling
✔ PCA purpose
✔ K-Means requires K
✔ Overfitting signs
✔ Methods to reduce overfitting
✔ Use of cross-validation
✔ Decision tree impurity metrics
✔ Handling missing values
✔ Imbalanced dataset metrics
✔ Train-test split usage

With this cheat sheet, you can answer all of the prediction questions.


posted @ 2025-11-28 19:36  Stéphane