6001 Cheat Sheet
A comprehensive Machine Learning cheat sheet covering every topic from the eight lecture summaries and the 150 practice questions.
1. Machine Learning Taxonomy
Supervised Learning
- Uses labeled data (X, y)
- Tasks: classification, regression
- Examples: spam detection, churn prediction, house-price prediction
Unsupervised Learning
- Uses only X (no labels)
- Tasks: clustering, dimensionality reduction
- Examples: customer segmentation, PCA, anomaly detection
2. Classification vs Regression
Classification
- Output is a discrete label
- Examples: spam/not spam, fraud detection
Regression
- Output is a continuous value
- Examples: price, temperature
3. Important Algorithms
Logistic Regression
- Binary classification
- Applies the sigmoid to output a probability in (0, 1)
- Linear decision boundary
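A minimal sketch of the prediction step, assuming NumPy; the weights `w`, `b` and input `x` are made-up toy values for illustration:

```python
import numpy as np

def sigmoid(z):
    # Maps any real number into (0, 1), interpreted as P(y = 1 | x)
    return 1.0 / (1.0 + np.exp(-z))

# A logistic-regression prediction is sigmoid(w . x + b); the decision
# boundary is the set of points where w . x + b = 0 (probability 0.5).
w, b = np.array([2.0, -1.0]), 0.5
x = np.array([1.0, 1.5])

prob = sigmoid(w @ x + b)   # probability of the positive class
label = int(prob >= 0.5)    # thresholding gives a linear decision boundary
```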
KNN (k-Nearest Neighbors)
- Non-parametric, lazy learner (no training phase)
- Sensitive to feature scaling
- Predicts from the nearest neighbors (majority vote / average)
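A NumPy-only sketch of KNN classification on made-up toy data, showing why scaling matters: one feature here has a much larger range and would dominate the distance if left unscaled.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    # Lazy learner: store the data; at prediction time, vote among
    # the k nearest neighbours by Euclidean distance.
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return np.bincount(y_train[nearest]).argmax()  # majority vote

# Two features on very different scales (toy data)
X = np.array([[1.0, 100.0], [2.0, 110.0], [10.0, 105.0], [11.0, 95.0]])
y = np.array([0, 0, 1, 1])

# Min-max scale each feature to [0, 1] before computing distances
lo, span = X.min(axis=0), X.max(axis=0) - X.min(axis=0)
X_scaled = (X - lo) / span
x_new = (np.array([9.0, 100.0]) - lo) / span

pred = knn_predict(X_scaled, y, x_new, k=3)
```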
Decision Tree
- Splits using Gini impurity or entropy
- Easy to interpret
- Overfits if not pruned
Random Forest
- Bagging ensemble of decision trees
- Reduces variance
- More stable than a single tree
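A sketch comparing a single tree with a bagged forest, assuming scikit-learn is available; the dataset is synthetic and the sizes are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy dataset: a single deep tree tends to overfit, while averaging
# many bootstrapped trees (bagging) reduces variance.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

tree_acc = tree.score(X_te, y_te)
forest_acc = forest.score(X_te, y_te)
```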
K-Means
- Requires K to be specified in advance
- Minimizes within-cluster distance
- Sensitive to feature scaling
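A NumPy-only sketch of Lloyd's algorithm on two synthetic blobs; the data and iteration count are arbitrary illustration choices:

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    # K must be chosen up front. The algorithm alternates between
    # assigning points to the nearest centroid and recomputing centroids,
    # which decreases the within-cluster squared distance each step.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # keep the old centroid if a cluster happens to empty out
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(k)])
    return labels, centroids

# Toy data: two well-separated blobs (scale features first in real use)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),
               rng.normal(5.0, 0.5, size=(20, 2))])
labels, centroids = kmeans(X, k=2)
```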
PCA (Principal Component Analysis)
- Unsupervised dimensionality reduction
- Keeps the directions of highest variance
- Decorrelates features
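A NumPy sketch of PCA via SVD on made-up data where two features are strongly correlated, so the first component captures almost all the variance:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 2 * x1 + rng.normal(scale=0.1, size=200)  # strongly correlated with x1
X = np.column_stack([x1, x2])

Xc = X - X.mean(axis=0)                 # centre the data first
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / (S**2).sum()         # variance ratio per component
Z = Xc @ Vt[0]                          # project onto the first component
```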
4. Key Concepts & Terminology
Sigmoid
- Output range: (0, 1)
Decision Boundary
- The line/surface separating the classes
Centroid (K-Means)
- The center of a cluster
False Positive / False Negative
- FP = predicted positive but actually negative (a false alarm)
- FN = predicted negative but actually positive (a miss)
5. Model Evaluation
Confusion Matrix
|          | Pred + | Pred – |
|---|---|---|
| Actual + | TP | FN |
| Actual – | FP | TN |
Metrics
Accuracy
- (TP + TN) / total; unreliable under class imbalance
Precision
- TP / (TP + FP): of the predicted positives, how many are actually positive
Recall
- TP / (TP + FN): of the actual positives, how many are found
F1-score
- Harmonic mean of precision and recall
AUC-ROC
- Probability that the model ranks a random positive above a random negative
RMSE / MSE (Regression)
- Error metrics for regression
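The classification metrics above, computed directly from confusion-matrix counts; the counts TP, FP, FN, TN are made-up numbers for illustration:

```python
# Hypothetical confusion-matrix counts
TP, FP, FN, TN = 40, 10, 20, 130

accuracy  = (TP + TN) / (TP + FP + FN + TN)
precision = TP / (TP + FP)          # of predicted positives, how many are right
recall    = TP / (TP + FN)          # of actual positives, how many are found
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean
```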
6. Overfitting vs Underfitting
Overfitting
- High train accuracy, low test accuracy
- Model too complex
Underfitting
- Low train and low test accuracy
- Model too simple
7. How to Reduce Overfitting
✓ Regularization (L1 / L2)
✓ Early stopping
✓ More training data
✓ Cross-validation
✓ Simpler model
✗ Increasing model complexity (makes overfitting worse)
8. Regularization
L1 (Lasso)
- Shrinks some weights to exactly zero → performs feature selection
L2 (Ridge)
- Shrinks weights toward zero but never exactly to zero
Purpose: prevent overfitting
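A sketch of the L1-vs-L2 difference, assuming scikit-learn; the data is synthetic, with only 2 of 10 features actually relevant, and the `alpha` values are arbitrary:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only features 0 and 1 matter; the other 8 are pure noise
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)  # L1: drives irrelevant weights to exactly 0
ridge = Ridge(alpha=0.1).fit(X, y)  # L2: shrinks weights but keeps them nonzero

n_zero_lasso = int((lasso.coef_ == 0).sum())
n_zero_ridge = int((ridge.coef_ == 0).sum())
```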
9. Feature Engineering
Scaling / Normalization
- Needed for distance- and gradient-based models, e.g. KNN, K-Means, logistic regression
One-Hot Encoding
- Encodes categorical features as binary indicator columns
Handling Missing Values
- Mean/median imputation
- KNN imputer
- Drop columns/rows (when justified)
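A NumPy sketch of mean vs median imputation on a made-up column with missing values; note how the median is robust to the outlier:

```python
import numpy as np

# A feature column with missing values encoded as NaN (toy data)
col = np.array([1.0, 2.0, np.nan, 4.0, np.nan, 100.0])

mean_fill   = np.where(np.isnan(col), np.nanmean(col), col)
median_fill = np.where(np.isnan(col), np.nanmedian(col), col)
# The median (3.0 here) ignores the outlier 100.0; the mean (26.75) does not.
```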
10. Cross-Validation
K-Fold CV
- Split the data into K folds
- Train K times, each time holding out a different fold for evaluation
- Gives a more stable estimate than a single split
Train/Validation/Test Split
- Train: fit the model
- Validation: tune hyperparameters
- Test: final evaluation
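A sketch of 5-fold cross-validation with scikit-learn (assumed available), using the built-in iris dataset as a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold CV: train 5 times, each time holding out a different fold,
# then average the 5 held-out scores for a stable estimate.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
mean_score = scores.mean()
```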
11. Data Leakage
Avoid:
- Scaling before the train/test split
- Using features derived from the target (or from future information)
- Fitting preprocessing on the full dataset
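A leakage-free sketch using a scikit-learn pipeline (assumed available) on synthetic data: the split happens first, and the scaler is fitted only on the training fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Split FIRST, then fit the scaler only on training data; the pipeline
# ensures the scaler never sees the test set during fitting.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)
test_acc = pipe.score(X_te, y_te)
```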
12. Imbalanced Data
To handle imbalance:
- Class weights
- Oversampling (e.g. SMOTE)
- Undersampling
- Focus on precision, recall, and the PR curve
Avoid using accuracy as the only metric.
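A sketch of the class-weights remedy, assuming scikit-learn; the 95/5 class split is a made-up illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# ~95% negatives: predicting "all negative" already gives ~95% accuracy,
# which is exactly why accuracy alone is misleading here.
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X, y)
balanced = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, y)

# Recall on the minority (positive) class
recall_plain = recall_score(y, plain.predict(X))
recall_balanced = recall_score(y, balanced.predict(X))
```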
13. Gradient Descent
Learning rate too large → divergence
Learning rate too small → very slow convergence
Used in:
- Logistic regression
- Linear regression
- Neural networks
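A NumPy-free sketch on a toy 1-D objective, f(w) = (w − 3)², showing both a learning rate that converges and one that diverges; all constants are illustrative:

```python
def grad(w):
    # Gradient of f(w) = (w - 3)^2
    return 2 * (w - 3.0)

# Moderate learning rate: converges to the minimiser w = 3
w, lr = 0.0, 0.1
for _ in range(100):
    w -= lr * grad(w)

# Too-large learning rate: each step overshoots and the error grows
w_big, lr_big = 0.0, 1.1
for _ in range(50):
    w_big -= lr_big * grad(w_big)
```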
14. Common Exam Patterns
✔ Identify supervised vs unsupervised
✔ Classification vs regression
✔ Sigmoid range
✔ Interpretation of FP/FN
✔ When to use precision vs recall
✔ KNN sensitivity to scaling
✔ PCA purpose
✔ K-Means requires K
✔ Overfitting signs
✔ Methods to reduce overfitting
✔ Use of cross-validation
✔ Decision tree impurity metrics
✔ Handling missing values
✔ Imbalanced dataset metrics
✔ Train-test split usage
With this cheat sheet, you can answer all of the practice questions.
