Linear Regression - Subset Selection
1 Best Subset Selection
- \(2^p\) models, \(p\) is the number of predictors
Algorithm:
- Let \(\mathcal{M}_0\) denote the null model, which contains an intercept but no predictors. This model simply predicts the sample mean for each observation.
- For \(k = 1, 2, \cdots, p\):
  - Fit all \(\displaystyle \binom{p}{k}\) models that contain exactly \(k\) predictors.
  - Pick the best among these \(\displaystyle \binom{p}{k}\) models, and call it \(\mathcal{M}_k\). Here best is defined as having the smallest \(\text{RSS}\), or equivalently the largest \(R^2\).
- Select a single best model from among \(\mathcal{M}_0, \mathcal{M}_1, \cdots, \mathcal{M}_p\) using cross-validated prediction error, \(C_p\) (\(\text{AIC}\)), \(\text{BIC}\), or adjusted \(R^2\).
Note: The \(\text{RSS}\) of these \(p + 1\) models decreases monotonically, and the \(R^2\) increases monotonically, as the number of features included in the models increases. Therefore, Step 3 must compare models with different numbers of predictors using cross-validated prediction error, \(C_p\), \(\text{BIC}\), or adjusted \(R^2\), rather than \(\text{RSS}\) or \(R^2\).
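As a concrete illustration, Steps 1 and 2 can be sketched in a few lines of plain Python (a minimal language-neutral sketch, not the leaps implementation; the `rss` helper is a name introduced here, computing ordinary least squares via numpy):

```python
import itertools
import numpy as np

def rss(X, y):
    """Residual sum of squares of an OLS fit with an intercept."""
    Xd = np.column_stack([np.ones(len(y)), X])   # prepend intercept column
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    return float(resid @ resid)

def best_subset(X, y):
    """Steps 1-2: the best model M_k (by RSS) for each size k = 0..p."""
    n, p = X.shape
    models = {0: ()}                             # M_0: the null model
    for k in range(1, p + 1):
        # fit all C(p, k) models with exactly k predictors; keep the smallest RSS
        models[k] = min(itertools.combinations(range(p), k),
                        key=lambda S: rss(X[:, list(S)], y))
    return models
```

Step 3 would then compare the \(p + 1\) candidates in `models` using one of the penalized criteria above; comparing them by \(\text{RSS}\) would always pick \(\mathcal{M}_p\).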
2 Forward Stepwise Selection
Algorithm:
- Let \(\mathcal{M}_0\) denote the null model, which contains an intercept but no predictors.
- For \(k = 0, 1, \cdots, p - 1\):
  - Consider all \(p - k\) models that augment the predictors in \(\mathcal{M}_k\) with one additional predictor.
  - Choose the best among these \(p - k\) models, and call it \(\mathcal{M}_{k+1}\). Here best is defined as having the smallest \(\text{RSS}\), or equivalently the largest \(R^2\).
- Select a single best model from among \(\mathcal{M}_0, \mathcal{M}_1, \cdots, \mathcal{M}_p\) using cross-validated prediction error, \(C_p\) (\(\text{AIC}\)), \(\text{BIC}\), or adjusted \(R^2\).
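A minimal Python sketch of the greedy loop above (the `rss` helper is a hypothetical name, an ordinary-least-squares fit via numpy; note that only about \(p(p+1)/2\) models are fit instead of \(2^p\)):

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of an OLS fit with an intercept."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    return float(resid @ resid)

def forward_stepwise(X, y):
    """Greedily grow M_0, M_1, ..., M_p, adding one predictor at a time."""
    n, p = X.shape
    selected, models = [], {0: ()}
    for k in range(p):                           # k = 0, 1, ..., p - 1
        remaining = [j for j in range(p) if j not in selected]
        # among the p - k one-predictor augmentations, keep the best by RSS
        best_j = min(remaining, key=lambda j: rss(X[:, selected + [j]], y))
        selected.append(best_j)
        models[k + 1] = tuple(selected)
    return models
```

Because each \(\mathcal{M}_{k+1}\) must contain all of \(\mathcal{M}_k\), the search is not guaranteed to find the same models as best subset selection.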
3 Backward Stepwise Selection
Algorithm:
- Let \(\mathcal{M}_p\) denote the full model, which contains all \(p\) predictors.
- For \(k = p, p-1, \cdots, 1\):
  - Consider all \(k\) models that contain all but one of the predictors in \(\mathcal{M}_k\), for a total of \(k - 1\) predictors.
  - Choose the best among these \(k\) models, and call it \(\mathcal{M}_{k-1}\). Here best is defined as having the smallest \(\text{RSS}\), or equivalently the largest \(R^2\).
- Select a single best model from among \(\mathcal{M}_0, \mathcal{M}_1, \cdots, \mathcal{M}_p\) using cross-validated prediction error, \(C_p\) (\(\text{AIC}\)), \(\text{BIC}\), or adjusted \(R^2\).
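The mirror-image sketch for backward elimination (again a plain numpy illustration with a hypothetical `rss` helper, not the leaps code; backward selection requires \(n > p\) so that the full model can be fit):

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of an OLS fit with an intercept."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    return float(resid @ resid)

def backward_stepwise(X, y):
    """Start from the full model M_p and drop one predictor at a time."""
    n, p = X.shape
    selected = list(range(p))
    models = {p: tuple(selected)}
    for k in range(p, 0, -1):                    # k = p, p - 1, ..., 1
        # among the k models that drop one predictor from M_k, keep the best
        drop = min(selected,
                   key=lambda j: rss(X[:, [i for i in selected if i != j]], y))
        selected.remove(drop)
        models[k - 1] = tuple(selected)
    return models
```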
4 Hybrid Approaches
Stepwise selection (sequential replacement) combines forward and backward selection. Start with no predictors, then sequentially add the most contributive predictor (as in forward selection). After adding each new variable, remove any variables that no longer provide an improvement in the model fit (as in backward selection).
Such an approach attempts to more closely mimic best subset selection while retaining the computational advantages of forward and backward stepwise selection.
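A sketch of one such hybrid, assuming \(\text{BIC}\) as the comparison criterion (the description above does not fix a criterion, and \(\text{RSS}\) cannot be used here because adding a variable never increases it; `rss`, `bic`, and `hybrid_stepwise` are names introduced for this illustration):

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of an OLS fit with an intercept."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    return float(resid @ resid)

def bic(X, y, subset):
    """Gaussian BIC of the model using the given predictor subset."""
    n = len(y)
    d = len(subset) + 1                          # predictors plus intercept
    return n * np.log(rss(X[:, list(subset)], y) / n) + d * np.log(n)

def hybrid_stepwise(X, y):
    """Greedy add-then-prune search, scoring candidate models by BIC."""
    n, p = X.shape
    selected = []
    current = bic(X, y, selected)
    improved = True
    while improved:                              # each change strictly lowers BIC
        improved = False
        # forward step: add the single predictor that lowers BIC the most
        remaining = [j for j in range(p) if j not in selected]
        if remaining:
            j = min(remaining, key=lambda j: bic(X, y, selected + [j]))
            if bic(X, y, selected + [j]) < current:
                selected.append(j)
                current = bic(X, y, selected)
                improved = True
        # backward step: drop any predictor whose removal lowers BIC
        for j in list(selected):
            rest = [i for i in selected if i != j]
            if bic(X, y, rest) < current:
                selected.remove(j)
                current = bic(X, y, selected)
                improved = True
    return sorted(selected)
```

This is in the spirit of regsubsets(..., method="seqrep") and step(..., direction="both") below, though their exact search rules differ.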
5 Implementation in R
5.1 regsubsets function in leaps package
5.1.1 Installation
install.packages("leaps")
Main arguments of the regsubsets() function:
- nvmax: maximum size of subsets (number of variables) to examine
- intercept: bool; whether to include an intercept
- method: str; one of "exhaustive", "backward", "forward", "seqrep"
- force.in: str vector or bool vector; variables to force into the model
- nbest: int; the number of best models to keep for each subset size
5.1.2 Example
- Load the data

library(ISLR)
names(Hitters) # variable names
Hitters = na.omit(Hitters) # drop observations with missing values
dim(Hitters)
sum(is.na(Hitters))
- Best Subset Selection
regfit.full = regsubsets(Salary ~ ., data=Hitters, nvmax=19)
reg.summary = summary(regfit.full)
reg.summary
View the fit statistics of the model at each step, e.g. \(R^2\) ("rsq"), \(\text{RSS}\) ("rss"), adjusted \(R^2\) ("adjr2"), \(\text{BIC}\) ("bic"), etc.

names(reg.summary)
# Output
# [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
reg.summary$rsq
- Plots

The leaps package provides a built-in plot() method for regsubsets objects:

plot(regfit.full, scale="r2")
plot(regfit.full, scale="adjr2")
plot(regfit.full, scale="Cp")
plot(regfit.full, scale="bic")
- Coefficients and covariance matrix of the models

coef(regfit.full, 1:3) # coefficients of models 1 to 3
vcov(regfit.full, 1)
- Forward, backward, and sequential-replacement selection

regfit.fwd = regsubsets(Salary ~ ., data=Hitters, nvmax=19, method="forward")
summary(regfit.fwd)
regfit.bwd = regsubsets(Salary ~ ., data=Hitters, nvmax=19, method="backward")
summary(regfit.bwd)
regfit.seq = regsubsets(Salary ~ ., data=Hitters, nvmax=19, method="seqrep")
summary(regfit.seq)
5.2 The step Function
Alternatively, the stepAIC function in the MASS package.
Performs stepwise regression based on the AIC criterion (choosing the model with the smallest AIC); stepping stops when AIC no longer decreases.
Main arguments of the step function:
- direction: str; "both", "backward", or "forward"
- trace: bool or int; an integer prints the details of each step of the regression
Example
lm.inter_only = lm(Salary ~ 1, data=Hitters)
lm.all = lm(Salary ~ ., data=Hitters)
# Forward Stepwise Selection
step.fwd = step(lm.inter_only, direction="forward", scope=formula(lm.all), trace=1)
summary(step.fwd)
# Backward Stepwise Selection
step.bwd = step(lm.all, direction="backward", scope=formula(lm.all), trace=1)
summary(step.bwd)
