Statistical Machine Learning - Introduction to Statistical Learning - Reading Notes - CH4 Classification
response variable:
- quantitative
- qualitative / categorical
methods for classification
- first predict the probability that the observation belongs to each of the categories of a qualitative variable, then assign the observation to the class with the highest predicted probability.
Reading tips:
- The discussion of logistic regression is used as a jumping-off point for a discussion of generalized linear models, and in particular Poisson regression.
4.1 Overview
In the classification setting, we have a set of training observations \((x_1,y_1),...,(x_n,y_n)\) that we can use to build a classifier.
Data set: Default, with columns including income and balance.
4.2 Why not Linear Regression?
For example, $$Y = \left\{
\begin{array}{lr}
1 & \text{if stroke}\\
2 & \text{if drug overdose}\\
3 & \text{if epileptic seizure}
\end{array}
\right.$$
Facts:
- This coding implies an ordering on the outcomes, putting drug overdose in between stroke and epileptic seizure, and insisting that the difference between stroke and drug overdose is the same as the difference between drug overdose and epileptic seizure.
- Equivalently, it treats drug overdose as the midpoint of the other two: \(0.5 \cdot \text{epileptic seizure} + 0.5 \cdot \text{stroke} = 1 \cdot \text{drug overdose}\)
- If the response variable’s values did take on a natural ordering, such as mild, moderate, and severe, and we felt the gap between mild and moderate was similar to the gap between moderate and severe, then a 1, 2, 3 coding would be reasonable. Unfortunately, in general there is no natural way to convert a qualitative response variable with more than two levels into a quantitative response that is ready for linear regression.
- Curiously, it turns out that the classifications that we get if we use linear regression to predict a binary response will be the same as for the linear discriminant analysis (LDA) procedure we discuss in Section 4.4.
- For a binary response with a \(0/1\) coding as above, regression by least squares is not completely unreasonable: it can be shown that the \(X\hat \beta\) obtained using linear regression is in fact an estimate of \(Pr(\text{drug overdose}|X)\) in this special case. However, if we use linear regression, some of our estimates might be outside the [0, 1] interval, making them hard to interpret as probabilities.
Summary: There are at least two reasons not to perform classification using a regression method:
(a) a regression method cannot accommodate a qualitative response with more than two classes;
(b) a regression method will not provide meaningful estimates of \(Pr(Y |X)\), even with just two classes.
4.3 Logistic Regression
Logistic regression models the probability that Y belongs to a particular category.
4.3.1 The Logistic Model
First, consider a linear regression model: \(p(X)=\beta_0+\beta_1 X\). Any time a straight line is fit to a binary response that is coded as \(0\) or \(1\), in principle we can always predict \(p(X) < 0\) for some values of \(X\) and \(p(X) > 1\) for others (unless the range of \(X\) is limited). To avoid this problem, we must model \(p(X)\) using a function that gives outputs between \(0\) and \(1\) for all values of \(X\). In logistic regression, we use the logistic function,$$p(X)=\frac{e^{\beta_0+\beta_1X}}{1+e^{\beta_0+\beta_1X}}$$
To fit the model above, we use a method called maximum likelihood estimation (MLE).
After a bit of manipulation of above equation, we find that $$\frac{p(X)}{1 - p(X)} = e^{\beta_0+\beta_1X}$$
The quantity \(p(X)/[1-p(X)]\) is called the odds, and can take on any value between \(0\) and \(\infty\).
By taking the logarithm of both sides, we arrive at $$log\Big(\frac{p(X)}{1-p(X)}\Big)=\beta_0+\beta_1X$$
The left-hand side is called the log odds or logit. We see that the logistic regression model has a logit that is linear in \(X\).
The amount that \(p(X)\) changes due to a one-unit change in \(X\) depends on the current value of \(X\). But regardless of the value of \(X\), if \(\beta_1\) is positive then increasing \(X\) will be associated with increasing \(p(X)\), and if \(\beta_1\) is negative then increasing \(X\) will be associated with decreasing \(p(X)\).
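As a quick numerical check, here is a minimal Python sketch (with made-up coefficients \(\beta_0 = 0.5\), \(\beta_1 = 2\)) showing that the logistic function always returns values strictly between 0 and 1, and that its logit is exactly linear in \(x\):

```python
import numpy as np

def logistic(x, beta0, beta1):
    """p(X) = exp(beta0 + beta1 * x) / (1 + exp(beta0 + beta1 * x))."""
    z = beta0 + beta1 * x
    return np.exp(z) / (1.0 + np.exp(z))

x = np.linspace(-5, 5, 11)
p = logistic(x, beta0=0.5, beta1=2.0)  # made-up coefficients, for illustration only
print(p.min(), p.max())                # outputs stay strictly between 0 and 1
print(np.log(p / (1 - p)))             # the logit: equals 0.5 + 2.0 * x
```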
4.3.2 Estimating the Regression Coefficients
The basic intuition behind using maximum likelihood to fit a logistic regression model is as follows: we seek estimates for \(\beta_0\) and \(\beta_1\) such that the predicted probability \(\hat p(x_i)\) of default for each individual corresponds as closely as possible to the individual's observed default status.
Here, the likelihood function is $$\ell(\beta_0,\beta_1)=\prod_{i:y_i = 1}p(x_i)\prod_{i':y_{i'} = 0} (1 - p(x_{i'}))$$ and the estimates \(\hat\beta_0\) and \(\hat\beta_1\) are chosen to maximize it.
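A sketch of maximum likelihood fitting on simulated data (not the book's Default data): we minimize the negative log of the likelihood \(\ell(\beta_0,\beta_1)\) above with scipy. The "true" coefficients \((-1, 2)\) used to simulate the response are made up for illustration.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(size=200)                                  # simulated predictor
p_true = 1 / (1 + np.exp(-(-1.0 + 2.0 * x)))              # made-up "true" coefficients (-1, 2)
y = (rng.uniform(size=200) < p_true).astype(float)        # simulated 0/1 response

def neg_log_likelihood(beta):
    b0, b1 = beta
    p = 1 / (1 + np.exp(-(b0 + b1 * x)))                  # p(x_i)
    # negative log of l(b0, b1) = prod_{i: y_i=1} p(x_i) * prod_{i': y_i'=0} (1 - p(x_i'))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

fit = minimize(neg_log_likelihood, x0=np.zeros(2))
print(fit.x)  # maximum likelihood estimates, close to (-1, 2)
```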
4.3.3 Making Predictions
Maximum likelihood is a very general approach that is used to fit many of the non-linear models that we examine throughout this book.
For a qualitative predictor with two categories, we can simply create a dummy variable that takes the values 0/1.
4.3.4 Multiple Logistic Regression
We model $$log\Big(\frac{p(X)}{1-p(X)}\Big)=\beta_0+\beta_1X_1+...+\beta_pX_p$$ where \(X = (X_1,...,X_p)\) are \(p\) predictors.
The above equation is equivalent to $$p(X) = \frac{e^{\beta_0 + \beta_1X_1+...+\beta_pX_p}}{1+e^{\beta_0 + \beta_1X_1+...+\beta_pX_p}}$$
Confounding
For the Default dataset, the negative coefficient for student in the multiple logistic regression indicates that, for a fixed value of balance and income, a student is less likely to default than a non-student.
But the bar plot, which shows the default rates for students and non-students averaged over all values of balance and income, suggests the opposite effect: the overall student default rate is higher than the non-student default rate.
Thus, even though an individual student with a given credit card balance will tend to have a lower probability of default than a non-student with the same credit card balance, the fact that students on the whole tend to have higher credit card balances means that overall, students tend to default at a higher rate than non-students.
![[Pasted image 20221017160629.png]]
![[Pasted image 20221017160707.png]]
(Confounding) This simple example illustrates the dangers and subtleties associated with performing regressions involving only a single predictor when other predictors may also be relevant. As in the linear regression setting, the results obtained using one predictor may be quite different from those obtained using multiple predictors, especially when there is correlation among the predictors.
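A simulated sketch of this confounding effect (the data-generating coefficients below are made up, not the actual Default data): students are given higher balances on average, and balance drives default, so the sign of the student coefficient flips once balance is included.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 5000
student = rng.binomial(1, 0.3, n)                      # 1 = student
balance = rng.normal(1000 + 600 * student, 300, n)     # students carry higher balances (assumed)
# default depends mainly on balance; at a *fixed* balance, students default slightly less
p_default = 1 / (1 + np.exp(-(-8.0 + 0.006 * balance - 0.6 * student)))
default = rng.binomial(1, p_default)

fit_single = LogisticRegression(C=1e6).fit(student.reshape(-1, 1), default)
fit_multi = LogisticRegression(C=1e6, max_iter=1000).fit(
    np.column_stack([student, balance / 1000]), default)  # balance in $1000s for stability

print(fit_single.coef_)  # student alone: positive coefficient (students default more overall)
print(fit_multi.coef_)   # controlling for balance: the student coefficient turns negative
```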
4.3.5 Multinomial Logistic Regression
Want to do: classify a response variable that has more than two classes
However, the logistic regression approach that we have seen in this section only allows for K = 2 classes for the response variable.
It turns out that it is possible to extend the two-class logistic regression approach to the setting of K > 2 classes. This extension is sometimes known as multinomial logistic regression.
- We first select a single class to serve as the baseline; without loss of generality, we select the \(K\)th class for this role. Then we model $$Pr(Y=k|X=x)=\frac{e^{\beta_{k0}+\beta_{k1}x_1+...+\beta_{kp}x_p}}{1+\sum_{l=1}^{K-1}e^{\beta_{l0}+\beta_{l1}x_1+...+\beta_{lp}x_p}}$$ for \(k=1,...,K-1\), and $$Pr(Y=K|X=x)=\frac{1}{1+\sum_{l=1}^{K-1}e^{\beta_{l0}+\beta_{l1}x_1+...+\beta_{lp}x_p}}$$
- so that $$log\Big(\frac{Pr(Y=k|X=x)}{Pr(Y=K|X=x)}\Big)=\beta_{k0}+\beta_{k1}x_1+...+\beta_{kp}x_p$$ for \(k=1,...,K-1\). This indicates that the log odds between any pair of classes is linear in the features.
- The coefficient estimates will differ depending on the choice of baseline, but the fitted values (predictions), the log odds between any pair of classes, and the other key model outputs will remain the same.
Softmax coding
The softmax coding is equivalent to the baseline coding just described, in the sense that the fitted values, log odds between any pair of classes, and other key model outputs will remain the same, regardless of coding.
In the softmax coding, rather than selecting a baseline class, we treat all K classes symmetrically, and assume that for \(k = 1,...,K\), $$Pr(Y=k|X=x)=\frac{e^{\beta_{k0}+\beta_{k1}x_1+...+\beta_{kp}x_p}}{\sum_{l=1}^{K}e^{\beta_{l0}+\beta_{l1}x_1+...+\beta_{lp}x_p}}$$ Thus, rather than estimating coefficients for K − 1 classes, we actually estimate coefficients for all K classes. The log odds ratio between the \(k\)th and \(k'\)th classes equals $$log\Big(\frac{Pr(Y=k|X=x)}{Pr(Y=k'|X=x)}\Big)=(\beta_{k0}-\beta_{k'0}) + (\beta_{k1}-\beta_{k'1}) x_1+...+(\beta_{kp}-\beta_{k'p})x_p$$
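A small numpy sketch of the softmax coding with a made-up coefficient matrix: the class probabilities sum to 1, and the log odds between any pair of classes depend only on differences of coefficients.

```python
import numpy as np

def softmax_probs(x, B):
    """Class probabilities under the softmax coding.
    x: length-p feature vector; B: (K, p+1) array of rows [beta_k0, beta_k1, ..., beta_kp]."""
    scores = B[:, 0] + B[:, 1:] @ x       # beta_k0 + beta_k1 * x_1 + ... + beta_kp * x_p
    e = np.exp(scores - scores.max())     # subtract the max for numerical stability
    return e / e.sum()

B = np.array([[ 0.2,  1.0, -0.5],         # made-up coefficients: K = 3 classes, p = 2
              [-0.1,  0.3,  0.8],
              [ 0.0, -1.2,  0.4]])
x = np.array([0.5, 1.5])
p = softmax_probs(x, B)
print(p, p.sum())                                          # probabilities sum to 1
print(np.log(p[0] / p[1]), (B[0] - B[1]) @ np.r_[1.0, x])  # log odds = coefficient differences
```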
4.4 Generative Models for Classification
In statistical jargon, we model the conditional distribution of the response Y , given the predictor(s) X.
In this new approach, we model the distribution of the predictors X separately in each of the response classes (i.e. for each value of Y ). We then use Bayes’ theorem to flip these around into estimates for \(Pr(Y = k|X = x)\). When the distribution of X within each class is assumed to be normal, it turns out that the model is very similar in form to logistic regression.
advantages:
- When there is substantial separation between the two classes, the parameter estimates for the logistic regression model are surprisingly unstable. The methods that we consider in this section do not suffer from this problem.
- If the distribution of the predictors X is approximately normal in each of the classes and the sample size is small, then the approaches in this section may be more accurate than logistic regression.
- The methods in this section can be naturally extended to the case of more than two response classes. (In the case of more than two response classes, we can also use multinomial logistic regression from Section 4.3.5.)
Let \(\pi_k\) denote the prior probability that a randomly chosen observation comes from the \(k\)th class, and let \(f_k(X) ≡ Pr(X|Y = k)\) denote the density function of X for an observation that comes from the \(k\)th class. Then Bayes' theorem states that $$Pr(Y= k|X = x)=\frac{\pi_kf_k(x)}{\sum_{l=1}^K\pi_l f_l(x)}$$
- \(p_k(x) = Pr(Y = k|X = x)\) is the posterior probability that an observation \(X = x\) belongs to the \(k\)th class.
- In general, estimating \(\pi_k\) is easy if we have a random sample from the population: we simply compute the fraction of the training observations that belong to the \(k\)th class.
- As we will see, to estimate \(f_k(x)\), we will typically have to make some simplifying assumptions.
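A toy one-dimensional sketch of plugging priors \(\pi_k\) and class densities \(f_k\) into Bayes' theorem above (the priors, means, and shared standard deviation below are made up):

```python
import numpy as np
from scipy.stats import norm

# toy one-dimensional setting with K = 2 classes (all numbers below are made up)
pi = np.array([0.8, 0.2])      # prior probabilities pi_k (class fractions in the population)
mu = np.array([0.0, 2.0])      # class means
sigma = 1.0                    # shared standard deviation

def posterior(x):
    """p_k(x) = pi_k f_k(x) / sum_l pi_l f_l(x), with Gaussian densities f_k."""
    f = norm.pdf(x, loc=mu, scale=sigma)
    return pi * f / np.sum(pi * f)

print(posterior(0.5))          # posterior probabilities for the two classes; they sum to 1
```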
4.4.1 Linear Discriminant Analysis for p=1
Assumptions:
- \(f_k(x)\) is normal or Gaussian
- \(\sigma_1^2=...=\sigma_K^2=\sigma^2\) (a shared variance)
so we have $$p_k(x)=\frac{\pi_k\frac{1}{\sqrt{2\pi}\sigma}exp(-\frac{1}{2\sigma^2}(x-\mu_k)^2)}{\sum_{l=1}^K \pi_l\frac{1}{\sqrt{2\pi}\sigma}exp(-\frac{1}{2\sigma^2}(x-\mu_l)^2)}$$
Taking the log and discarding terms that do not depend on \(k\), this is equivalent to assigning the observation to the class for which \(\delta_k(x) = x \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2}+log(\pi_k)\) is largest.
The Bayes decision boundary is the point for which \(\delta_1(x) = \delta_2(x)\); one can show that this amounts to$$x=\frac{\mu_1^2 - \mu_2^2}{2(\mu_1 - \mu_2)} = \frac{\mu_1 + \mu_2}{2}$$ (Assume \(\pi_i = \pi_j\) for all i,j)
In practice, the following estimates are used: $$\hat \mu_k = \frac{1}{n_k}\sum_{i:y_i=k}x_i$$$$\hat \sigma^2 = \frac{1}{n-K} \sum_{k=1}^{K}\sum_{i:y_i=k}(x_i-\hat \mu_k)^2$$
The discriminant functions \(\hat \delta_k(x)\) are linear functions of \(x\) (hence the name linear discriminant analysis).
Summary: the LDA classifier results from assuming that the observations within each class come from a normal distribution with a class-specific mean and a common variance \(\sigma^2\), and plugging estimates for these parameters into the Bayes classifier.
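A sketch of LDA with p = 1 on simulated data, using the plug-in estimates \(\hat\mu_k\), \(\hat\sigma^2\), \(\hat\pi_k\) above (the class means and sample sizes are made up):

```python
import numpy as np

rng = np.random.default_rng(2)
# simulate K = 2 classes with different means and a shared variance (values are made up)
x = np.concatenate([rng.normal(-1.25, 1.0, 20), rng.normal(1.25, 1.0, 20)])
y = np.r_[np.zeros(20), np.ones(20)].astype(int)

K, n = 2, len(x)
pi_hat = np.array([np.mean(y == k) for k in range(K)])
mu_hat = np.array([x[y == k].mean() for k in range(K)])
# pooled variance: (1/(n-K)) * sum_k sum_{i: y_i = k} (x_i - mu_hat_k)^2
sigma2_hat = sum(np.sum((x[y == k] - mu_hat[k]) ** 2) for k in range(K)) / (n - K)

def delta_hat(x0):
    """Estimated discriminant functions; classify x0 to the class with the largest value."""
    return x0 * mu_hat / sigma2_hat - mu_hat ** 2 / (2 * sigma2_hat) + np.log(pi_hat)

print(delta_hat(2.0), np.argmax(delta_hat(2.0)))  # x0 = 2 lies closer to mu_hat[1], so class 1 wins
```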
4.4.2 Linear Discriminant Analysis for p > 1
Assumptions:
- \(X = (X_1, X_2,...,X_p)\) is drawn from a multivariate Gaussian (or multivariate normal) distribution, with a class-specific mean vector and a common covariance matrix.
Note: the multivariate Gaussian distribution assumes that each individual predictor follows a one-dimensional normal distribution, with some correlation between each pair of predictors.
An example in which \(Var(X_1) = Var(X_2)\) and \(Cor(X_1, X_2) = 0\): this surface has a characteristic bell shape (like a dome that is symmetric about its center); if \(Cor(X_1, X_2) \neq 0\), the dome is squashed along a diagonal.
![[Pasted image 20221017165257.png]]
We write \(X \sim N(\mu, \Sigma)\). Here \(E(X) = \mu\) is the mean of X (a vector with p components), and \(Cov(X) = \Sigma\) is the p × p covariance matrix of \(X\). Formally, the multivariate Gaussian density is defined as $$f(x)=\frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}exp\Big(-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)\Big)$$
The discriminant functions $$\delta_k(x)=x^T\Sigma^{-1}\mu_k-\frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + log \pi_k$$ are linear functions of \(x\). The Bayes decision boundaries are the set of \(x\) for which $$x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1}\mu_k = x^T \Sigma^{-1}\mu_l - \frac{1}{2}\mu_l^T\Sigma^{-1}\mu_l$$ (assuming \(\pi_k = \pi_l\) for all \(k, l\))
Summary:
- we need to estimate the unknown parameters \(\mu_1,...,\mu_K, \pi_1,..., \pi_K\), and \(\Sigma\);
- To assign a new observation \(X = x\), LDA plugs these estimates into the discriminant functions to obtain quantities \(\hat \delta_k(x)\), and classifies to the class for which \(\hat \delta_k(x)\) is largest.
- \(\hat \delta_k(x)\) is a linear function of \(x\); that is, the LDA decision rule depends on \(x\) only through a linear combination of its elements.
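A sketch of multivariate LDA on simulated two-class data with a shared covariance matrix, here using scikit-learn's LinearDiscriminantAnalysis rather than hand-coded estimates (all parameters below are made up):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(3)
# two classes, p = 2, a common covariance matrix (all parameters here are made up)
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
X = np.vstack([rng.multivariate_normal([0, 0], Sigma, 100),
               rng.multivariate_normal([2, 2], Sigma, 100)])
y = np.r_[np.zeros(100), np.ones(100)]

lda = LinearDiscriminantAnalysis().fit(X, y)  # estimates mu_k, pi_k and the shared Sigma
print(lda.coef_, lda.intercept_)              # the decision rule is linear in x
print(lda.predict([[1.5, 0.5]]))              # classify a new observation
print(lda.predict_proba([[1.5, 0.5]]))        # estimated posterior probabilities Pr(Y = k | X = x)
```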
Class-specific performance is also important in medicine and biology, where the terms sensitivity and specificity characterize the performance of a classifier or screening test.
LDA is trying to approximate the Bayes classifier, which has the lowest total error rate out of all classifiers.
The Bayes classifier works by assigning an observation to the class for which the posterior probability \(p_k(X)\) is greatest. Thus, the Bayes classifier, and by extension LDA, uses a threshold of 50 % for the posterior probability of default in order to assign an observation to the default class. However, this threshold can be changed to some other value, such as 20 % or 80 %, depending on which type of error is of greater concern.
Figure 4.7 illustrates the trade-off that results from modifying the threshold value for the posterior probability of default.
As the threshold is reduced, the error rate among individuals who default decreases steadily, but the error rate among the individuals who do not default increases.
![[Pasted image 20221017190149.png]]
ROC curve
The ROC (receiver operating characteristics) curve is a popular graphic for simultaneously displaying the two types of errors for all possible thresholds.
![[Pasted image 20221017190418.png]]
The overall performance of a classifier, summarized over all possible thresholds, is given by the area under the ROC curve (AUC). An ideal ROC curve will hug the top left corner, so the larger the AUC, the better the classifier.
![[Pasted image 20221017190710.png]]
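A sketch of computing an ROC curve and AUC for an LDA classifier on simulated data, and of how moving the threshold trades one error type for the other (the data and thresholds are made up):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)),   # class 0
               rng.normal(1.5, 1.0, (200, 2))])  # class 1
y = np.r_[np.zeros(200), np.ones(200)]

scores = LinearDiscriminantAnalysis().fit(X, y).predict_proba(X)[:, 1]  # posterior for class 1

fpr, tpr, thresholds = roc_curve(y, scores)  # one (FPR, TPR) pair per threshold
print(roc_auc_score(y, scores))              # area under the ROC curve (AUC)

for t in (0.5, 0.2):                         # lowering the threshold raises sensitivity but also the FPR
    pred = (scores > t).astype(int)
    print(t, pred[y == 1].mean(), pred[y == 0].mean())  # sensitivity, false positive rate
```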
4.4.3 Quadratic Discriminant Analysis
Assumptions:
- (like LDA) each class are drawn from a Gaussian distribution
- (like LDA) each class has its own mean vector
- (unlike LDA) each class has its own covariance matrix
the Bayes classifier assigns an observation \(X = x\) to the class for which $$\delta_k(x) = -\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)-\frac{1}{2}log|\Sigma_k|+log \pi_k = -\frac{1}{2}x^T\Sigma_k^{-1}x+x^T\Sigma_k^{-1}\mu_k-\frac{1}{2}\mu_k^T\Sigma_k^{-1}\mu_k-\frac{1}{2}log|\Sigma_k|+log\pi_k$$ is largest. The quantity \(x\) appears as a quadratic function in \(\delta_k(x)\), hence the name quadratic discriminant analysis.
Why would one prefer LDA to QDA, or vice versa?
The answer lies in the bias-variance trade-off. With p predictors, estimating a single covariance matrix requires estimating p(p+1)/2 parameters; QDA estimates a separate covariance matrix for each class, for a total of Kp(p+1)/2 parameters.
LDA is a much less flexible classifier than QDA, and so has substantially lower variance. But there is a trade-off: if LDA’s assumption that the K classes share a common covariance matrix is badly off, then LDA can suffer from high bias.
Roughly speaking, LDA tends to be a better bet than QDA if there are relatively few training observations and so reducing variance is crucial. In contrast, QDA is recommended if the training set is very large, so that the variance of the classifier is not a major concern, or if the assumption of a common covariance matrix for the K classes is clearly untenable.
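A sketch comparing LDA and QDA on simulated data whose two classes really do have different covariance matrices (all parameters below are made up); with plenty of training data QDA should come out ahead here:

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(5)
cov0 = [[1.0, 0.7], [0.7, 1.0]]    # the two class covariance matrices genuinely differ
cov1 = [[1.0, -0.7], [-0.7, 1.0]]  # (made-up parameters)

def simulate(n_per_class):
    X = np.vstack([rng.multivariate_normal([0, 0], cov0, n_per_class),
                   rng.multivariate_normal([1, 1], cov1, n_per_class)])
    y = np.r_[np.zeros(n_per_class), np.ones(n_per_class)]
    return X, y

X_train, y_train = simulate(500)
X_test, y_test = simulate(2000)

for model in (LinearDiscriminantAnalysis(), QuadraticDiscriminantAnalysis()):
    model.fit(X_train, y_train)
    print(type(model).__name__, (model.predict(X_test) == y_test).mean())
# with unequal covariances and plenty of data, QDA typically wins here;
# with very few training observations, LDA's lower variance can make it the better bet
```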
4.4.4 Naive Bayes
Assumption:
- Within the \(k\)th class, the \(p\) predictors are independent. That is, instead of assuming that \(f_k\) belongs to a particular family of distributions (e.g. multivariate normal), we assume that for \(k = 1,...,K\), \(f_k(x) = f_{k1}(x_1) × f_{k2}(x_2) × ··· × f_{kp}(x_p)\), where \(f_{kj}\) is the density function of the \(j\)th predictor among observations in the \(k\)th class.
Why is this assumption so powerful?
Essentially, estimating a p-dimensional density function is challenging because we must consider not only the marginal distribution of each predictor (that is, the distribution of each predictor on its own) but also the joint distribution of the predictors (that is, the association between the different predictors).
In practice, the naive Bayes assumption often leads to pretty decent results, especially in settings where n is not large enough relative to p for us to effectively estimate the joint distribution of the predictors within each class.
Essentially, the naive Bayes assumption introduces some bias, but reduces variance, leading to a classifier that works quite well in practice as a result of the bias-variance trade-off.
To estimate the one-dimensional density function \(f_{kj}\) using training data \(x_{1j} ,...,x_{nj}\) , we have a few options:
- If \(X_j\) is quantitative, then we can assume that \(X_j |Y = k \sim N(\mu_{jk}, \sigma^2_{jk})\). While this may sound a bit like QDA, there is one key difference, in that here we are assuming that the predictors are independent; this amounts to QDA with an additional assumption that the class-specific covariance matrix is diagonal.
- If \(X_j\) is quantitative, then another option is to use a non-parametric estimate for \(f_{kj}\) . A very simple way to do this is by making a histogram for the observations of the \(j\)th predictor within each class. Then we can estimate \(f_{kj}(x_j)\) as the fraction of the training observations in the kth class that belong to the same histogram bin as \(x_j\) . Alternatively, we can use a kernel density estimator, which is essentially a smoothed version of a histogram.
- If \(X_j\) is qualitative, then we can simply count the proportion of training observations for the jth predictor corresponding to each class.
We expect to see a greater pay-off to using naive Bayes relative to LDA or QDA in instances where p is larger or n is smaller, so that reducing the variance is very important.
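A sketch of the Gaussian variant of naive Bayes on simulated data with independent predictors within each class, using scikit-learn's GaussianNB, which corresponds to the first (quantitative, normal) option above:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(6)
# p = 3 quantitative predictors, generated independently within each class (made-up parameters)
X = np.vstack([rng.normal([0, 0, 0], [1, 2, 1], (300, 3)),
               rng.normal([1, 1, -1], [1, 1, 2], (300, 3))])
y = np.r_[np.zeros(300), np.ones(300)]

nb = GaussianNB().fit(X, y)   # assumes X_j | Y = k ~ N(mu_jk, sigma2_jk), independent within a class
print(nb.theta_)              # estimated class-specific means mu_jk
print(nb.predict_proba(X[:2]))
print(nb.score(X, y))         # training accuracy (a proper comparison would use held-out data)
```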
4.5 A Comparison of Classification Methods
4.5.1 An Analytical Comparison
Equivalently, we can set K as the baseline class and assign an observation to the class that maximizes $$log\Big(\frac{Pr(Y=k|X=x)}{Pr(Y=K|X=x)}\Big)$$
- For LDA, $$log\Big(\frac{Pr(Y=k|X=x)}{Pr(Y=K|X=x)}\Big)=a_k+\sum_{j=1}^pb_{kj}x_j$$ where \(a_k = log(\frac{\pi_k}{\pi_K}) - \frac{1}{2}(\mu_k+\mu_K)^T\Sigma^{-1}(\mu_k - \mu_K)\) and \(b_{kj}\) is the \(j\)th component of \(\Sigma^{-1}(\mu_k-\mu_K)\).
- For QDA, $$log\Big(\frac{Pr(Y=k|X=x)}{Pr(Y=K|X=x)}\Big)=a_k+\sum_{j=1}^pb_{kj}x_j+\sum_{j=1}^p\sum_{l=1}^p c_{kjl}x_jx_l$$ where \(a_k\), \(b_{kj}\), and \(c_{kjl}\) are functions of \(\pi_k\), \(\pi_K\), \(\mu_k\), \(\mu_K\), \(\Sigma_k\) and \(\Sigma_K\).
- For naive Bayes, $$log\Big(\frac{Pr(Y=k|X=x)}{Pr(Y=K|X=x)}\Big) = a_k + \sum_{j=1}^pg_{kj}(x_j)$$ where \(a_k = log(\frac{\pi_k}{\pi_K})\) and \(g_{kj}(x_j)= log\big(\frac{f_{kj}(x_j)}{f_{Kj}(x_j)}\big)\). Hence, the right-hand side takes the form of a generalized additive model (Chapter 7).
Summary1:
- LDA is a special case of QDA with \(c_{kjl} = 0\) for all \(j = 1,...,p\), \(l = 1,...,p\), and \(k = 1,...,K\). (Of course, this is not surprising, since LDA is simply a restricted version of QDA with \(\Sigma_1 = ··· = \Sigma_K = \Sigma.\))
- Any classifier with a linear decision boundary is a special case of naive Bayes with \(g_{kj}(x_j) = b_{kj}x_j\) . In particular, this means that LDA is a special case of naive Bayes!
- If we model \(f_{kj} (x_j )\) in the naive Bayes classifier using a one-dimensional Gaussian distribution \(N(\mu_{kj} , \sigma^2_j )\), then we end up with \(g_{kj} (x_j ) = b_{kj}x_j\) where \(b_{kj} = (\mu_{kj} - \mu_{Kj} )/\sigma^2_j\) . In this case, naive Bayes is actually a special case of LDA with \(\Sigma\) restricted to be a diagonal matrix with jth diagonal element equal to \(\sigma^2_j\) .
- Neither QDA nor naive Bayes is a special case of the other. Naive Bayes can produce a more flexible fit, since any choice can be made for \(g_{kj}(x_j)\). However, it is restricted to a purely additive fit: functions of different predictors are added but never multiplied. By contrast, QDA includes multiplicative terms of the form \(c_{kjl}x_jx_l\). Therefore, QDA has the potential to be more accurate in settings where interactions among the predictors are important in discriminating between classes.
- For multinomial logistic regression, $$log\Big(\frac{Pr(Y=k|X=x)}{Pr(Y=K|X=x)}\Big)=\beta_{k0}+\sum_{l=1}^p\beta_{kl}x_l$$ This is identical in form to the LDA expression. In LDA, the coefficients are functions of the estimates of \(\pi_k\), \(\mu_k\), and \(\Sigma\) obtained by assuming that \(X_1,...,X_p\) follow a normal distribution within each class, whereas in logistic regression the coefficients are chosen to maximize the likelihood function. Thus, we expect LDA to outperform logistic regression when the normality assumption (approximately) holds, and we expect logistic regression to perform better when it does not.
- For K-nearest neighbors (KNN), in order to make a prediction for an observation \(X = x\), the training observations that are closest to \(x\) are identified. Then \(X\) is assigned to the class to which the plurality of these observations belong. Hence KNN is a completely non-parametric approach: **no assumptions are made about the shape of the decision boundary**.
Summary2:
- Because KNN is completely non-parametric, we can expect this approach to dominate LDA and logistic regression when the decision boundary is highly non-linear, provided that \(n\) is very large and \(p\) is small.
- KNN requires a lot of observations relative to the number of predictors—that is, \(n\) much larger than \(p\).
- KNN is non-parametric, and thus tends to reduce the bias while incurring a lot of variance.
- In settings where the decision boundary is non-linear but \(n\) is only modest, or \(p\) is not very small, then QDA may be preferred to KNN.
- Unlike logistic regression, KNN does not tell us which predictors are important.
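A sketch of KNN on simulated data with a highly non-linear (circular) decision boundary, illustrating that the level of smoothness (the choice of K) must be chosen carefully (the data below are made up):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(7)
# a highly non-linear boundary: the class depends on whether the point lies inside a circle
X = rng.uniform(-2, 2, (1000, 2))
y = ((X ** 2).sum(axis=1) < 1.5).astype(int)

X_train, y_train, X_test, y_test = X[:800], y[:800], X[800:], y[800:]
for k in (1, 10, 100):  # the level of smoothness (the choice of K) matters
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, knn.score(X_test, y_test))  # test accuracy for each choice of K
```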
4.5.2 An Empirical Comparison
- When the true decision boundaries are linear, then the LDA and logistic regression approaches will tend to perform well.
- When the boundaries are moderately non-linear, QDA or naive Bayes may give better results.
- for much more complicated decision boundaries, a non-parametric approach such as KNN can be superior. But the level of smoothness for a non-parametric approach must be chosen carefully.
- Finally, recall from Chapter 3 that in the regression setting we can accommodate a non-linear relationship between the predictors and the response by performing regression using transformations of the predictors. A similar approach could be taken in the classification setting. For instance, we could create a more flexible version of logistic regression by including \(X^2, X^3\), and even \(X^4\) as predictors. If we added all possible quadratic terms and cross-products to LDA, the form of the model would be the same as the QDA model, although the parameter estimates would be different.
4.6 Generalized Linear Models
dataset: Bikeshare
4.6.2 Poisson Regression on the Bikeshare Data
Poisson Distribution: \(Pr(Y=k)=\frac{e^{-\lambda}\lambda^k}{k!}\) for \(k = 0, 1, 2, ...\), with mean and variance both equal to \(\lambda\).
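A short sketch of Poisson regression on simulated count data (not the Bikeshare data; the coefficients below are made up), using statsmodels' GLM with a Poisson family, where the log of the mean is modeled as linear in the predictor:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
# simulated count data: log of the mean is linear in the predictor
# (made-up coefficients, not the Bikeshare fit)
x = rng.uniform(0, 1, 500)
lam = np.exp(0.5 + 1.2 * x)    # Poisson mean lambda(x)
y = rng.poisson(lam)

X = sm.add_constant(x)         # adds the intercept column
poisson_fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(poisson_fit.params)      # estimates close to the generating values (0.5, 1.2)
```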