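Principal component analysis in R: prcomp vs princomp

The simplest functions for principal component analysis, prcomp and princomp, both ship with R (in the stats package), so no extra packages are required.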

A good introduction: http://strata.uga.edu/software/pdf/pcaTutorial.pdf

Another good introduction: http://gastonsanchez.wordpress.com/2012/06/17/principal-components-analysis-in-r-part-1/

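The results of a principal component analysis comprise the set of eigenvalues, a table of PC scores, and a table of loadings (the correlations between the variables and the PCs).

The eigenvalues carry information about the variability in the data, the scores describe the structure of the observations, and the loadings give a rough sense of how the variables relate to one another and to the PCs.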
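As a minimal sketch of where these three pieces live in a prcomp result, the snippet below uses the built-in USArrests data. One caveat: R calls the eigenvectors "loadings" (the rotation element), while the variable-to-PC correlations are those eigenvectors multiplied by the PC standard deviations; that identity holds here because the data are standardized. The variable names are chosen for illustration only.

```r
# Sketch: the three outputs of a PCA, using the built-in USArrests data.
pca <- prcomp(USArrests, scale. = TRUE)

eigenvalues <- pca$sdev^2      # variance carried by each PC
scores      <- pca$x           # observations expressed in PC coordinates
eigenvecs   <- pca$rotation    # eigenvectors (what R calls "loadings")

# For standardized data, the variable-PC correlations are the eigenvectors
# scaled column by column by the PC standard deviations.
var_pc_cor <- sweep(eigenvecs, 2, pca$sdev, "*")

print(round(eigenvalues, 3))
print(head(round(scores, 3), 3))
print(round(var_pc_cor, 3))
```

The rescaled matrix can be checked against cor(USArrests, pca$x), which computes the same correlations directly.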

Description:

prcomp : Performs a principal components analysis on the given data matrix and returns the results as an object of class prcomp.

princomp : Performs a principal components analysis on the given numeric data matrix and returns the results as an object of class princomp.
Usage:

The examples below use the built-in dataset USArrests.

str(USArrests)

'data.frame': 50 obs. of 4 variables:

$ Murder : num 13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...

$ Assault : int 236 263 294 190 276 204 110 238 335 211 ...

$ UrbanPop: int 58 48 80 50 91 78 77 72 80 60 ...

$ Rape : num 21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...

prcomp :

prcomp(x, ...)

prcomp(formula, data = NULL, subset, na.action, ...)

prcomp(x, retx = TRUE, center = TRUE, scale. = FALSE, tol = NULL, ...)

prcomp(USArrests)  # inappropriate: the variables are not scaled

prcomp(USArrests, scale. = TRUE)  # data matrix passed directly

prcomp(~ Murder + Assault + Rape, data = USArrests, scale. = TRUE)  # formula interface

plot(prcomp(USArrests))

summary(prcomp(USArrests, scale. = TRUE))

biplot(prcomp(USArrests, scale. = TRUE))
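To see why the unscaled call is marked inappropriate, the following sketch compares the share of variance per component with and without scaling. It relies only on prcomp and the USArrests data; the names prop_raw and prop_scaled are illustrative.

```r
# Column variances differ by orders of magnitude, so Assault dominates
# an unscaled PCA.
print(round(sapply(USArrests, var), 1))

pca_raw    <- prcomp(USArrests)                 # covariance-based
pca_scaled <- prcomp(USArrests, scale. = TRUE)  # correlation-based

# Proportion of total variance captured by each component
prop_raw    <- pca_raw$sdev^2 / sum(pca_raw$sdev^2)
prop_scaled <- pca_scaled$sdev^2 / sum(pca_scaled$sdev^2)
print(round(rbind(raw = prop_raw, scaled = prop_scaled), 3))

# Variable with the largest weight on the first unscaled component
print(names(which.max(abs(pca_raw$rotation[, 1]))))
```

Without scaling, the first component is essentially the Assault variable; with scaling, the variance is spread more evenly across components.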

princomp :

princomp(x, ...)  # same as prcomp

princomp(formula, data = NULL, subset, na.action, ...)  # also the same

princomp(x, cor = FALSE, scores = TRUE, covmat = NULL, subset = rep(TRUE,nrow(as.matrix(x))), ...)  # the arguments differ

princomp(USArrests, cor = TRUE)  # roughly equivalent to prcomp(USArrests, scale. = TRUE) but not identical: princomp uses divisor N, so the standard deviations used to scale the variables differ by a factor of sqrt(49/50)

summary(pc.cr <- princomp(USArrests, cor = TRUE))

loadings(pc.cr)  # matrix whose columns are the eigenvectors; corresponds to rotation in prcomp

plot(pc.cr) # shows a screeplot.

biplot(pc.cr)
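The sqrt(49/50) factor can be checked numerically. The sketch below reflects how the two functions appear to behave on USArrests: in the covariance case the component standard deviations differ by that factor, while in the correlation case the component standard deviations agree and the factor shows up in the per-variable scaling and therefore in the scores. The name shrink is illustrative.

```r
n <- nrow(USArrests)             # 50 observations
shrink <- sqrt((n - 1) / n)      # sqrt(49/50)

# Covariance-based PCA: the component standard deviations differ.
sd_prcomp   <- prcomp(USArrests)$sdev
sd_princomp <- unname(princomp(USArrests)$sdev)
print(all.equal(sd_princomp, sd_prcomp * shrink))

# Correlation-based PCA: the component standard deviations agree ...
pc_svd <- prcomp(USArrests, scale. = TRUE)
pc_eig <- princomp(USArrests, cor = TRUE)
print(all.equal(unname(pc_eig$sdev), pc_svd$sdev))

# ... but the scaling applied to each variable, and hence the scores, differ.
print(all.equal(unname(pc_eig$scale), unname(pc_svd$scale) * shrink))
print(all.equal(abs(unname(pc_eig$scores)), abs(unname(pc_svd$x)) / shrink))
```

The scores are compared in absolute value because the sign of each component is arbitrary and may differ between the two functions.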

Return values:
prcomp :

sdev

Standard deviations

the standard deviations of the principal components (i.e., the square roots of the eigenvalues of the covariance/correlation matrix, though the calculation is actually done with the singular values of the data matrix).

rotation

Eigenvector matrix

the matrix of variable loadings (i.e., a matrix whose columns contain the eigenvectors). The function princomp returns this in the element loadings.

x

If retx is TRUE, the rotated data are returned, i.e. the centred (and scaled if requested) data multiplied by the rotation matrix. Hence, cov(x) is the diagonal matrix diag(sdev^2). For the formula method, napredict() is applied to handle the treatment of values omitted by the na.action.

center, scale

the centering and scaling used, or FALSE.
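PCA does not require normally distributed data. It does work on centered data, though, and when the variables are measured on different scales they are usually standardized first (mean = 0, variance = 1) so that no single variable dominates the result.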
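The relationship between x, rotation, center, scale and sdev can be verified by hand. This is a small sketch on USArrests, with scores_manual as an illustrative name:

```r
pca <- prcomp(USArrests, scale. = TRUE)

# Rebuild the scores from the stored centering and scaling vectors.
z <- scale(USArrests, center = pca$center, scale = pca$scale)
scores_manual <- z %*% pca$rotation
print(all.equal(scores_manual, pca$x, check.attributes = FALSE))

# The scores are uncorrelated, and their variances are sdev^2.
print(all.equal(cov(pca$x), diag(pca$sdev^2), check.attributes = FALSE))
```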

princomp :

sdev

Standard deviations

the standard deviations of the principal components.

loadings

Eigenvector matrix

the matrix of variable loadings (i.e., a matrix whose columns contain the eigenvectors). This is of class "loadings": see loadings for its print method.

center

the means that were subtracted.

scale

the scalings applied to each variable.

n.obs

the number of observations.

scores

if scores = TRUE, the scores of the supplied data on the principal components. These are non-null only if x was supplied, and if covmat was also supplied if it was a covariance list. For the formula method, napredict() is applied to handle the treatment of values omitted by the na.action.

call

the matched call.

na.action

If relevant.

Details:
prcomp :
The calculation is done by a singular value decomposition of the (centered and possibly scaled) data matrix, not by using eigen on the covariance matrix. This is generally the preferred method for numerical accuracy.

The print method for these objects prints the results in a nice format and the plot method produces a scree plot.

Unlike princomp, variances are computed with the usual divisor N - 1.

Note that scale = TRUE cannot be used if there are zero or constant (for center = TRUE) variables.
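The two computational routes can be compared directly. The sketch below reproduces prcomp's standard deviations from svd on the standardized data and from eigen on the correlation matrix, then shows the error raised for a constant column. The column name const and the other variable names are made up for illustration.

```r
z <- scale(USArrests)            # centered, unit-variance columns
n <- nrow(z)

sv <- svd(z)                                   # SVD route, as in prcomp
ei <- eigen(cor(USArrests), symmetric = TRUE)  # eigen route, as in princomp

sdev_svd   <- sv$d / sqrt(n - 1)  # singular values -> PC standard deviations
sdev_eigen <- sqrt(ei$values)

pca <- prcomp(USArrests, scale. = TRUE)
print(all.equal(sdev_svd, pca$sdev))
print(all.equal(sdev_eigen, pca$sdev))

# The right singular vectors match the rotation matrix up to sign.
print(all.equal(abs(sv$v), abs(unname(pca$rotation))))

# A constant column cannot be rescaled to unit variance, so this call errors;
# the message is captured instead of stopping the script.
with_const <- cbind(USArrests, const = 1)
const_result <- tryCatch(prcomp(with_const, scale. = TRUE),
                         error = function(e) conditionMessage(e))
print(const_result)
```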
princomp :

princomp is a generic function with "formula" and "default" methods.

The calculation is done using eigen on the correlation or covariance matrix, as determined by cor. This is done for compatibility with the S-PLUS result. A preferred method of calculation is to use svd on x, as is done in prcomp.

Note that the default calculation uses divisor N for the covariance matrix.

The print method for these objects prints the results in a nice format and the plot method produces a scree plot (screeplot). There is also a biplot method.

If x is a formula then the standard NA-handling is applied to the scores (if requested): see napredict.

princomp only handles so-called R-mode PCA, that is feature extraction of variables. If a data matrix is supplied (possibly via a formula) it is required that there are at least as many units as variables. For Q-mode PCA use prcomp.
Difference between R-mode and Q-mode:
R-mode PCA examines the correlations or covariances among variables.
Q-mode focusses on the correlations or covariances among samples.

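Multivariate analyses, such as computing correlation coefficients, are usually carried out on the data columns (features, or Questions), while each row is a sample unit, i.e. a Respondent (R-way analysis).

Sometimes the columns (Questions) are treated as the sample units instead, which gives Q analysis. The difference probably lies mainly in how the data are standardized and how the results are interpreted.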
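A practical consequence of the "at least as many units as variables" requirement can be sketched with a small simulated matrix that has more columns than rows. The matrix wide is made-up data, not from the original post.

```r
set.seed(1)
wide <- matrix(rnorm(5 * 8), nrow = 5, ncol = 8)  # 5 units, 8 variables

# princomp refuses this shape; capture the error message instead of stopping.
princomp_result <- tryCatch(princomp(wide),
                            error = function(e) conditionMessage(e))
print(princomp_result)

# prcomp accepts it. After centering, at most n - 1 = 4 components can
# carry variance, so the last standard deviation is numerically zero.
pca_wide <- prcomp(wide)
print(round(pca_wide$sdev, 3))
```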
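Using PCA results in regression analysis
Reference: http://sites.stat.psu.edu/~ajw13/stat505/fa06/16_princomp/10_princomp_reg_example.htm

A major problem in regression is multicollinearity, which distorts the results. Regression on principal components (PCA regression) has been proposed as a remedy.
The original data have high variance inflation factors (VIF; in the last column, anything above 4 counts as fairly large).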

                            Parameter Estimates

                Parameter    Standard                         Variance
Variable    DF    Estimate       Error   t Value   Pr > |t|   Inflation

Intercept    1   134.96790   237.81430      0.57     0.5778           0
occup        1    -1.28377     0.80469     -1.60     0.1291     2.16276
checkin      1     1.80351     0.51624      3.49     0.0028     4.52397
hours        1     0.66915     1.84640      0.36     0.7215     1.35735
common       1   -21.42263    10.17160     -2.11     0.0504     2.33264
wings        1     5.61923    14.74609      0.38     0.7079     3.65318
cap          1   -14.48025     4.22018     -3.43     0.0032    37.12912
rooms        1    29.32475     6.36590      4.61     0.0003    63.70809

Eigenvalues, which also give the proportion of the sample variance that each component explains

Eigenvalues of the Correlation Matrix

             Eigenvalue Difference    Proportion    Cumulative

        1    4.64302239    3.90281147        0.6633        0.6633
        2    0.74021092    0.03390878        0.1057        0.7690
        3    0.70630215    0.25669541        0.1009        0.8699
        4    0.44960674    0.15020062        0.0642        0.9342
        5    0.29940611    0.14798282        0.0428        0.9769
        6    0.15142329    0.14139489        0.0216        0.9986
        7    0.01002840                      0.0014        1.0000
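The Proportion and Cumulative columns follow directly from the eigenvalues: each proportion is an eigenvalue divided by the sum of all eigenvalues, which for a correlation matrix equals the number of variables (7 here). A short sketch that recomputes them from the values in the table:

```r
eigenvalues <- c(4.64302239, 0.74021092, 0.70630215, 0.44960674,
                 0.29940611, 0.15142329, 0.01002840)

proportion <- eigenvalues / sum(eigenvalues)  # share of variance per component
cumulative <- cumsum(proportion)              # running total

print(round(data.frame(eigenvalue = eigenvalues, proportion, cumulative), 4))
```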

The VIFs of the principal components are all exactly 1

Parameter Estimates

                                         Variance
                     Variable    DF     Inflation

                     Intercept    1             0
                     Prin1        1       1.00000
                     Prin2        1       1.00000
                     Prin3        1       1.00000
                     Prin4        1       1.00000
                     Prin5        1       1.00000
                     Prin6        1       1.00000
                     Prin7        1       1.00000
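The same pattern can be reproduced in R. The data behind the tables above are not included in this post, so the sketch below substitutes the built-in mtcars data as a stand-in; the choice of predictors, the helper vif and the two-component regression are illustrative assumptions, not the referenced analysis. The helper uses the fact that the VIF of a predictor is the corresponding diagonal entry of the inverse correlation matrix.

```r
# Stand-in data: mtcars, not the data from the referenced example.
predictors <- mtcars[, c("cyl", "disp", "hp", "wt")]

# VIF_j = 1 / (1 - R_j^2), which equals the j-th diagonal entry of the
# inverse correlation matrix of the predictors.
vif <- function(x) diag(solve(cor(x)))

vif_raw <- vif(predictors)
print(round(vif_raw, 2))

# Replace the predictors with their principal component scores.
pca <- prcomp(predictors, scale. = TRUE)
pc_scores <- as.data.frame(pca$x)
vif_pc <- vif(pc_scores)
print(round(vif_pc, 2))

# Regress the response on the leading components only.
pcr_data <- data.frame(mpg = mtcars$mpg, pc_scores)
fit <- lm(mpg ~ PC1 + PC2, data = pcr_data)
print(round(coef(fit), 3))
```

Because the component scores are mutually uncorrelated, their VIFs are 1 regardless of how collinear the original predictors were; how many components to keep in the regression remains a modelling decision.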

References

http://blog.csdn.net/youliye/article/details/16892723

posted @ 2017-05-31 08:45  ywliao