Chapter 05: Advanced data management (Part 1)

I. Mathematical, statistical, and character-handling functions in R

Example 01: A practical problem

A group of students took exams in math, science, and English; their scores are given in the following table:

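Tasks: combine the exam scores into a single performance indicator for each student;

assign an A to the top 20% of students, a B to the next 20%, and so on;

sort the students alphabetically by name.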

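Problems: the scores on the three exams are not directly comparable; they must be converted to comparable units before they can be compared or combined.

To assign grades A, B, and so on, each student's score has to be expressed as a percentile rank.

The names are stored in a single field, which makes sorting the students harder; the names therefore need to be split into first name and last name.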

1. Numerical and character functions (numerical = mathematical + statistical + probability)

(1) Mathematical functions

Example 02: Common mathematical functions

> abs(-4)
[1] 4
> sqrt(25)
[1] 5
> ceiling(3.475)
[1] 4
> floor(5.99)
[1] 5
> trunc(5.99)
[1] 5
> round(3.475,digits=2)
[1] 3.48
> signif(3.475,digits=2)
[1] 3.5
> cos(60)
[1] -0.952413
> sin(60)
[1] -0.3048106
> tan(60)
[1] 0.3200404
> acos(-0.952413)
[1] 2.831853
> asin(-0.3048106)
[1] -0.3097396
> atan(0.3200404)
[1] 0.3097396
> cosh(2)
[1] 3.762196
> sinh(2)
[1] 3.62686
> tanh(2)
[1] 0.9640276
> acosh(3.762196)
[1] 2
> asinh(3.62686)
[1] 2
> atanh(0.9640276)
[1] 2
> log10(10)
[1] 1
> log(10)
[1] 2.302585
> exp(2.3026)
[1] 10.00015
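
Note that R's trigonometric functions take angles in radians, so cos(60) above is the cosine of 60 radians, not of 60 degrees. A minimal sketch of converting degrees first; deg2rad is a hypothetical helper defined here for illustration, not a base R function:

```r
# Hypothetical helper: convert an angle in degrees to radians
deg2rad <- function(deg) deg * pi / 180

cos_60_deg <- cos(deg2rad(60))   # cosine of 60 degrees
sin_30_deg <- sin(deg2rad(30))   # sine of 30 degrees

print(cos_60_deg)
print(sin_30_deg)
```

Both values come out at approximately 0.5, up to floating-point error.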

(2) Statistical functions

Example 03: Common statistical functions

> mean(c(1,2,3,4))
[1] 2.5
> median(c(1,2,3,4))
[1] 2.5
> sd(c(1,2,3,4))
[1] 1.290994
> var(c(1,2,3,4))
[1] 1.666667
> mad(c(1,2,3,4))
[1] 1.4826
> x<-c(1,2,3,4)
> range(x)
[1] 1 4
> diff(range(x))
[1] 3
> sum(1,2,3,4)
[1] 10
> x<-c(1,5,23,29)
> diff(x)
[1]  4 18  6
> min(c(1,2,3,4))
[1] 1
> max(c(1,2,3,4))
[1] 4

Example 04: Calculating the mean and standard deviation

> x<-c(1,2,3,4,5,6,7,8)
> 
> mean(x)
[1] 4.5
> sd(x)
[1] 2.44949
> 
> n<-length(x)
> meanx<-sum(x)/n
> css<-sum((x-meanx)^2)
> sdx<-sqrt(css/(n-1))
> meanx
[1] 4.5
> sdx
[1] 2.44949

A) To standardize each column to an arbitrary mean and standard deviation, use code of the following form:

newdata<-scale(mydata)*SD+M

where M is the desired mean and SD is the desired standard deviation.

Note: applying scale() to non-numeric columns produces an error.
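
As a rough illustration of the formula above, the sketch below rescales a small made-up data frame to mean 50 and standard deviation 10, then shows one way to keep scale() away from a non-numeric column. The data and the variable names (mydata, mixed, num_cols) are assumptions for this example only.

```r
# Made-up data for illustration
mydata <- data.frame(a = c(1, 2, 3, 4, 5), b = c(10, 20, 30, 40, 50))

M <- 50    # desired mean
SD <- 10   # desired standard deviation
newdata <- scale(mydata) * SD + M   # the result is a numeric matrix

col_means <- colMeans(newdata)
col_sds <- apply(newdata, 2, sd)
print(col_means)
print(col_sds)

# With a character column present, select the numeric columns first
mixed <- data.frame(id = c("s1", "s2", "s3"), x = c(1, 2, 3),
                    stringsAsFactors = FALSE)
num_cols <- sapply(mixed, is.numeric)
scaled_x <- scale(mixed[, num_cols, drop = FALSE])
```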

B) To standardize one specific column rather than an entire matrix or data frame, use code such as:

newdata<-transform(mydata,myvar=scale(myvar)*10+50)

· This standardizes myvar to a variable with a mean of 50 and a standard deviation of 10.
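
A small sketch of the single-column case, using made-up data. The as.numeric() wrapper is an addition not present in the line above; it is used here only so that the new column is a plain numeric vector rather than the one-column matrix that scale() returns.

```r
# Made-up data for illustration
mydata <- data.frame(id = 1:5, myvar = c(2, 4, 6, 8, 10))

# Standardize only myvar; the other columns are left untouched
newdata <- transform(mydata, myvar = as.numeric(scale(myvar)) * 10 + 50)

print(newdata)
print(mean(newdata$myvar))
print(sd(newdata$myvar))
```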

(3) Probability functions

In R, probability functions take the form:

[dpqr]distribution_abbreviation()

· d = density, p = distribution function, q = quantile function, r = random generation (random deviates)
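
To make the naming scheme concrete, here is a short sketch using the normal distribution, whose abbreviation is norm. The specific values passed in (0, 1.96, 0.975, mean 50, sd 10) are arbitrary choices for illustration.

```r
d <- dnorm(0)        # density of the standard normal at 0
p <- pnorm(1.96)     # probability of a value at or below 1.96
q <- qnorm(0.975)    # quantile: the value below which 97.5% of the mass lies

set.seed(1234)       # make the random draw reproducible
r <- rnorm(5, mean = 50, sd = 10)   # five random deviates

print(c(d = d, p = p, q = q))
print(r)
```

pnorm() and qnorm() are inverses of each other, so pnorm(qnorm(0.975)) returns 0.975.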

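Common probability distributions are shown in the figure below:

Example 05: Generating random numbers from a uniform distribution

· set.seed(): sets the seed explicitly, so that the same pseudo-random results can be reproduced on repeated runs;

· runif(): generates pseudo-random numbers from a uniform distribution on the interval from 0 to 1.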

> runif(5)
[1] 0.1476982 0.4830784 0.2800061 0.9300127 0.7477501
> runif(5)
[1] 0.6784092 0.9378730 0.1152914 0.5070802 0.1728310
> set.seed(1234)
> runif(5)
[1] 0.1137034 0.6222994 0.6092747 0.6233794 0.8609154
> set.seed(1234)
> runif(5)
[1] 0.1137034 0.6222994 0.6092747 0.6233794 0.8609154
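
The behavior in the transcript can be checked programmatically. This is a minimal sketch; the variable names are arbitrary.

```r
set.seed(1234)
first <- runif(5)

set.seed(1234)       # resetting the seed restarts the same sequence
second <- runif(5)

third <- runif(5)    # no reset: the sequence simply continues

same_after_reseed <- identical(first, second)
same_without_reseed <- identical(second, third)
print(same_after_reseed)
print(same_without_reseed)
```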

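Example 06: Generating data from a multivariate normal distribution

mvrnorm(): provided by the MASS package; draws data from a multivariate normal distribution with a given mean vector and variance-covariance matrix.

mvrnorm(n,mean,sigma)

· n: the desired sample size;

· mean: the vector of means;

· sigma: the variance-covariance matrix.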

> library(MASS)
> options(digits=3)
> set.seed(1234)
> 
> mean<-c(230.7,146.7,3.6)
> sigma<-matrix(c(15360.8,6721.2,-47.1,6721.2,4700.9,-16.5,-47.1,-16.5,0.3),nrow=3,ncol=3)
> 
> mydata<-mvrnorm(500,mean,sigma)
> mydata<-as.data.frame(mydata)
> names(mydata)<-c("y","x1","x2")
> 
> dim(mydata)
[1] 500   3
> head(mydata,n=10)
       y    x1   x2
1  434.9 286.6 2.57
2   61.4  69.1 4.32
3  142.0 108.5 3.37
4  182.1  72.8 3.52
5  165.1 185.1 3.15
6  167.2 160.3 3.27
7  258.2 233.6 3.41
8   84.3  54.2 4.10
9  158.5 130.7 3.94
10 131.8 176.8 4.43
>

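A) set.seed(1234): sets the random number seed;

B) the mean vector is specified as mean and the variance-covariance matrix as sigma;

C) 500 pseudo-random observations are generated and stored in mydata;

D) for convenience, the result is converted from a matrix to a data frame with as.data.frame() and the variables are named with names(); dim() then confirms that there are 500 observations of 3 variables, and head() prints the first 10 observations.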
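
In MASS, the arguments of mvrnorm() are named n, mu, and Sigma. The sketch below passes them by name and, as an addition not used in the example above, sets empirical = TRUE so that the sample mean and covariance match the requested values exactly rather than only approximately. This makes it easy to verify that the inputs were specified as intended.

```r
library(MASS)  # provides mvrnorm()

mu <- c(230.7, 146.7, 3.6)
Sigma <- matrix(c(15360.8, 6721.2, -47.1,
                   6721.2, 4700.9, -16.5,
                    -47.1,  -16.5,   0.3), nrow = 3, ncol = 3)

set.seed(1234)
# empirical = TRUE: mu and Sigma describe the sample, not the population
sim <- mvrnorm(n = 500, mu = mu, Sigma = Sigma, empirical = TRUE)

sample_means <- colMeans(sim)
sample_cov <- cov(sim)
print(dim(sim))
print(round(sample_means, 1))
```

With the default empirical = FALSE, as in the transcript, the sample moments vary around mu and Sigma from draw to draw.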

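(4) Character functions

Mathematical and statistical functions operate on numerical data, whereas character functions operate on textual data.

Example 07: Small examples of character functions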

> x<-c("ab","cde","fghij")
> length(x)
[1] 3
> nchar(x)
[1] 2 3 5
> nchar(x[3])
[1] 5
> 
> x<-"abcdef"
> substr(x,2,4)
[1] "bcd"
> substr(x,2,4)<-"22222"
> x
[1] "a222ef"
> 
> grep("A",c("b","A","c"),fixed=TRUE)
[1] 2
> sub("\\s",".","Hello There")
[1] "Hello.There"
>
> y<-strsplit("abc","")
> y
[[1]]
[1] "a" "b" "c"
 
> paste("x",1:3,sep="")
[1] "x1" "x2" "x3"
> paste("x",1:3,sep="M")
[1] "xM1" "xM2" "xM3"
> paste("Today is",date())
[1] "Today is Thu Aug 01 08:03:33 2013"
> 
> toupper("abc")
[1] "ABC"
> 
> tolower("ABC")
[1] "abc"
>
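
Two details of the pattern functions above are easy to miss: sub() replaces only the first match (gsub() replaces all of them), and fixed=TRUE makes the pattern a literal string instead of a regular expression. A brief sketch with made-up strings:

```r
s <- "Hello There World"
first_only <- sub("\\s", ".", s)     # replaces the first whitespace only
all_matches <- gsub("\\s", ".", s)   # replaces every whitespace

words <- c("abc", "a.c", "xyz")
hits_regex <- grep("a.c", words)                # "." matches any character
hits_fixed <- grep("a.c", words, fixed = TRUE)  # "." is a literal dot

print(first_only)
print(all_matches)
print(hits_regex)
print(hits_fixed)
```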

(5) Other functions

Example 08: Examples of other functions

> x<-c(2,5,6,9)
> length(x)
[1] 4
> 
> indices<-seq(1,10,2)
> indices
[1] 1 3 5 7 9
> 
> y<-rep(1:3,2)
> y
[1] 1 2 3 1 2 3
> 
> firstname<-c("Jane")
> cat("Hello",firstname,"\n")
Hello Jane 
> 
> name<-"Bob"
> cat("Hello",name,"\b.\n","Isn\'t R","\t","GREAT?\n")
Hello Bob.
 Isn't R 	 GREAT?

Example 09: Applying a function with apply()

> mydata<-matrix(rnorm(30),nrow=6)
> mydata
            [,1]        [,2]       [,3]        [,4]        [,5]
[1,] -1.70058013 -0.04791481  1.8256761  0.18096362 0.747712227
[2,]  0.61948876  0.08585212  1.0945512  0.37603502 0.002217611
[3,]  0.05312635 -0.22062291 -0.4759186  0.17347971 0.764638370
[4,] -0.72229860  0.58144854 -1.1961967 -0.07836057 0.342189978
[5,] -0.68587492 -0.93605640 -2.8959270  1.08664460 0.205011102
[6,]  0.10207794 -0.80774063  0.9329347  0.44210056 0.360341446
> apply(mydata,1,mean)
[1]  0.20117141  0.43562895  0.05894057 -0.21464346 -0.64524053  0.20594280
> apply(mydata,2,mean)
[1] -0.3890101 -0.2241723 -0.1191467  0.3634772  0.4036851
> apply(mydata,2,mean,trim=2)   # trim values above 0.5 are treated as 0.5, so this returns the column medians
[1] -0.3163743 -0.1342689  0.2285080  0.2784993  0.3512657
> apply(mydata,2,mean,trim=0.2)
[1] -0.31324231 -0.24760656  0.08884265  0.29314473  0.41381369

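apply(): applies a function over the margins of an array (for example, the rows or columns of a matrix);

lapply() and sapply(): apply a function over a list.

apply(x,MARGIN,FUN,...)

· x: the data object;

· MARGIN: the dimension index (for a matrix, 1 means rows and 2 means columns);

· FUN: the function to apply;

· ...: any additional arguments passed on to FUN.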
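
A small deterministic sketch of these functions, using a made-up matrix and list so that the results can be checked by hand:

```r
m <- matrix(1:6, nrow = 2)      # rows are (1, 3, 5) and (2, 4, 6)
row_sums <- apply(m, 1, sum)    # MARGIN = 1: one result per row
col_sums <- apply(m, 2, sum)    # MARGIN = 2: one result per column

scores <- list(a = c(1, 2, 3), b = c(4, 5, 6, 100))
as_list <- lapply(scores, mean)     # returns a list
as_vector <- sapply(scores, mean)   # simplifies to a named vector
trimmed <- sapply(scores, mean, trim = 0.25)  # trim is passed on to mean()

print(row_sums)
print(col_sums)
print(as_vector)
print(trimmed)
```

The row sums are 9 and 12, and the column sums are 3, 7, and 11. The trimmed mean of b drops the outlier 100 along with the smallest value.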

2. A solution to Example 01

(1) Enter the data from the table

> options(digits=2)
> Students<-c("John Davis","Angela Williams","Bullwinkle Moose","David Jones","Janice Markhammer","Cheryl Cushing","Reuven Ytzrhak","Greg Knox","Joel England","Mary Rayburn")
> Math<-c(502,600,412,358,495,512,410,625,573,522)
> Science<-c(95,99,80,82,75,85,80,95,89,86)
> English<-c(25,22,18,15,20,28,15,30,27,18)
> roster<-data.frame(Students,Math,Science,English,stringAsFactors=FALSE)
> roster
            Students Math Science English stringAsFactors
1         John Davis  502      95      25           FALSE
2    Angela Williams  600      99      22           FALSE
3   Bullwinkle Moose  412      80      18           FALSE
4        David Jones  358      82      15           FALSE
5  Janice Markhammer  495      75      20           FALSE
6     Cheryl Cushing  512      85      28           FALSE
7     Reuven Ytzrhak  410      80      15           FALSE
8          Greg Knox  625      95      30           FALSE
9       Joel England  573      89      27           FALSE
10      Mary Rayburn  522      86      18           FALSE

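options(digits=2): limits printed numbers to about 2 significant digits (it sets the number of significant digits, not the number of decimal places).

Note: stringAsFactors in the data.frame() call above is a misspelling of stringsAsFactors. Because data.frame() does not recognize the name as an option, it treats it as a new variable, which is why a column named stringAsFactors (filled with FALSE) appears in this and the later listings of roster. With the correct spelling, stringsAsFactors=FALSE, no such column is created.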

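(2) Because the math, science, and English exams are reported on different scales, the three scores must be made comparable before they are combined. A common approach is to standardize the variables with scale(), so that each score is expressed in standard-deviation units.

> z<-scale(roster[,2:4])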
> z
        Math Science English
 [1,]  0.013   1.078   0.587
 [2,]  1.143   1.591   0.037
 [3,] -1.026  -0.847  -0.697
 [4,] -1.649  -0.590  -1.247
 [5,] -0.068  -1.489  -0.330
 [6,]  0.128  -0.205   1.137
 [7,] -1.049  -0.847  -1.247
 [8,]  1.432   1.078   1.504
 [9,]  0.832   0.308   0.954
[10,]  0.243  -0.077  -0.697
attr(,"scaled:center")
   Math Science English 
    501      87      22 
attr(,"scaled:scale")
   Math Science English 
   86.7     7.8     5.5
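
As a check on what scale() does here, the sketch below recomputes the standardized Math scores by hand as (x - mean) / sd, using the Math values entered in step (1). The variable names z_scale and z_manual are arbitrary.

```r
Math <- c(502, 600, 412, 358, 495, 512, 410, 625, 573, 522)

z_scale <- as.numeric(scale(Math))             # drop the matrix attributes
z_manual <- (Math - mean(Math)) / sd(Math)     # the same calculation by hand

print(round(z_manual, 3))
print(isTRUE(all.equal(z_scale, z_manual)))
```

After standardization the column has mean 0 and standard deviation 1, which is what puts the three exams on a common footing.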

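(3) Compute a performance score for each student as the mean of that student's standardized scores, that is, the mean of each row of z, using apply() with mean(); then add the score to roster with cbind().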

> score<-apply(z,1,mean)
> score
 [1]  0.56  0.92 -0.86 -1.16 -0.63  0.35 -1.05  1.34  0.70 -0.18
> roster<-cbind(roster,score)
> roster
            Students Math Science English stringAsFactors score
1         John Davis  502      95      25           FALSE  0.56
2    Angela Williams  600      99      22           FALSE  0.92
3   Bullwinkle Moose  412      80      18           FALSE -0.86
4        David Jones  358      82      15           FALSE -1.16
5  Janice Markhammer  495      75      20           FALSE -0.63
6     Cheryl Cushing  512      85      28           FALSE  0.35
7     Reuven Ytzrhak  410      80      15           FALSE -1.05
8          Greg Knox  625      95      30           FALSE  1.34
9       Joel England  573      89      27           FALSE  0.70
10      Mary Rayburn  522      86      18           FALSE -0.18

(4) quantile() gives the percentile cutoffs of the students' performance scores.

> y<-quantile(roster$score,c(.8,.6,.4,.2))
> y
  80%   60%   40%   20% 
 0.74  0.44 -0.36 -0.89
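
To see how the cutoffs relate to "the top 20%", the sketch below reuses the scores as printed above. They are rounded to two digits there, so the cutoffs may differ from the transcript in the last digit.

```r
# Scores copied from the printed (rounded) output above
score <- c(0.56, 0.92, -0.86, -1.16, -0.63, 0.35, -1.05, 1.34, 0.70, -0.18)

cutoffs <- quantile(score, c(.8, .6, .4, .2))
share_top <- mean(score >= cutoffs[1])   # fraction of students at or above the 80% cutoff

print(cutoffs)
print(share_top)
```

Two of the ten students, that is 20%, fall at or above the 80% cutoff.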

(5) Using logical operators, recode the performance scores into a new categorical variable, grade, which is added to the data frame roster.

> roster$grade[score>=y[1]]<-"A"
> roster$grade[score<y[1]&score>=y[2]]<-"B"
> roster$grade[score<y[2]&score>=y[3]]<-"C"
> roster$grade[score<y[3]&score>=y[4]]<-"D"
> roster$grade[score<y[4]]<-"F"
> roster
            Students Math Science English stringAsFactors score grade
1         John Davis  502      95      25           FALSE  0.56     B
2    Angela Williams  600      99      22           FALSE  0.92     A
3   Bullwinkle Moose  412      80      18           FALSE -0.86     D
4        David Jones  358      82      15           FALSE -1.16     F
5  Janice Markhammer  495      75      20           FALSE -0.63     D
6     Cheryl Cushing  512      85      28           FALSE  0.35     C
7     Reuven Ytzrhak  410      80      15           FALSE -1.05     F
8          Greg Knox  625      95      30           FALSE  1.34     A
9       Joel England  573      89      27           FALSE  0.70     B
10      Mary Rayburn  522      86      18           FALSE -0.18     C
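
The five assignments above can also be written as a single call to cut(). This is an alternative sketch, not the approach used in the transcript; it reuses the rounded scores as printed, and right = FALSE makes each interval closed on the left so that the boundaries follow the same >= logic.

```r
# Scores copied from the printed (rounded) output above
score <- c(0.56, 0.92, -0.86, -1.16, -0.63, 0.35, -1.05, 1.34, 0.70, -0.18)
y <- quantile(score, c(.8, .6, .4, .2))

# Breaks in ascending order: -Inf, 20%, 40%, 60%, 80%, Inf
grade <- cut(score,
             breaks = c(-Inf, rev(y), Inf),
             labels = c("F", "D", "C", "B", "A"),
             right = FALSE)

print(as.character(grade))
```

cut() returns a factor; as.character() converts it when plain strings are wanted.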

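(6) Use strsplit(), which is applied to a character vector and returns a list, to split each student's name at the space into a first name and a last name. The first attempt below fails because Students is stored as a factor: the misspelled stringAsFactors argument in step (1) had no effect, and versions of R before 4.0.0 convert strings to factors by default. Converting the column with as.character() resolves the error.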

> name<-strsplit(roster$Students," ")
Error in strsplit(roster$Students, " ") : non-character argument
> is.character(roster$Students)
[1] FALSE
> name<-strsplit(as.character(roster$Students)," ")
> name
[[1]]
[1] "John"  "Davis"

[[2]]
[1] "Angela"   "Williams"

[[3]]
[1] "Bullwinkle" "Moose"     

[[4]]
[1] "David" "Jones"

[[5]]
[1] "Janice"     "Markhammer"

[[6]]
[1] "Cheryl"  "Cushing"

[[7]]
[1] "Reuven"  "Ytzrhak"

[[8]]
[1] "Greg" "Knox"

[[9]]
[1] "Joel"    "England"

[[10]]
[1] "Mary"    "Rayburn"
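
The failure and its fix can be reproduced in isolation. The sketch below builds a factor explicitly, so it behaves the same way regardless of R version; the two names are taken from the roster above and the variable names are arbitrary.

```r
students <- factor(c("John Davis", "Greg Knox"))

# strsplit() rejects a factor; capture the error message instead of stopping
attempt <- tryCatch(strsplit(students, " "),
                    error = function(e) conditionMessage(e))

# Converting to character first works and yields a list of name parts
parts <- strsplit(as.character(students), " ")

print(attempt)
print(parts)
```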

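(7) sapply(): extracts the first element of each component into a vector of first names, and the second element of each component into a vector of last names.

cbind(): adds the first-name and last-name vectors to roster.

The Students column is dropped from roster, since it is no longer needed.

> Firstname<-sapply(name,"[",1)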
> Lastname<-sapply(name,"[",2)
> roster<-cbind(Firstname,Lastname,roster[,-1])
> roster
    Firstname   Lastname Math Science English stringAsFactors score grade
1        John      Davis  502      95      25           FALSE  0.56     B
2      Angela   Williams  600      99      22           FALSE  0.92     A
3  Bullwinkle      Moose  412      80      18           FALSE -0.86     D
4       David      Jones  358      82      15           FALSE -1.16     F
5      Janice Markhammer  495      75      20           FALSE -0.63     D
6      Cheryl    Cushing  512      85      28           FALSE  0.35     C
7      Reuven    Ytzrhak  410      80      15           FALSE -1.05     F
8        Greg       Knox  625      95      30           FALSE  1.34     A
9        Joel    England  573      89      27           FALSE  0.70     B
10       Mary    Rayburn  522      86      18           FALSE -0.18     C
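
The "[" in the sapply() calls is R's subsetting operator used as a function, so sapply(name,"[",1) calls name[[i]][1] for every component. A short sketch with two of the names from the roster, alongside an equivalent spelled-out version:

```r
name <- strsplit(c("John Davis", "Angela Williams"), " ")

Firstname <- sapply(name, "[", 1)                    # first element of each component
Lastname <- sapply(name, function(parts) parts[2])   # the same idea, written out

print(Firstname)
print(Lastname)
```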

(8) Sort the data set roster by last name and then by first name

> roster[order(Lastname,Firstname),]
    Firstname   Lastname Math Science English stringAsFactors score grade
6      Cheryl    Cushing  512      85      28           FALSE  0.35     C
1        John      Davis  502      95      25           FALSE  0.56     B
9        Joel    England  573      89      27           FALSE  0.70     B
4       David      Jones  358      82      15           FALSE -1.16     F
8        Greg       Knox  625      95      30           FALSE  1.34     A
5      Janice Markhammer  495      75      20           FALSE -0.63     D
3  Bullwinkle      Moose  412      80      18           FALSE -0.86     D
10       Mary    Rayburn  522      86      18           FALSE -0.18     C
2      Angela   Williams  600      99      22           FALSE  0.92     A
7      Reuven    Ytzrhak  410      80      15           FALSE -1.05     F
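
order() sorts by its first argument and uses later arguments only to break ties. A minimal sketch with a made-up data frame in which two people share a last name; the name Adam Davis is invented for illustration and is not part of the roster.

```r
# Made-up data: two people share the last name "Davis"
people <- data.frame(Firstname = c("John", "Angela", "Adam"),
                     Lastname = c("Davis", "Williams", "Davis"),
                     stringsAsFactors = FALSE)

# Sort by last name, breaking ties by first name
sorted <- people[order(people$Lastname, people$Firstname), ]

print(sorted)
```

Referring to the columns through the data frame (people$Lastname) keeps the sort tied to the data frame's current contents. The result must be assigned, as here, if the sorted order is to be kept.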

Posted 2013-07-31 21:18 by seven_wang
