R Programming: Part 2 - Programming with R

Part 2 - Programming with R

Contents

2.1 Control Structures

2.2 Functions in R
 2.2.1 Declarations of Functions
 2.2.2 Arguments of Functions

2.3 Scoping Rules
 2.3.1 Lexical Scoping & Function Closure
 2.3.2 Lexical scoping vs. Dynamic scoping

2.4 More About Data
 2.4.1 Have a Glimpse of Data
 2.4.2 Dates and Times

2.5 Programming Assignment: Air Pollution

2.1 Control Structures

If-else structure:
if (<condition>) {#do something} else {#do something else}
The following acts the same:

> if (x > 3) y <- 10 else y <- 20
> y <- if (x > 3) 10 else 20
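A slightly fuller sketch with an else if branch (here x is assumed to hold a numeric value):

if (x > 3) {
    y <- 10
} else if (x < 0) {
    y <- 0
} else {
    y <- 20
}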

For loop structure:
for (<iterator> in <vector>) {#do something}
The following are equivalent (assuming x is a vector):

> for (i in 1:length(x)) print(x[i])
> for (i in seq_along(x)) print(x[i])
> for (i in seq_len(length(x))) print(x[i])
> for (letter in x) print(letter)

While loop structure:
while (<condition>) {#do something}

Repeat loop structure:
repeat {#do something}
It initiates an infinite loop. The only way to exit it is to call break.

next skips the current iteration of a loop.
return exits the entire function.
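A short sketch combining next and break — count even numbers and stop after the third one:

count <- 0
for (i in 1:100) {
    if (i %% 2 == 1) next    ## skip odd numbers
    count <- count + 1
    if (count == 3) break    ## exit the loop entirely
}
count                        ## 3 (the loop stopped at i = 6)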

The control structures mentioned above are useful for writing programs, but for command-line interactive work the *apply functions are often more useful (discussed in the next part).

Extra: rbinom(n, size, prob) draws n independent values from a binomial distribution with size trials and success probability prob (X ~ B(size, prob)), e.g. rbinom(10, 1, 0.5) simulates the results of 10 fair coin flips.
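A small sketch (the simulated values will vary unless a seed is set):

set.seed(1)                   ## make the simulation reproducible
flips <- rbinom(10, 1, 0.5)   ## ten simulated fair coin flips (1 = success)
mean(flips)                   ## observed proportion of successes
rbinom(5, 100, 0.3)           ## five draws from B(100, 0.3)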

2.2 Functions in R

2.2.1 Declarations of Functions

Functions are R objects as well. Functions can be created through

f <- function(<arguments>) {
	## Do something interesting
}

My first R function:

addTwo <- function(x, y) {
    x + y
}

The function will return whatever the value of the last expression is.
Another example:

column.means <- function(x)
{
    means <- vector("numeric", length = ncol(x))
    for (i in 1:ncol(x))
        means[i] <- mean(x[, i])
    means
}
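
A minimal usage sketch on a small matrix:

> m <- matrix(1:6, nrow = 2)   ## columns are (1, 2), (3, 4), (5, 6)
> column.means(m)
[1] 1.5 3.5 5.5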

RStudio keyboard shortcuts (macOS):
Run all: Option + Command + R
Run selected lines: Command + Enter

2.2.2 Arguments of Functions

Function arguments can have default values and can be matched by position or by name. When an argument is matched by name, it is "removed" from the argument list and the remaining arguments are matched in positional order.

> args(lm)
function (formula, data, subset, weights, na.action, method = "qr", 
    model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, 
    contrasts = NULL, offset, ...)

The following two calls are equivalent:

> lm(data = mydata, y ~ x, model = FALSE, 1:100)
> lm(y ~ x, mydata, 1:100, model = FALSE)

Arguments matching follows the order of:

  1. Check for an exact match for a named argument;
  2. Check for a partial match;
  3. Check for a positional match.
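
A small illustration with a hypothetical function f (the names and numbers are made up for demonstration):

f <- function(alpha, beta, gamma = 1) {
    alpha + 10 * beta + 100 * gamma
}
f(1, 2)            ## positional: alpha = 1, beta = 2, gamma defaults to 1 -> 121
f(be = 2, 1)       ## "be" partially matches beta; 1 is matched positionally to alpha -> 121
f(gamma = 3, 1, 2) ## exact match for gamma; alpha and beta matched by position -> 321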

NULL can serve as a default value, standing for "nothing".

Arguments to functions are only evaluated when needed (lazy evaluation), e.g.

f <- function(a, b) {
	2 * a
}
f(2)

will not throw any error.
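
By contrast, if the body actually uses the missing argument, the error appears only at the point where it is evaluated (a sketch):

g <- function(a, b) {
    print(a)
    print(b)    ## error raised here: argument "b" is missing, with no default
}
g(45)           ## prints 45 first, then throws the error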

The "..." argument:

  • It indicates a variable number of arguments that are usually passed on to other functions, especially when you don't want to copy the entire argument list.
myplot <- function(x, y, type, ...) {
	plot(x, y, type = type, ...)
}
  • Used in generic functions (more about this later).
  • Used when the arguments passed to the function cannot be known in advance. Arguments that appear after ... must be named explicitly and cannot be partially matched.
> args(paste)
function (..., sep = " ", collapse = NULL)
> paste("a", "b", sep = ":")
[1] "a:b"
> paste("a", "b", se = "c")
[1] "a b c"

2.3 Scoping Rules

When several functions share one name, R binds the name to the first matching definition it finds: the global environment is searched first, and if no match exists there, R searches the namespace of each package on the search list, which can be inspected with the search() function.

> search()
 [1] ".GlobalEnv"        "package:swirl"     "tools:rstudio"    
 [4] "package:stats"     "package:graphics"  "package:grDevices"
 [7] "package:utils"     "package:datasets"  "package:methods"  
[10] "Autoloads"         "package:base"     

The global environment, i.e. the user’s workspace is always the first, and the base package is always the last.
In the search process, the order of the packages matters!
When the user loads a new package with the library() function, the namespace of that package is placed in the second position of the search list (by default), and everything else moves down.
R has separate namespaces for functions and non-functions, so it’s possible to have an object and a function sharing the same name.
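
A small illustration of how the search order plays out (just a sketch — masking base functions is not recommended in real code):

> mean <- function(x) 999      ## define a function named mean in the global environment
> mean(1:10)                   ## the global environment is searched first
[1] 999
> base::mean(1:10)             ## the base version is still reachable via its namespace
[1] 5.5
> rm(mean)                     ## remove the mask; mean(1:10) now returns 5.5 again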

2.3.1 Lexical Scoping & Function Closure

The scoping rules, which set R apart from S, determine how a value is associated with a free variable in a function. R uses lexical scoping (also called static scoping), as opposed to dynamic scoping.

Lexical scoping: The values of free variables are searched for in the environment in which the function was defined.
With lexical scope, a name always refers to its (more or less) local lexical environment. (from Wikipedia)

An environment is a collection of (symbol, value) pairs. Every environment has a parent environment that it inherits from (except the empty environment of course).

A function + An environment = A closure.
A function (itself an object) is associated with the environment in which it was defined; that environment houses all the (symbol, value) pairs visible to the function at declaration time.

f <- function(x, y) {
	x ^ 2 + y / z
}

The variable z here is called a free variable: it is neither a formal argument nor a local variable of f.
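
A quick sketch of what happens depending on whether z can be found when f is called:

> f(2, 3)
Error in f(2, 3) : object 'z' not found
> z <- 2                       ## define z in the global environment, where f was defined
> f(2, 3)                      ## 2^2 + 3/2
[1] 5.5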

The rule for binding a free variable to a value:
If the value of a free variable is not found in the environment in which the function was defined, the search continues recursively through the parent environments until it hits the top-level environment (usually the global environment, or the namespace of the package where the function was defined). From there, the search continues down the search list until it reaches the empty environment; if the symbol is still not found, an error is thrown.

An application of lexical scoping: constructor functions

make.power <- function(n) {
	pow <- function(x) {
		x ^ n
	}
	pow
}

Here n is a free variable of the inner function pow, and its value is found in the environment where pow was defined, i.e. the environment created by the call to make.power. By returning a function, make.power can generate a whole family of functions, e.g.

> cube <- make.power(3)
> square <- make.power(2)
> cube(3)
[1] 27
> square(3)
[1] 9

ls(environment(<function>)) can be used to inspect a function's closure, while get(<symbol>, environment(<function>)) returns the value bound to a specific symbol.

> ls(environment(cube))
[1] "n"   "pow"
> get("n", environment(cube))
[1] 3

2.3.2 Lexical scoping vs. Dynamic scoping

A fundamental distinction in scoping is what "part of a program" means: a region of the source text, or a portion of the program's run time.

In languages with lexical scope (also called static scope), name resolution depends on the location in the source code and the lexical context, which is defined by where the named variable or function is defined.

In contrast, in languages with dynamic scope, the name resolution depends upon the program state and the execution context (or calling context), which is defined by where the named variable or function is called from.

In practice, with lexical scope a variable's definition is resolved by searching its containing block or function, then if that fails, searching the outer containing block, and so on, whereas with dynamic scope the calling function is searched, then the function which called that calling function, and so on, progressing up the call stack.

y <- 10
f <- function(x) {
    y <- 2
    g(x)
}
g <- function(x) {
    y
}
f(1)

If we run the program above, the free variable y in g is resolved differently under the two rules: lexical scoping looks it up in the environment where g was defined, i.e. the global environment, so f(1) returns 10; dynamic scoping would look it up in the environment from which g was called, i.e. inside f, so f(1) would return 2.

Most modern languages use lexical scoping for variables and functions (sometimes with limitations), but dynamic scoping is still used in some languages, notably some dialects of Lisp.

2.4 More About Data

2.4.1 Have a Glimpse of Data

There are many functions that can give us information about a dataset.

class() returns the class of an object.
head() returns the first part of the data (the first six rows by default).
tail() returns the last part of the data (the last six rows by default).
object.size() returns how much memory an object occupies (in bytes).

The summary() function behaves differently for different data types.

  • For numeric vectors, it returns six values: the minimum, first quartile, median, mean, third quartile and maximum (plus an NA count when missing values are present).
  • For matrices and data frames, it computes the six statistics above for each column.
  • For factors, it works in a way similar to the table() function, displaying the absolute frequency of each level (see the small example after this list).
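
For instance (a tiny made-up factor):

> summary(factor(c("a", "b", "b", "c")))
a b c 
1 2 1 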

The str() function (structure) compactly displays the internal structure of an arbitrary R object, ideally using one line for each basic component.

> library(datasets)   ## Retrieve data from built-in datasets
> class(airquality)
[1] "data.frame"
> head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
> tail(airquality)
    Ozone Solar.R Wind Temp Month Day
148    14      20 16.6   63     9  25
149    30     193  6.9   70     9  26
150    NA     145 13.2   77     9  27
151    14     191 14.3   75     9  28
152    18     131  8.0   76     9  29
153    20     223 11.5   68     9  30
> object.size(airquality)
5496 bytes

> summary(airquality$Ozone)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.     NAs 
   1.00   18.00   31.50   42.13   63.25  168.00      37 
> summary(airquality)
     Ozone           Solar.R           Wind             Temp           Month            Day      
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00   Min.   :5.000   Min.   : 1.0  
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00   1st Qu.:6.000   1st Qu.: 8.0  
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00   Median :7.000   Median :16.0  
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88   Mean   :6.993   Mean   :15.8  
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00   3rd Qu.:8.000   3rd Qu.:23.0  
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00   Max.   :9.000   Max.   :31.0  
 NAs    :37       NAs    :7 
 
> str(airquality)
'data.frame':	153 obs. of  6 variables:
 $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
 $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
 $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
 $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
 $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
 $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...

2.4.2 Dates and Times

Dates and times are represented by the Date class and the POSIXct/POSIXlt class respectively. Dates are stored internally as the number of days since 1970-01-01.

> x <- as.Date("1970-01-02")  ## Character -> Date
> x
[1] "1970-01-02"
> unclass(x)                  ## A number underneath
[1] 1

The POSIXct class stores times as a (very large) number of seconds since 1970-01-01, while the POSIXlt class is a list underneath and stores extra information such as the day of the week, day of the month and day of the year.

weekdays(), months() and quarters() are some functions that work on dates and times; they return the day of the week, the month name and the quarter ("Q1", "Q2", "Q3" or "Q4") respectively.
Sys.time() gives the current time as a POSIXct object.

> x <- Sys.time()    ## POSIXct class object
> x
[1] "2016-07-12 23:53:02 CST"
> unclass(x)
[1] 1468338782

> x <- as.POSIXlt(x)
> unclass(x)[1:4]
$sec
[1] 2.385467
$min
[1] 53
$hour
[1] 23
$mday
[1] 12

The strptime() function turns a character string into a POSIXlt time object according to a specified format string.
Arithmetic operations (+, -) and comparisons (==, <, >) can be performed on date and time objects.
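
A few short sketches of these operations (exact output depends on locale and time zone):

x <- strptime("January 10, 2012 10:40", "%B %d, %Y %H:%M")   ## character -> POSIXlt
class(x)                                       ## "POSIXlt" "POSIXt"
weekdays(as.Date("2016-07-12"))                ## "Tuesday" (in an English locale)
months(as.Date("2016-07-12"))                  ## "July"
quarters(as.Date("2016-07-12"))                ## "Q3"
as.Date("2012-03-01") - as.Date("2012-02-28")  ## Time difference of 2 days (2012 is a leap year)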

2.5 Programming Assignment: Air Pollution

For this programming assignment, you will write three functions that interact with the dataset that accompanies this assignment. The dataset is contained in a zip file, specdata.zip, that you can download from the Coursera web site.

The zip file contains 332 comma-separated-value (CSV) files containing pollution monitoring data for fine particulate matter (PM) air pollution at 332 locations in the United States. Each file contains data from a single monitor and the ID number for each monitor is contained in the file name. For example, data for monitor 200 is contained in the file "200.csv". Each file contains three variables:

  • Date: the date of the observation in YYYY-MM-DD format (year-month-day)
  • sulfate: the level of sulfate PM in the air on that date (measured in micrograms per cubic meter)
  • nitrate: the level of nitrate PM in the air on that date (measured in micrograms per cubic meter)

For this programming assignment you will need to unzip this file and create the directory 'specdata'. Once you have unzipped the zip file, do not make any modifications to the files in the 'specdata' directory. In each file you'll notice that there are many days where either sulfate or nitrate (or both) are missing (coded as NA). This is common with air pollution monitoring data in the United States.

Part 1

Write a function named 'pollutantmean' that calculates the mean of a pollutant (sulfate or nitrate) across a specified list of monitors. The function 'pollutantmean' takes three arguments: 'directory', 'pollutant', and 'id'. Given a vector of monitor ID numbers, 'pollutantmean' reads those monitors' particulate matter data from the directory specified in the 'directory' argument and returns the mean of the pollutant across all of the monitors, ignoring any missing values coded as NA.

Sample inputs and outputs:

> source("pollutantmean.R")
> pollutantmean("specdata", "sulfate", 1:10)
 [1] 4.064
> pollutantmean("specdata", "nitrate", 70:72)
 [1] 1.706

My code:

pollutantmean <- function(directory, pollutant, id = 1:332)
{
    totsum = 0
    totnum = 0
    for (i in id)
    {
        poldata = read.csv(list.files(directory, full.names = TRUE)[i])
        totsum = totsum + sum(poldata[[pollutant]], na.rm = TRUE)
        totnum = totnum + sum(!is.na(poldata[[pollutant]]))
    }
    totsum / totnum
}

It is recommended not to change the working directory to avoid unnecessary troubles.

Methods for accessing the specified file without changing current working directory (assuming id = c(1, 10, 100)):

i. list.files() with full.names = TRUE

> list.files(directory, full.names=TRUE)[id]
[1] "specdata/001.csv" "specdata/010.csv" "specdata/100.csv"

ii. Use paste() and formatC(), similar to output stream manipulators in C++

> paste(directory, "/", formatC(id, width = 3, flag = "0"), ".csv", sep="")
[1] "specdata/001.csv" "specdata/010.csv" "specdata/100.csv"

iii. Use sprintf(), similar to formatted output in C

> sprintf("%s/%03d.csv", directory, id)
[1] "specdata/001.csv" "specdata/010.csv" "specdata/100.csv"

More about the code:
In R, "=" can also be used to assign values, just similar to "<-". LOL.
But be careful when you are passing values to arguments in a function, you'd better use "=" operator instead of "<-".

totsum = totsum + sum(poldata[[pollutant]], na.rm = TRUE)

To compute the sum, use the sum() function instead of a loop. sum() has an option na.rm, a logical value indicating whether NA values should be removed before the sum is computed.

totnum = totnum + sum(!is.na(poldata[[pollutant]]))

To count the non-missing values, !is.na() gives a logical vector that is FALSE (0) for NA values and TRUE (1) otherwise, so summing it yields the desired count. Note that we use [[ ]] to subset the data frame instead of $ because pollutant holds the column name as a character string rather than a literal name.
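
A tiny illustration of both idioms on a made-up vector:

> v <- c(1, NA, 3, NA, 5)
> sum(v, na.rm = TRUE)    ## 1 + 3 + 5
[1] 9
> sum(!is.na(v))          ## number of non-missing values
[1] 3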

Part 2

Write a function that reads a directory full of files and reports the number of completely observed cases in each data file. The function should return a data frame where the first column is the name of the file and the second column is the number of complete cases.

Sample inputs and outputs:

> source("complete.R")
> complete("specdata", 1)
##   id nobs
## 1  1  117
> complete("specdata", c(2, 4, 8, 10, 12))
##   id nobs
## 1  2 1041
## 2  4  474
## 3  8  192
## 4 10  148
## 5 12   96
> complete("specdata", 30:25)
##   id nobs
## 1 30  932
## 2 29  711
## 3 28  475
## 4 27  338
## 5 26  586
## 6 25  463

My code:

complete <- function(directory, id = 1:332)
{
    nobs = vector("numeric", length = length(id))
    for (i in seq_along(id))
    {
        ## read the file for monitor id[i] and count its fully observed rows
        poldata = read.csv(list.files(directory, full.names = TRUE)[id[i]])
        nobs[i] = sum(complete.cases(poldata))
    }
    data.frame(id = id, nobs = nobs)
}

Part 3

Write a function that takes a directory of data files and a threshold for complete cases and calculates the correlation between sulfate and nitrate for monitor locations where the number of completely observed cases (on all variables) is greater than or equal to the threshold. The function should return a vector of correlations for the monitors that meet the threshold requirement. If no monitors meet the threshold requirement, then the function should return a numeric vector of length 0.

For this function you will need to use the 'cor' function in R which calculates the correlation between two vectors. Please read the help page for this function via '?cor' and make sure that you know how to use it.

Sample inputs and outputs:

> source("corr.R")
> cr <- corr("specdata", 150)
> head(cr)
[1] -0.01896 -0.14051 -0.04390 -0.06816 -0.12351 -0.07589
> summary(cr)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.21060 -0.05383  0.08884  0.11990  0.25980  0.76310

My code:

corr <- function(directory, threshold = 0)
{
    correlation = vector("numeric")
    cnt = 0
    for (i in 1:332)
    {
        poldata = read.csv(list.files(directory, full.names = TRUE)[i])
        ## keep only the rows where every variable is observed
        poldata = poldata[complete.cases(poldata), ]
        if (nrow(poldata) < threshold) next
        cnt = cnt + 1
        correlation[cnt] = cor(poldata$nitrate, poldata$sulfate)
    }
    correlation
}