Case Study: Processing Historical Weather Pattern Data (with R)

1. Introduction

A few days ago I read a case study article about processing historical weather pattern data.

I found the article valuable for learning because it describes a complete data analysis process:

1. Web scraping. Request a URL, get its response, and extract the information we need from it.

2. File manipulation. Download zip files from the internet, write them to disk, and unzip them.

3. Data transformation. Read in the data files, merge them into a single data frame, and do some transformation and cleaning.

The original article is written in Python. I do use Python, but I also use R daily.

I decided to try doing the same things in R, even though some of these topics (like web scraping) are better known in the Python world.

There is an interesting story and problem background behind the process, so I recommend you also read the original article.

 

2. Web Scraping

The first step is web scraping. I won't copy and paste all the Python code from the original article.

It basically does three things:

1. Use the Python requests library to visit a URL (the open data server of the German Weather Service, aka DWD) and get its response.

2. Use the famous BeautifulSoup library to parse the response and extract the information we need: the download URLs for each weather data file.

3. Use the requests library again to request each download URL, then use Python's built-in open() function to write the downloaded data to disk.

Below, I will follow the same process, but in R.

2.1 How to do web scraping in R?

In R, we first need a library equivalent to requests and BeautifulSoup. Luckily, we have two in one package: rvest.

We can use rvest::read_html() to visit a URL and get its response.

If you print out the response, it is just the raw HTML document as a list of tagged nodes.

library(xml2)
library(rvest)

base_url <- 'https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/hourly/air_temperature/historical/'

base_resp <- read_html(base_url)

2.2 How to parse the HTML response and extract information in R?

We can think of the response as a list containing "chaotic" information. How can we analyse it and extract the useful parts?

We can use rvest::html_nodes() to analyse the response, which is equivalent to the job normally done by BeautifulSoup in Python.

A typical web scraping task would stop here, because the information or data we need is contained inside that chaotic HTML response document.

For example, after scraping a store's pages we can locate and output the prices and item names.
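
A minimal sketch of that typical case might look like this (the URL and the CSS selectors are invented purely for illustration):

library(rvest)

# Hypothetical store page; the URL and the selectors are made-up examples
store_resp <- read_html("https://example.com/store")

# Extract the text of the nodes that hold the item names and prices
item_names  <- store_resp %>% html_nodes(".item-name")  %>% html_text()
item_prices <- store_resp %>% html_nodes(".item-price") %>% html_text()

data.frame(name = item_names, price = item_prices)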

However, in our case study, which makes it even more valuable for learning, the website contains the URLs for downloading the data files.

Under the "a" tags there are URLs for the data files (which are zip files). All the data files are named like "stundenwerte_TU*", so we do a little pattern filtering with grepl().

zip_names <- html_nodes(base_resp, "a") %>% html_attr("href")
zip_names <- zip_names[grepl("stundenwerte_TU", zip_names)] # keep only the hourly air temperature archives

2.3 How to download data files from their URLs in R?

I am surprised that there is a base R function called download.file(), which can request a download URL and write the content to disk directly.

It's even more convenient than the combination of requests and open() in Python.

file_max <- 5
dl_count <- 0

for(i in 1:length(zip_names)) {
  zip_url <- paste0(base_url, zip_names[i])
  download.file(zip_url, paste0(download_folder, zip_names[i]))
  dl_count <- dl_count + 1
  if(dl_count >= file_max) {
    break
  }
}

2.4 Step Summary

Here is the code for the first step: web scraping & download.

library(xml2)
library(rvest)

base_url <- 'https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/hourly/air_temperature/historical/'

base_resp <- read_html(base_url)

zip_names <- html_nodes(base_resp, "a") %>% html_attr("href")
zip_names <- zip_names[grepl("stundenwerte_TU", zip_names)] # keep only the hourly air temperature archives

if(!file.exists("downloadr")) {
  dir.create("./downloadr/")
}
download_folder <- "./downloadr/"

file_max <- 5
dl_count <- 0

for(i in 1:length(zip_names)) {
  zip_url <- paste0(base_url, zip_names[i])
  download.file(zip_url, paste0(download_folder, zip_names[i]))
  dl_count <- dl_count + 1
  if(dl_count >= file_max) {
    break
  }
}

 

3. File Manipulation

The second step is to unzip the zip files. However, each zip file contains more than the weather data we need; it also contains some metadata files that describe the data.

Let's see what we can do in R.

3.1 How to list all the files in a folder in R?

In Python, we have the glob library to do this. In R, we can do the same with list.files().

files <- list.files(download_folder, pattern="stundenwerte_TU")

Unlike glob.glob() in Python, we still need to build the full path for each file before we can pass it to a read-csv function.

Later we will do this path building with a simple string concatenation function called paste0().
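
Here is a small sketch of the difference, re-using the download folder from the previous step:

download_folder <- "./downloadr/"

# list.files() returns only the bare file names, without the folder part
files <- list.files(download_folder, pattern="stundenwerte_TU")

# so we concatenate the folder and the file names ourselves to get usable paths
full_paths <- paste0(download_folder, files)

# alternatively, list.files(..., full.names=TRUE) returns complete paths directly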

3.2 How to unzip files?

In R we can use the unzip() function to unzip files, and it comes with base R. This function serves three main purposes here:

1. Get the list of files inside the zip.

We use the list=TRUE argument for this. Then we use grepl() to do a regular-expression match and keep only the files we need.

Each weather data file has a name containing "produkt".

2. Unzip all the files in the zip, or only the files you specify.

Use the files= argument to specify the file names we want; everything else will be ignored.

3. Extract the files into the folder we want.

Use the exdir= argument to specify the output folder. Be warned, I was surprised to hit a quirk here:

The string specifying the folder path cannot end with a slash "/", otherwise it stops with a warning. (Some other articles online have reported this as well.)

for(i in 1:length(files)) {
  # List the contents of the zip and keep only the "produkt" data file
  list_of_filenames <- unzip(paste0(download_folder, files[i]), list=TRUE)
  extract_filename <- list_of_filenames[grepl("produkt", list_of_filenames[, "Name"]), "Name"]
  # Extract just that file into import_folder (the exdir= path must not end with a slash /)
  unzip(paste0(download_folder, files[i]), files=extract_filename, exdir=import_folder)
}

3.3 Step Summary

download_folder <- "./downloadr/"
import_folder <- "./importr"
if(!file.exists(import_folder)) {
  dir.create(import_folder)
}

files <- list.files(download_folder, pattern="stundenwerte_TU")

for(i in 1:length(files)) {
  # List the contents of the zip and keep only the "produkt" data file
  list_of_filenames <- unzip(paste0(download_folder, files[i]), list=TRUE)
  extract_filename <- list_of_filenames[grepl("produkt", list_of_filenames[, "Name"]), "Name"]
  # Extract just that file into import_folder (the exdir= path must not end with a slash /)
  unzip(paste0(download_folder, files[i]), files=extract_filename, exdir=import_folder)
}

  

4. Data Transforming

In this step we will use the famous tidyverse collection of packages. It has completely turned around the awkward situation of old R. If you only used R around 2011-2015 and do not know the tidyverse, you should definitely try the new R again.

No wonder R version 4.0.2 (June 2020) calls itself "Taking Off Again".

4.1 How to read a csv (txt) file in R?

We no longer use base R's read.csv() or read.delim(); most of the time we now use readr's read_csv(), or read_delim() when the separator is not a comma.

Since we have a list of files, we can read them in one by one with a for-loop and merge them into one big main_df data frame.

files <- list.files(import_folder)

main_df <- tibble()
for(i in 1:length(files)) {
  
  # Read in the file
  file_path <- str_c(import_folder, files[i])
  df <- read_delim(file_path, delim=";")
 
 #...
 #...
 #...
}

4.2 How to do data cleaning in R?  

Before merging each df into main_df, we want to do some data transformation:

1. There are some columns we don't need.

2. There are some columns whose data type we want to change.

3. We want to restrict the date range.

All of this can be done in one step by chaining a few functions with the %>% pipe operator, and it is very clear to write and to read.

select() selects columns, mutate() adds or modifies columns, and filter() keeps only the rows matching conditions.

As for dates and times, base R has many date-time functions, but most of them are inconsistent and messy.

Nowadays we have the lubridate package, which supplies consistent and modern date-time functions. We will use lubridate::ydm_h() to transform the MESS_DATUM column into a proper datetime.
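
As a quick illustration of that consistent naming scheme (the dates below are arbitrary examples, not from our data):

library(lubridate)

ymd("2020-10-24")                # year-month-day
dmy("24.10.2020")                # day-month-year
ymd_hms("2020-10-24 16:37:00")   # a full date-time

With that, here is the read-and-clean loop: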

files <- list.files(import_folder)

obsolete_columns <- c("QN_9", "RF_TU", "eor")

main_df <- tibble()
for(i in 1:length(files)) {
  
  # Read in the file
  file_path <- str_c(import_folder, files[i])
  df <- read_delim(file_path, delim=";")
  
  # Prepare the df before merging
  # (drop obsolete, convert data-type, filter to date)
  df <- df %>%
    select(-all_of(obsolete_columns)) %>%
    mutate(STATIONS_ID=as.numeric(STATIONS_ID),
           MESS_DATUM=lubridate::ydm_h(MESS_DATUM),
           TT_TU=as.numeric(TT_TU)) %>%
    filter(MESS_DATUM>="2007-01-01")
  
  # Merge to main_df
  main_df <- rbind(main_df, df)
}

4.3 Step Summary

library(tidyverse)

import_folder <- "./importr/"
output_folder <- "./export_uncleanedr/"
if(!file.exists(output_folder)) {
  dir.create(output_folder)
}

files <- list.files(import_folder)

obsolete_columns <- c("QN_9", "RF_TU", "eor")

main_df <- tibble()
for(i in 1:length(files)) {
  
  # Read in the file
  file_path <- str_c(import_folder, files[i])
  df <- read_delim(file_path, delim=";")
  
  # Prepare the df before merging
  # (drop obsolete, convert data-type, filter to date)
  df <- df %>%
    select(-all_of(obsolete_columns)) %>%
    mutate(STATIONS_ID=as.numeric(STATIONS_ID),
           MESS_DATUM=lubridate::ydm_h(MESS_DATUM),
           TT_TU=as.numeric(TT_TU)) %>%
    filter(MESS_DATUM>="2007-01-01")
  
  # Merge to main_df
  main_df <- rbind(main_df, df)
  
  # Display some status messages
  print(str_c("Finished file: ", files[i], ", which is file number ", i))
  print(str_c("Shape of main_df is: ", dim(main_df)))
}

# Make sure we have no duplications
main_df <- main_df %>% distinct()

# Save main_df to a .csv file and an .rds file to continue working in the next step
main_df %>% write_rds(str_c(output_folder, "to_clean.rds"))
main_df %>% write_csv(str_c(output_folder, "to_clean.csv"))

 

5. Final Process  

Firstly, some values in the TT_TU column equal -999, which means invalid.

I personally prefer to keep the original column unchanged and add new adjusted columns to main_df.

Secondly, we want to summarise the values into daily mean temperatures.

Thirdly, we can use the pivot_wider() function to build a pivot table in R.

In the last line I also add a date filter, which is not strictly needed; it just makes the output table show fewer NaN values.

library(tidyverse)
library(lubridate)

main_df <- read_rds("./export_uncleanedr/to_clean.rds")

main_df %>%
  mutate(TT_TU_Adj=ifelse(TT_TU==-999, NaN, TT_TU),   # mark -999 readings as invalid (NaN)
         MESS_DATUM_Adj=as_date(MESS_DATUM)) %>%      # keep only the date part
  pivot_wider(id_cols=MESS_DATUM_Adj, names_from=STATIONS_ID,
              values_from=TT_TU_Adj, values_fn=mean) %>%   # daily mean per station
  filter(MESS_DATUM_Adj>="2011-12-31")

Result:

# A tibble: 1,152 x 5
   MESS_DATUM_Adj   `3`  `44`  `73`   `71`
   <date>         <dbl> <dbl> <dbl>  <dbl>
 1 2012-01-01        NA 10.9   4.03  8.14 
 2 2012-02-01        NA  7.41  4.77  6.18 
 3 2012-03-01        NA  6.14  4.46  3.61 
 4 2012-04-01        NA  5.80  4.45  2.48 
 5 2012-05-01        NA  6.63  4.99  2.12 
 6 2012-06-01        NA  5.67  2.28 -0.258
 7 2012-07-01        NA  5.72  1.93  0.996
 8 2012-08-01        NA  5.6   2.91  1.05 
 9 2012-09-01        NA  6.5   2.99  1.42 
10 2012-10-01        NA  5.19  4.47  2.33 
# ... with 1,142 more rows

  

  

 

posted @ 2020-10-24 16:37  DrVonGoosewing