How to Import Datasets with Repeatable Filename

1. Introduction

When we have a project which needs to import data repeatly, we bascially have two choices:

a. Luckly we have password to database. We can hardcode SQL sentence in our analysis code. However, most of time that's not the case.

b. We can manually download data from somewhere else, like website, email, IM and so on. 

In this situation, downloaded filename will have a part all the same and a suffix. For example, monthly_202010.csv.

The inconsistency in filename makes us have to change filename in analysis code every time we runs it. Otherwise we can change filename in os each time, not so useful either.

In this article, we will talk about regular expression("regex"). With the help of  it we can build our filename repeatable and selfadapt.

Without refering to wikipedia or offical definition, regular expression is mostly a string.  But not like normal string, it contains some special symbol for fuzzy matching.

For example, "abc" can only equal "abc", but regular expression "ab." can equal to "abc" as well. 

 

2. How to use regular expression?

This is not a complete study about regular expression, so we will focus on how to match filename with regular expression.

2.1 R

We have a base R function called list.files(). It is the same as dir() but just with two names.

This function has two important arguments

1. "pattern=".  It accepts regular expression, which can help us locate the file we need.

2. “full.names=". Set to TRUE so it will return a full path to the file.

(Updated to article at Dec.2020, I didn't notice this argument “full.names=TRUE" at first place. With the help of this one, we don't have to paste a full path to file ourself.)

With the help of list.files() and a little bit of string concat, we can build a selfadpat and repeatable filename in analysis code.

For example:

filename <- list.files('./data/', pattern='monthly')
file <- paste0('./data/', filename)
monthly <- read.csv(file)
...

As long as we have only one "monthly" data in the folder, this code can adapt no matter how it change it's suffix. It works well if we don't care about storing historical data so we delete/replace/transfer the data each time.

However, what about different files with the same name? Like, monthly_202010, monthly_202009?

2.1.1 time

In a more complex situation, if we have same pattern of filename, but only with different suffix. We can build our regular expression with time function easily.

For example, in same data folder we have monthly_202010.csv and monthly_202009.csv. But we only want monthly_202010.csv.

list.files('./data/', pattern='monthly')
# output "monthly_202009.csv" "monthly_202010.csv"

now <- Sys.time() # a string
pattern <- paste('monthly', strftime(now, "%Y%m"), sep='_')
list.files('./data/', pattern=pattern)
# output "monthly_202010.csv"

2.1.2 ^ and $

We have symbol ^ and $ to control where the pattern of filename appears. Use ^ to lock the pattern of filename must appesrs at beginning.

For example, in same data folder we have monthly_202010.csv and bin_monthly.csv. But we only want monthly_202010.csv.

list.files('./data/', pattern='monthly')
# output "bin_monthly.csv"    "monthly_202010.csv"

list.files('./data/', pattern='^monthly')
# output "monthly_202010.csv"

2.1.3 [] and +

The last common used symbol is [] and + and they oftern work together. [] is a collection of how the pattern is build char by char and + means pattern of char in [] can appear one time or more.

For example, in same data folder we have monthly_202010.csv and bin20_monthly.csv. But we only want bin20_monthly.csv.

In this special case we can actually match pattern='monthly.csv', but let's have look at how [] and + works together.

list.files('./data/', pattern='monthly')
# output "bin20_monthly.csv"  "monthly_202010.csv"

list.files('./data/', pattern='[a-z0-9_]+monthly')
# output "bin20_monthly.csv"

2.2 Python

We will use package glob to match filename in Python. This package is ship with Python installation.

glob.glob() does not use rigid regular expression. For example, it has no symbol + but it use * only.

Like example above, in Python we can write as below:

import glob

glob.glob('./data/*')
# output ['./data\\bin20_monthly.csv', './data\\monthly_202010.csv']

glob.glob('./data/monthly*')
# output ['./data\\monthly_202010.csv']

glob.glob('./data/[a-z0-9_]*monthly*')
# output ['./data\\bin20_monthly.csv']

  

3. Detour or Shortcut?

It is easy to notice for the purpose of repeatable filename, we will have extra code in our analysis. This can be a detour if we only use the code one or two times. We are wasting our time to figure which regular expression should we use. In that situation I will like to recommend copy-and-paste the filename directly instead. 

But as I mention in the beginning, if we have a project will repeat itself many times in the future, and the filename is always changing suffix, using regular expression is indeed a shortcut. If you want to build very complex regular expression, I recommond this websit for reference.

 

posted @ 2020-10-03 11:48  DrVonGoosewing  阅读(121)  评论(0)    收藏  举报