How to use column name as function input variable in R?

1. The Problem

I have a dataset and I will do many summaries base on different columns.

These summaries have basically the same calculation process. However we need to choose different column name as target to be grouped or to be pivoted.

I can do them separately but this will cause many code duplicates. Even if I am not too lazy to copy-and-paste them, but they are theoretically bugs waitting to happen(because of the duplicates).

Normally we like to extract these same process into a function outside.

I simulate the problem with classic datase diamonds.

library(tidyverse)

diamonds

# A tibble: 53,940 x 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39
# … with 53,930 more rows

What we don't want to do is this...

diamonds %>%
  group_by(cut) %>%
  summarise(avg_pri = mean(price))

diamonds %>%
  group_by(color) %>%
  summarise(avg_pri = mean(price))

diamonds %>%
  group_by(clarity) %>%
  summarise(avg_pri = mean(price))

2. Using a function, but how?

First of all, we may come up with an idea like below. But R will tell you that it is wrong. R will think group is a variable, and it cannot find the varibale called gourp.

# this is wrong
my_summary_test0 <- function(tb, group) {
  diamonds %>%
    group_by(group) %>%
    summarise(avg_pri = mean(price))
}

my_summary_test0(diamonds, cut)

# R:
# Must group by variables found in `.data`.
# * Column `group` is not found.

We notice group_by() function needs a vector as input. So it is not too hard for us to think about using [ ] function to extract vector with column name. Actually it works.

And this method works well even with multiple target waiting to be grouped.

# this can work
my_summary_test1 <- function(tb, tar) {
  # tb: a tibble
  # tar: a string(characters)
  diamonds %>%
    group_by(tb[tar]) %>%
    summarise(avg_pri = mean(price))
}

my_summary_test1(diamonds, "cut")

# A tibble: 5 x 2
  cut       avg_pri
* <ord>       <dbl>
1 Fair        4359.
2 Good        3929.
3 Very Good   3982.
4 Premium     4584.
5 Ideal       3458.

# this can work
my_summary_test2 <- function(tb, tar1, tar2) {
  # tb: a tibble
  # tar1, tar2: a string(characters)
  diamonds %>%
    group_by(tb[tar1], tb[tar2]) %>%
    summarise(avg_pri = mean(price))
}

my_summary_test2(diamonds, "cut", "clarity")

# A tibble: 40 x 3
# Groups:   cut [5]
   cut   clarity avg_pri
   <ord> <ord>     <dbl>
 1 Fair  I1        3704.
 2 Fair  SI2       5174.
 3 Fair  SI1       4208.
 4 Fair  VS2       4175.
 5 Fair  VS1       4165.
 6 Fair  VVS2      3350.
 7 Fair  VVS1      3871.
 8 Fair  IF        1912.
 9 Good  I1        3597.
10 Good  SI2       4580.
# … with 30 more rows

But soon I notice something wrong. If I want to make a pivot with pivot_wider() function. (We can try and prove this error is caused by pivot_wider() not pipline.)

# it goes error with pivot_wider()
my_summary_test3 <- function(tb, tar1, tar2) {
  # tb: a tibble
  # tar1, tar2: a string(characters)
  diamonds %>%
    group_by(tb[tar1], tb[tar2]) %>%
    summarise(avg_pri = mean(price)) %>%
    pivot_wider(id_cols = tb[tar1], names_from = tb[tar2], values_from = avg_pri)
}

my_summary_test3(diamonds, "cut", "clarity")

R will tell you, pivot_wider() function needs a <tidy-select> object, but we are giving it a vector. Basicaly these <tidy-select> objects can be created by helper functions like starts_with(), ends_with(), matches(), and so on.

After a little test, I come up with an "tricky" method like below.

my_summary_test4 <- function(tb, tar1, tar2) {
  # tb: a tibble
  # tar1, tar2: a string(characters)
  diamonds %>%
    group_by(tb[tar1], tb[tar2]) %>%
    summarise(avg_pri = mean(price)) %>%
    pivot_wider(id_cols = matches(tar1), names_from = matches(tar2), values_from = avg_pri)
}

# it works
my_summary_test4(diamonds, "cut", "clarity")

# A tibble: 5 x 9
# Groups:   cut [5]
  cut          I1   SI2   SI1   VS2   VS1  VVS2  VVS1    IF
  <ord>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Fair      3704. 5174. 4208. 4175. 4165. 3350. 3871. 1912.
2 Good      3597. 4580. 3690. 4262. 3801. 3079. 2255. 4098.
3 Very Good 4078. 4989. 3932. 4216. 3805. 3038. 2459. 4396.
4 Premium   3947. 5546. 4455. 4550. 4485. 3795. 2831. 3856.
5 Ideal     4336. 4756. 3752. 3285. 3490. 3250. 2468. 2273.

This is acutally a "homebrew" method and it looks strange. If you try to use matches() in group_by() function. Now R will tell you group_by() does not need a <tidy-select> object. It needs a vector.

We cannot try all functions in tidyverse, so maybe next time some other function requires for other kind object as input.

Do we have a more offical method to solve these problems?

3. !!as.name()
There is a function called !!as.name(). Use this fuction we can specify columns.

my_summary_test6 <- function(tb, tar1, tar2) {
  diamonds %>%
    group_by(!!as.name(tar1), !!as.name(tar2)) %>%
    summarise(avg_pri = mean(price)) %>%
    pivot_wider(id_cols = !!as.name(tar1), 
                names_from = !!as.name(tar2), 
                values_from = avg_pri)
}

# correct
my_summary_test6(diamonds, "cut", "clarity")

# A tibble: 5 x 9
# Groups:   cut [5]
  cut          I1   SI2   SI1   VS2   VS1  VVS2  VVS1    IF
  <ord>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Fair      3704. 5174. 4208. 4175. 4165. 3350. 3871. 1912.
2 Good      3597. 4580. 3690. 4262. 3801. 3079. 2255. 4098.
3 Very Good 4078. 4989. 3932. 4216. 3805. 3038. 2459. 4396.
4 Premium   3947. 5546. 4455. 4550. 4485. 3795. 2831. 3856.
5 Ideal     4336. 4756. 3752. 3285. 3490. 3250. 2468. 2273.

Another useage of "!!" is you can furtherly specify column name as input variable. This is off topic though, but still good to know and be packed at our backpocket!

my_summary_test7 <- function(tb, tar1, nm) {
  diamonds %>%
    group_by(!!as.name(tar1)) %>%
    summarise(!!nm := mean(price))
}

my_summary_test7(diamonds, "cut", "your_special_column_name")

# A tibble: 5 x 2
  cut       your_special_column_name
* <ord>                        <dbl>
1 Fair                         4359.
2 Good                         3929.
3 Very Good                    3982.
4 Premium                      4584.
5 Ideal                        3458.

This method seems perfect and consistent. If only one thing you will take away from this article, it will be this function !!as.name().

I cannot find other useful infomation about "!!", or does it a part of function name? I will keep tracking and update when I have enough knowledge.

posted @ 2021-03-14 11:14 DrVonGoosewing 阅读(140) 评论(0) 收藏举报

刷新页面返回顶部

How to use column name as function input variable in R?

公告