How to use column name as function input variable in R?
1. The Problem
I have a dataset and I will do many summaries base on different columns.
These summaries have basically the same calculation process. However we need to choose different column name as target to be grouped or to be pivoted.
I can do them separately but this will cause many code duplicates. Even if I am not too lazy to copy-and-paste them, but they are theoretically bugs waitting to happen(because of the duplicates).
Normally we like to extract these same process into a function outside.
I simulate the problem with classic datase diamonds.
library(tidyverse) diamonds
# A tibble: 53,940 x 10 carat cut color clarity depth table price x y z <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 # … with 53,930 more rows
What we don't want to do is this...
diamonds %>% group_by(cut) %>% summarise(avg_pri = mean(price)) diamonds %>% group_by(color) %>% summarise(avg_pri = mean(price)) diamonds %>% group_by(clarity) %>% summarise(avg_pri = mean(price))
2. Using a function, but how?
First of all, we may come up with an idea like below. But R will tell you that it is wrong. R will think group is a variable, and it cannot find the varibale called gourp.
# this is wrong
my_summary_test0 <- function(tb, group) {
diamonds %>%
group_by(group) %>%
summarise(avg_pri = mean(price))
}
my_summary_test0(diamonds, cut)
# R:
# Must group by variables found in `.data`.
# * Column `group` is not found.
We notice group_by() function needs a vector as input. So it is not too hard for us to think about using [ ] function to extract vector with column name. Actually it works.
And this method works well even with multiple target waiting to be grouped.
# this can work
my_summary_test1 <- function(tb, tar) {
# tb: a tibble
# tar: a string(characters)
diamonds %>%
group_by(tb[tar]) %>%
summarise(avg_pri = mean(price))
}
my_summary_test1(diamonds, "cut")
# A tibble: 5 x 2 cut avg_pri * <ord> <dbl> 1 Fair 4359. 2 Good 3929. 3 Very Good 3982. 4 Premium 4584. 5 Ideal 3458.
# this can work
my_summary_test2 <- function(tb, tar1, tar2) {
# tb: a tibble
# tar1, tar2: a string(characters)
diamonds %>%
group_by(tb[tar1], tb[tar2]) %>%
summarise(avg_pri = mean(price))
}
my_summary_test2(diamonds, "cut", "clarity")
# A tibble: 40 x 3 # Groups: cut [5] cut clarity avg_pri <ord> <ord> <dbl> 1 Fair I1 3704. 2 Fair SI2 5174. 3 Fair SI1 4208. 4 Fair VS2 4175. 5 Fair VS1 4165. 6 Fair VVS2 3350. 7 Fair VVS1 3871. 8 Fair IF 1912. 9 Good I1 3597. 10 Good SI2 4580. # … with 30 more rows
But soon I notice something wrong. If I want to make a pivot with pivot_wider() function. (We can try and prove this error is caused by pivot_wider() not pipline.)
# it goes error with pivot_wider()
my_summary_test3 <- function(tb, tar1, tar2) {
# tb: a tibble
# tar1, tar2: a string(characters)
diamonds %>%
group_by(tb[tar1], tb[tar2]) %>%
summarise(avg_pri = mean(price)) %>%
pivot_wider(id_cols = tb[tar1], names_from = tb[tar2], values_from = avg_pri)
}
my_summary_test3(diamonds, "cut", "clarity")
R will tell you, pivot_wider() function needs a <tidy-select> object, but we are giving it a vector. Basicaly these <tidy-select> objects can be created by helper functions like starts_with(), ends_with(), matches(), and so on.
After a little test, I come up with an "tricky" method like below.
my_summary_test4 <- function(tb, tar1, tar2) {
# tb: a tibble
# tar1, tar2: a string(characters)
diamonds %>%
group_by(tb[tar1], tb[tar2]) %>%
summarise(avg_pri = mean(price)) %>%
pivot_wider(id_cols = matches(tar1), names_from = matches(tar2), values_from = avg_pri)
}
# it works
my_summary_test4(diamonds, "cut", "clarity")
# A tibble: 5 x 9 # Groups: cut [5] cut I1 SI2 SI1 VS2 VS1 VVS2 VVS1 IF <ord> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 Fair 3704. 5174. 4208. 4175. 4165. 3350. 3871. 1912. 2 Good 3597. 4580. 3690. 4262. 3801. 3079. 2255. 4098. 3 Very Good 4078. 4989. 3932. 4216. 3805. 3038. 2459. 4396. 4 Premium 3947. 5546. 4455. 4550. 4485. 3795. 2831. 3856. 5 Ideal 4336. 4756. 3752. 3285. 3490. 3250. 2468. 2273.
This is acutally a "homebrew" method and it looks strange. If you try to use matches() in group_by() function. Now R will tell you group_by() does not need a <tidy-select> object. It needs a vector.
We cannot try all functions in tidyverse, so maybe next time some other function requires for other kind object as input.
Do we have a more offical method to solve these problems?
3. !!as.name()
There is a function called !!as.name(). Use this fuction we can specify columns.
my_summary_test6 <- function(tb, tar1, tar2) {
diamonds %>%
group_by(!!as.name(tar1), !!as.name(tar2)) %>%
summarise(avg_pri = mean(price)) %>%
pivot_wider(id_cols = !!as.name(tar1),
names_from = !!as.name(tar2),
values_from = avg_pri)
}
# correct
my_summary_test6(diamonds, "cut", "clarity")
# A tibble: 5 x 9 # Groups: cut [5] cut I1 SI2 SI1 VS2 VS1 VVS2 VVS1 IF <ord> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 Fair 3704. 5174. 4208. 4175. 4165. 3350. 3871. 1912. 2 Good 3597. 4580. 3690. 4262. 3801. 3079. 2255. 4098. 3 Very Good 4078. 4989. 3932. 4216. 3805. 3038. 2459. 4396. 4 Premium 3947. 5546. 4455. 4550. 4485. 3795. 2831. 3856. 5 Ideal 4336. 4756. 3752. 3285. 3490. 3250. 2468. 2273.
Another useage of "!!" is you can furtherly specify column name as input variable. This is off topic though, but still good to know and be packed at our backpocket!
my_summary_test7 <- function(tb, tar1, nm) {
diamonds %>%
group_by(!!as.name(tar1)) %>%
summarise(!!nm := mean(price))
}
my_summary_test7(diamonds, "cut", "your_special_column_name")
# A tibble: 5 x 2 cut your_special_column_name * <ord> <dbl> 1 Fair 4359. 2 Good 3929. 3 Very Good 3982. 4 Premium 4584. 5 Ideal 3458.
This method seems perfect and consistent. If only one thing you will take away from this article, it will be this function !!as.name().
I cannot find other useful infomation about "!!", or does it a part of function name? I will keep tracking and update when I have enough knowledge.
浙公网安备 33010602011771号