It is with great excitement that we announce the release of rlang 0.4.0 on CRAN. rlang is a toolkit for working with core R and Tidyverse features, and hosts the tidy evaluation framework. The full set of changes can be found in the changelog.
In this article, we introduce the most important of these, the new tidy evaluation operator {{. We will use a simple dplyr pipeline as a running example. Let’s start by attaching the package:
library(dplyr)
The good and bad of tidy evaluation
Tidy eval powers packages like dplyr and tidyr. It makes it possible to manipulate data frame columns as if they were defined in the workspace:
gender
#> Error in eval(expr, envir, enclos): object 'gender' not found
mass
#> Error in eval(expr, envir, enclos): object 'mass' not found
starwars %>%
  group_by(gender) %>%
  summarise(mass_maximum = max(mass, na.rm = TRUE))
#> # A tibble: 5 x 2
#> gender mass_maximum
#> <chr> <dbl>
#> 1 <NA> 75
#> 2 female 75
#> 3 hermaphrodite 1358
#> 4 male 159
#> 5 none 140
We call this syntax data masking. This feature is unique to the R language and greatly streamlines the writing and reading of code in interactive scripts. Unfortunately, it also makes it more complex to reuse common patterns inside functions:
max_by <- function(data, var, by) {
  data %>%
    group_by(by) %>%
    summarise(maximum = max(var, na.rm = TRUE))
}
starwars %>% max_by(mass, by = gender)
#> Error: Column `by` is unknown
Technically, this is because data-masked code needs to be delayed and transported to the data context. Behind the scenes, dplyr verbs achieve this by capturing the blueprint of your code, and resuming its evaluation inside the data mask. The example above fails because group_by() is capturing the wrong piece of blueprint: it sees the literal name by rather than the gender column that the caller supplied. To solve this, tidy evaluation provides enquo() to delay the interpretation of code and capture its blueprint, and the surgery operator !! for modifying blueprints. The combination of using enquo() and !! is called the quote-and-unquote pattern:
max_by <- function(data, var, by) {
  data %>%
    group_by(!!enquo(by)) %>%
    summarise(maximum = max(!!enquo(var), na.rm = TRUE))
}
starwars %>% max_by(mass, by = gender)
#> # A tibble: 5 x 2
#> gender maximum
#> <chr> <dbl>
#> 1 <NA> 75
#> 2 female 75
#> 3 hermaphrodite 1358
#> 4 male 159
#> 5 none 140
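To make the two steps more concrete, here is a minimal sketch that uses rlang on its own. The helpers capture() and max_of() and the toy data frame are made up for illustration. The sketch assumes that eval_tidy() is a fair stand-in for what a data-masking verb does when it resumes evaluation inside the data mask.

```r
library(rlang)

# Hypothetical helper: capture the blueprint of the code supplied as `var`
# instead of evaluating it.
capture <- function(var) enquo(var)

blueprint <- capture(mass)
label <- as_label(blueprint)  # the captured code, as a string

# Hypothetical helper: unquote the captured blueprint into a larger
# expression with `!!`, then evaluate that expression with `data` as the mask.
max_of <- function(data, var) {
  var <- enquo(var)
  eval_tidy(quo(max(!!var, na.rm = TRUE)), data = data)
}

# Toy data, made up for this example.
toy <- data.frame(mass = c(10, 75, NA))
result <- max_of(toy, mass)
```

Here label holds the string "mass" and result is 75, even though no object called mass exists in the workspace.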
We have come to realise that this pattern is difficult to teach and to learn because it involves a new, unfamiliar syntax, and because it introduces two new programming concepts (quote and unquote) that are hard to understand intuitively. This complexity is hard to justify for basic programming needs, which rarely require the full flexibility of the pattern.
A simpler interpolation pattern with {{
rlang 0.4.0 provides a new operator, {{ (read: curly curly), which abstracts quote-and-unquote into a single interpolation step. The curly-curly operator should be straightforward to use. When you create a function around a tidyverse pipeline, wrap the function arguments containing data frame variables with {{:
max_by <- function(data, var, by) {
  data %>%
    group_by({{ by }}) %>%
    summarise(maximum = max({{ var }}, na.rm = TRUE))
}
starwars %>% max_by(height)
#> # A tibble: 1 x 1
#> maximum
#> <int>
#> 1 264
starwars %>% max_by(height, by = gender)
#> # A tibble: 5 x 2
#> gender maximum
#> <chr> <int>
#> 1 <NA> 167
#> 2 female 213
#> 3 hermaphrodite 175
#> 4 male 264
#> 5 none 200
This syntax should be reminiscent of string interpolation in the glue package by Jim Hester:
var <- sample(c("woof", "meow", "mooh"), size = 1)
glue::glue("Did you just say {var}?")
#> Did you just say mooh?
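Returning to {{, one way to convince yourself that it is shorthand for quote-and-unquote is to run both versions side by side. The sketch below does this on a small made-up tibble. The names toy, max_by_qq() and max_by_cc() are hypothetical and exist only for this comparison.

```r
library(dplyr)

# Toy data, made up for this example.
toy <- tibble(
  group = c("a", "a", "b"),
  value = c(1, 5, 3)
)

# Quote-and-unquote version: capture with enquo(), insert with `!!`.
max_by_qq <- function(data, var, by) {
  data %>%
    group_by(!!enquo(by)) %>%
    summarise(maximum = max(!!enquo(var), na.rm = TRUE))
}

# Curly-curly version: a single interpolation step.
max_by_cc <- function(data, var, by) {
  data %>%
    group_by({{ by }}) %>%
    summarise(maximum = max({{ var }}, na.rm = TRUE))
}

res_qq <- toy %>% max_by_qq(value, by = group)
res_cc <- toy %>% max_by_cc(value, by = group)

# Both versions should produce the same summary.
same <- isTRUE(all.equal(res_qq, res_cc))
```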
Other simple tidy evaluation patterns
There are a few existing patterns that aren’t emphasised enough in the existing documentation. We are changing our teaching strategy to focus on these simpler patterns.
- If you would like to pass multiple arguments to a data-masking verb, pass ... directly:

  summarise_by <- function(data, ..., by) {
    data %>%
      group_by({{ by }}) %>%
      summarise(...)
  }

  starwars %>%
    summarise_by(
      average = mean(height, na.rm = TRUE),
      maximum = max(height, na.rm = TRUE),
      by = gender
    )
  #> # A tibble: 5 x 3
  #> gender average maximum
  #> <chr> <dbl> <int>
  #> 1 <NA> 120 167
  #> 2 female 165. 213
  #> 3 hermaphrodite 175 175
  #> 4 male 179. 264
  #> 5 none 200 200

  You only need quote-and-unquote (with the plural variants enquos() and !!!) when you need to modify the inputs or their names in some way.

- If you have string inputs, use the .data pronoun:

  max_by <- function(data, var, by) {
    data %>%
      group_by(.data[[by]]) %>%
      summarise(maximum = max(.data[[var]], na.rm = TRUE))
  }

  starwars %>% max_by("height", by = "gender")
  #> # A tibble: 5 x 2
  #> gender maximum
  #> <chr> <int>
  #> 1 <NA> 167
  #> 2 female 213
  #> 3 hermaphrodite 175
  #> 4 male 264
  #> 5 none 200

  The . pronoun from magrittr is not appropriate here because it represents the whole data frame, whereas .data represents the subset for the current group.
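String inputs are handy when the column names come from somewhere else, such as a character vector. As a rough sketch of that situation, the code below loops over column names and applies the .data pattern to each one. The toy tibble and the helper max_by_name() are made up for illustration.

```r
library(dplyr)

# Toy data, made up for this example.
toy <- tibble(
  group  = c("a", "a", "b"),
  height = c(1, 5, 3),
  mass   = c(10, NA, 30)
)

# Hypothetical helper: `var` and `by` are strings naming columns of `data`.
max_by_name <- function(data, var, by) {
  data %>%
    group_by(.data[[by]]) %>%
    summarise(maximum = max(.data[[var]], na.rm = TRUE))
}

# Compute one grouped summary per column name.
columns <- c("height", "mass")
results <- lapply(columns, function(col) max_by_name(toy, col, by = "group"))
names(results) <- columns
```

Each element of results is a summary with one row per group, so results$mass holds the per-group maximum of mass with the missing value ignored.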
To learn more about the different ways of programming around tidyverse pipelines, we recommend reading the new programming vignette in ggplot2, written by Dewey Dunnington, who is currently interning at RStudio.
Thanks!
The following people have contributed to this release by posting issues and pull requests:
@001ben, @asardaes, @BillDunlap, @burchill, @cpsievert, @DavisVaughan, @egnha, @flying-sheep, @gaborcsardi, @gaelledoucet, @GaGaMan1101, @grayskripko, @hadley, @harrysouthworth, @holgerbrandl, @IndrajeetPatil, @jazzmoe, @jennybc, @jjesusfilho, @juangomezduaso, @krlmlr, @lionel-, @Marieag, @mmuurr, @moodymudskipper, @paulponcet, @riccardopinosio, @richierocks, @RolandASc, @romainfrancois, @s-fleck, @siddharthprabhu, @subratiter1, @wch, @wetlandscapes, @wlandau, @x1o, @XWeiZhou, @yenzichun, @yonicd, and @zachary-foster