This post, published in early December 2018 and promoted on Twitter, generated valuable discussions that led us to reconsider some design choices for dplyr 0.8.0.
We’ve left the original post unchanged, with addenda where changes have been made.
A new release of dplyr (0.8.0) is on the horizon, planned for February 1st (the original target was early January).
Since it is a major release with some potential disruption, we’d love for the community to try it out, give us some feedback, and report issues before we submit to CRAN. This version represents about nine months of development, making dplyr more respectful of factors, and less surprising in its evaluation of expressions.
In this post, we’ll highlight the major changes. Please see the NEWS for a more detailed description of changes. Our formalised process for this release is captured in this issue.
# install.packages("devtools")
devtools::install_github("tidyverse/dplyr@rc_0.8.0")
If needed, you can restore the release version by installing from CRAN:
install.packages("dplyr")
New grouping algorithm
Group creation
The algorithm behind group_by() has been redesigned to better respect factor levels, so that a group is created for each level of the factor, even if there is no data. This differs from previous versions of dplyr, where groups were only created to match the observed data. This closes the epic issue 341, which dates back to 2014 and has generated a lot of press and frustration; see Zero Counts in dplyr for a recent walkthrough of the issue.
Let’s illustrate the new algorithm with the count() function:
df <- tibble(
f1 = factor(c("a", "a", "a", "b", "b"), levels = c("a", "b", "c")),
f2 = factor(c("d", "e", "d", "e", "f"), levels = c("d", "e", "f")),
x = c(1, 1, 1, 2, 2),
y = 1:5
)
df
#> # A tibble: 5 x 4
#> f1 f2 x y
#> <fct> <fct> <dbl> <int>
#> 1 a d 1 1
#> 2 a e 1 2
#> 3 a d 1 3
#> 4 b e 2 4
#> 5 b f 2 5
df %>%
count(f1)
#> # A tibble: 3 x 2
#> f1 n
#> <fct> <int>
#> 1 a 3
#> 2 b 2
#> 3 c 0
Where previous versions of dplyr would have created only two groups (for levels a and b), it now creates one group per level, and the group related to the level c just happens to be empty.
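Base R has long treated factors this way: table() reports a zero count for every declared level. The sketch below uses only base R and the same f1 factor as above; it is an analogy for the new count() behaviour, not a description of dplyr's implementation.

```r
# Same factor as in the example above: level "c" is declared but never observed.
f1 <- factor(c("a", "a", "a", "b", "b"), levels = c("a", "b", "c"))

# table() tabulates over the declared levels, not just the observed values,
# so "c" appears with a count of zero.
counts <- table(f1)
counts

# Dropping unused levels first recovers the "observed data only" view
# that previous versions of dplyr produced.
observed <- table(droplevels(f1))
observed
```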
Groups are still made to match the data on other types of columns:
df %>%
count(x)
#> # A tibble: 2 x 2
#> x n
#> <dbl> <int>
#> 1 1 3
#> 2 2 2
Expansion of groups for factors happens at each step of the grouping, so if we group by f1 and f2 we get 9 groups:
df %>%
count(f1, f2)
#> # A tibble: 9 x 3
#> f1 f2 n
#> <fct> <fct> <int>
#> 1 a d 2
#> 2 a e 1
#> 3 a f 0
#> 4 b d 0
#> 5 b e 1
#> 6 b f 1
#> 7 c d 0
#> 8 c e 0
#> 9 c f 0
When factors and non factors are involved in the grouping, the number of groups depends on the order. At each level of grouping, factors are always expanded to one group per level, but non factors only create groups based on observed data.
df %>%
count(f1, x)
#> # A tibble: 3 x 3
#> f1 x n
#> <fct> <dbl> <int>
#> 1 a 1 3
#> 2 b 2 2
#> 3 c NA 0
In this example, we group by f1 then x. At the first layer, grouping on f1 creates three groups. Each of these groups is then subdivided based on the values of the second variable x. Since x is always 1 when f1 is a, the group is not further divided.
The last group, associated with the level c of the factor f1, is empty, and consequently has no values for the vector x. In that case, group_by() uses NA.
df %>%
count(x, f1)
#> # A tibble: 6 x 3
#> x f1 n
#> <dbl> <fct> <int>
#> 1 1 a 3
#> 2 1 b 0
#> 3 1 c 0
#> 4 2 a 0
#> 5 2 b 2
#> 6 2 c 0
When we group by x then f1, we initially split the data according to x, which gives 2 groups. Each of these two groups is then further divided into 3 groups, i.e. one for each level of f1.
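The order dependence can be checked by hand. The following base R sketch counts the groups the layered rule described above would produce for both orders; the variable names are ours, and the arithmetic only mimics the rule, it does not call dplyr.

```r
f1 <- factor(c("a", "a", "a", "b", "b"), levels = c("a", "b", "c"))
x  <- c(1, 1, 1, 2, 2)

# f1 then x: one layer per factor level, then the observed values of x
# inside that level. An empty level still yields one group (x is NA there).
n_f1_then_x <- sum(vapply(
  levels(f1),
  function(lvl) max(1L, length(unique(x[f1 == lvl]))),
  integer(1)
))

# x then f1: the observed values of x, each expanded to every level of f1.
n_x_then_f1 <- length(unique(x)) * nlevels(f1)

c(f1_then_x = n_f1_then_x, x_then_f1 = n_x_then_f1)
```

The two counts agree with the 3-row and 6-row results printed above.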
We’ve added the possibility to drop the empty groups, and hence get the previous behaviour, by using group_by(.drop = TRUE). This is not the default value, because we still strongly believe that all levels of factors should be represented in the grouping structure.
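As a small illustration of the .drop argument, the sketch below sets it explicitly in both directions so that the result does not depend on which default your installed version uses. The data is a cut-down copy of df from above.

```r
library(dplyr)

df <- tibble(
  f1 = factor(c("a", "a", "a", "b", "b"), levels = c("a", "b", "c"))
)

# Keep one group per factor level, including the empty level "c".
kept <- df %>% group_by(f1, .drop = FALSE) %>% n_groups()

# Drop empty groups, i.e. group on the observed data only.
dropped <- df %>% group_by(f1, .drop = TRUE) %>% n_groups()

c(kept = kept, dropped = dropped)
```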
Group preservation
The grouping structure is more coherently preserved by dplyr verbs.
df %>%
group_by(x, f1) %>%
summarise(y = mean(y))
#> # A tibble: 6 x 3
#> # Groups: x [2]
#> x f1 y
#> <dbl> <fct> <dbl>
#> 1 1 a 2
#> 2 1 b NaN
#> 3 1 c NaN
#> 4 2 a NaN
#> 5 2 b 4.5
#> 6 2 c NaN
The expression mean(y) is evaluated for the empty groups as well, and gives results consistent with:
mean(numeric())
#> [1] NaN
In particular, the result of filter() preserves the grouping structure of the input data frame.
df %>%
group_by(x, f1) %>%
filter(y < 4)
#> # A tibble: 3 x 4
#> # Groups: x, f1 [3]
#> f1 f2 x y
#> <fct> <fct> <dbl> <int>
#> 1 a d 1 1
#> 2 a e 1 2
#> 3 a d 1 3
The resulting tibble after the filter() call has six groups, the exact same groups that were made by group_by(). Previous versions of dplyr would perform an implicit group_by() after the filtering, potentially losing groups.
Because this is potentially disruptive, filter() has gained a .preserve argument: when .preserve is FALSE, the data is first filtered and then regrouped:
df %>%
group_by(x, f1) %>%
filter(y < 5, .preserve = FALSE)
#> # A tibble: 4 x 4
#> # Groups: x, f1 [6]
#> f1 f2 x y
#> <fct> <fct> <dbl> <int>
#> 1 a d 1 1
#> 2 a e 1 2
#> 3 a d 1 3
#> 4 b e 2 4
As opposed to what is described above, feedback from this post led us to change the default value of .preserve to FALSE, and update the algorithm to limit the cost of preserving.
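To see the effect of .preserve in isolation, here is a minimal sketch that groups on a plain numeric column (no factors involved) and passes the argument explicitly. The toy tibble is ours, not part of the original examples.

```r
library(dplyr)

toy <- tibble(x = c(1, 1, 2), y = 1:3) %>%
  group_by(x)

# .preserve = TRUE keeps both groups, even though the x == 2 group
# has no rows left after filtering.
preserved <- toy %>% filter(y < 3, .preserve = TRUE) %>% n_groups()

# .preserve = FALSE filters first and then rebuilds the groups
# from the remaining rows.
regrouped <- toy %>% filter(y < 3, .preserve = FALSE) %>% n_groups()

c(preserved = preserved, regrouped = regrouped)
```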
Note, however, that even .preserve = FALSE respects the factors that are used as grouping variables; in particular, filter( , .preserve = FALSE) is not a way to discard empty groups. The forcats 📦 may help:
iris %>%
group_by(Species) %>%
filter(stringr::str_detect(Species, "^v")) %>%
ungroup() %>%
group_by(Species = forcats::fct_drop(Species))
#> # A tibble: 100 x 5
#> # Groups: Species [2]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 7 3.2 4.7 1.4 versicolor
#> 2 6.4 3.2 4.5 1.5 versicolor
#> 3 6.9 3.1 4.9 1.5 versicolor
#> 4 5.5 2.3 4 1.3 versicolor
#> 5 6.5 2.8 4.6 1.5 versicolor
#> 6 5.7 2.8 4.5 1.3 versicolor
#> 7 6.3 3.3 4.7 1.6 versicolor
#> 8 4.9 2.4 3.3 1 versicolor
#> 9 6.6 2.9 4.6 1.3 versicolor
#> 10 5.2 2.7 3.9 1.4 versicolor
#> # … with 90 more rows
Furthermore, the group_trim() function has been added. group_trim() recalculates the grouping metadata after dropping unused levels for all grouping variables that are factors.
iris %>%
group_by(Species) %>%
filter(stringr::str_detect(Species, "^v")) %>%
group_trim()
#> # A tibble: 100 x 5
#> # Groups: Species [2]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 7 3.2 4.7 1.4 versicolor
#> 2 6.4 3.2 4.5 1.5 versicolor
#> 3 6.9 3.1 4.9 1.5 versicolor
#> 4 5.5 2.3 4 1.3 versicolor
#> 5 6.5 2.8 4.6 1.5 versicolor
#> 6 5.7 2.8 4.5 1.3 versicolor
#> 7 6.3 3.3 4.7 1.6 versicolor
#> 8 4.9 2.4 3.3 1 versicolor
#> 9 6.6 2.9 4.6 1.3 versicolor
#> 10 5.2 2.7 3.9 1.4 versicolor
#> # … with 90 more rows
New grouping functions
The grouping family is extended with new functions:
- group_nest(): similar to tidyr::nest() but focusing on the grouping columns rather than the columns to nest
- group_split(): similar to base::split() but the grouping is subject to the data mask
- group_keys(): retrieves a tibble with one row per group and one column per grouping variable
- group_rows(): retrieves a list of 1-based integer vectors, each vector represents the indices of the group in the grouped data frame
The primary use case for these functions is with already grouped data frames, which may directly or indirectly originate from group_by().
data <- iris %>%
group_by(Species) %>%
filter(Sepal.Length > mean(Sepal.Length))
data %>%
group_nest()
#> # A tibble: 3 x 2
#> Species data
#> <fct> <list>
#> 1 setosa <tibble [22 × 4]>
#> 2 versicolor <tibble [24 × 4]>
#> 3 virginica <tibble [22 × 4]>
data %>%
group_split()
#> [[1]]
#> # A tibble: 22 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 5.4 3.9 1.7 0.4 setosa
#> 3 5.4 3.7 1.5 0.2 setosa
#> 4 5.8 4 1.2 0.2 setosa
#> 5 5.7 4.4 1.5 0.4 setosa
#> 6 5.4 3.9 1.3 0.4 setosa
#> 7 5.1 3.5 1.4 0.3 setosa
#> 8 5.7 3.8 1.7 0.3 setosa
#> 9 5.1 3.8 1.5 0.3 setosa
#> 10 5.4 3.4 1.7 0.2 setosa
#> # … with 12 more rows
#>
#> [[2]]
#> # A tibble: 24 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 7 3.2 4.7 1.4 versicolor
#> 2 6.4 3.2 4.5 1.5 versicolor
#> 3 6.9 3.1 4.9 1.5 versicolor
#> 4 6.5 2.8 4.6 1.5 versicolor
#> 5 6.3 3.3 4.7 1.6 versicolor
#> 6 6.6 2.9 4.6 1.3 versicolor
#> 7 6 2.2 4 1 versicolor
#> 8 6.1 2.9 4.7 1.4 versicolor
#> 9 6.7 3.1 4.4 1.4 versicolor
#> 10 6.2 2.2 4.5 1.5 versicolor
#> # … with 14 more rows
#>
#> [[3]]
#> # A tibble: 22 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 7.1 3 5.9 2.1 virginica
#> 2 7.6 3 6.6 2.1 virginica
#> 3 7.3 2.9 6.3 1.8 virginica
#> 4 6.7 2.5 5.8 1.8 virginica
#> 5 7.2 3.6 6.1 2.5 virginica
#> 6 6.8 3 5.5 2.1 virginica
#> 7 7.7 3.8 6.7 2.2 virginica
#> 8 7.7 2.6 6.9 2.3 virginica
#> 9 6.9 3.2 5.7 2.3 virginica
#> 10 7.7 2.8 6.7 2 virginica
#> # … with 12 more rows
data %>%
group_keys()
#> # A tibble: 3 x 1
#> Species
#> <fct>
#> 1 setosa
#> 2 versicolor
#> 3 virginica
data %>%
group_rows()
#> [[1]]
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
#>
#> [[2]]
#> [1] 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
#> [24] 46
#>
#> [[3]]
#> [1] 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
Alternatively, these functions may be used on an ungrouped data frame, together with a grouping specification that is subject to the data mask. In that case, the grouping is implicitly performed by group_by():
iris %>%
group_nest(Species)
#> # A tibble: 3 x 2
#> Species data
#> <fct> <list>
#> 1 setosa <tibble [50 × 4]>
#> 2 versicolor <tibble [50 × 4]>
#> 3 virginica <tibble [50 × 4]>
iris %>%
group_split(Species)
#> [[1]]
#> # A tibble: 50 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> # … with 40 more rows
#>
#> [[2]]
#> # A tibble: 50 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 7 3.2 4.7 1.4 versicolor
#> 2 6.4 3.2 4.5 1.5 versicolor
#> 3 6.9 3.1 4.9 1.5 versicolor
#> 4 5.5 2.3 4 1.3 versicolor
#> 5 6.5 2.8 4.6 1.5 versicolor
#> 6 5.7 2.8 4.5 1.3 versicolor
#> 7 6.3 3.3 4.7 1.6 versicolor
#> 8 4.9 2.4 3.3 1 versicolor
#> 9 6.6 2.9 4.6 1.3 versicolor
#> 10 5.2 2.7 3.9 1.4 versicolor
#> # … with 40 more rows
#>
#> [[3]]
#> # A tibble: 50 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 6.3 3.3 6 2.5 virginica
#> 2 5.8 2.7 5.1 1.9 virginica
#> 3 7.1 3 5.9 2.1 virginica
#> 4 6.3 2.9 5.6 1.8 virginica
#> 5 6.5 3 5.8 2.2 virginica
#> 6 7.6 3 6.6 2.1 virginica
#> 7 4.9 2.5 4.5 1.7 virginica
#> 8 7.3 2.9 6.3 1.8 virginica
#> 9 6.7 2.5 5.8 1.8 virginica
#> 10 7.2 3.6 6.1 2.5 virginica
#> # … with 40 more rows
iris %>%
group_keys(Species)
#> # A tibble: 3 x 1
#> Species
#> <fct>
#> 1 setosa
#> 2 versicolor
#> 3 virginica
These functions are related to each other in how they handle and organize the grouping information and who/what is responsible for maintaining the relation between the data and the groups.
- A grouped data frame, as generated by group_by(), stores the grouping information as an attribute of the data frame; dplyr verbs use that information to maintain the relationship.
- When using group_nest(), the data is structured as a data frame that has a list column to hold the non-grouping columns. The result of group_nest() is not a grouped data frame, therefore the structure of the data frame maintains the relationship.
- When using group_split(), the data is split into a list, and each element of the list contains a tibble with the rows of the associated group. The user is responsible for maintaining the relationship, and may benefit from the assistance of the group_keys() function, especially in the presence of empty groups.
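One way to keep split pieces and their keys together is to compute both from the same grouped data frame, since the two results are in the same group order. This is a sketch of that pattern; the names pieces, keys and sizes are ours.

```r
library(dplyr)

grouped <- iris %>% group_by(Species)

# One tibble per group, and one row of keys per group, in the same order.
pieces <- group_split(grouped)
keys   <- group_keys(grouped)

# Any per-piece result can be attached back to the keys by position.
sizes <- vapply(pieces, nrow, integer(1))
summary_tbl <- mutate(keys, n = sizes)
summary_tbl
```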
Iterate on grouped tibbles by group
The new group_map() function provides a purrr style function that can be used to iterate on grouped tibbles. Each conceptual group of the data frame is exposed to the function with two pieces of information:
- The subset of the data for the group, exposed as .x.
- The key, a tibble with exactly one row and columns for each grouping variable, exposed as .y.
mtcars %>%
group_by(cyl) %>%
group_map(~ head(.x, 2L))
#> # A tibble: 6 x 11
#> # Groups: cyl [3]
#> cyl mpg disp hp drat wt qsec vs am gear carb
#> * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 4 22.8 108 93 3.85 2.32 18.6 1 1 4 1
#> 2 4 24.4 147. 62 3.69 3.19 20 1 0 4 2
#> 3 6 21 160 110 3.9 2.62 16.5 0 1 4 4
#> 4 6 21 160 110 3.9 2.88 17.0 0 1 4 4
#> 5 8 18.7 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 8 14.3 360 245 3.21 3.57 15.8 0 0 3 4
mtcars %>%
group_by(cyl) %>%
group_map(~ tibble(mod = list(lm(mpg ~ disp, data = .x))))
#> # A tibble: 3 x 2
#> # Groups: cyl [3]
#> cyl mod
#> * <dbl> <list>
#> 1 4 <S3: lm>
#> 2 6 <S3: lm>
#> 3 8 <S3: lm>
The lambda function must return a data frame. group_map() row binds the data frames, recycles the grouping columns and structures the result as a grouped tibble.
group_walk() can be used when iterating on the groups is only desired for side effects. It applies the formula to each group, and then silently returns its input.
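A short group_walk() sketch, using message() as the side effect. The formula sees the group's rows as .x and its one-row key as .y, as described above; the message wording is arbitrary.

```r
library(dplyr)

# Emit one message per group; the return value of the formula is ignored.
out <- mtcars %>%
  group_by(cyl) %>%
  group_walk(~ message("cyl = ", .y$cyl, ": ", nrow(.x), " rows"))

# group_walk() hands back its input, so a pipeline can continue from here.
nrow(out)
```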
Changes in filter and slice
Besides the changes described previously related to preservation of the grouping structure, filter() and slice() now reorganize the data by groups for performance reasons:
tibble(
x = c(1, 2, 1, 2, 1),
y = c(1, 2, 3, 4, 5)
) %>%
group_by(x) %>%
filter(y < 5)
#> # A tibble: 4 x 2
#> # Groups: x [2]
#> x y
#> <dbl> <dbl>
#> 1 1 1
#> 2 2 2
#> 3 1 3
#> 4 2 4
This has been reverted for filter() due to popular demand. Calling filter() on a grouped data frame leaves the rows in the original order.
Redesigned hybrid evaluation
What’s hybrid evaluation again?
Hybrid evaluation is used in summarise() and mutate() to replace potentially expensive R operations with native C++ code that is group aware.
iris %>%
group_by(Species) %>%
summarise(Petal.Length = mean(Petal.Length))
#> # A tibble: 3 x 2
#> Species Petal.Length
#> <fct> <dbl>
#> 1 setosa 1.46
#> 2 versicolor 4.26
#> 3 virginica 5.55
In the example, the base::mean() function is never called because the hybrid alternative can directly calculate the mean for each group. Hybrid evaluation typically gives better performance because it needs fewer memory allocations.
In this example, a standard evaluation path would need to:
- create subsets of the Petal.Length column for each group
- call the base::mean() function on each subset, which would also imply a cost for S3 dispatching to the right method
- collect all results in a new vector
In contrast, hybrid evaluation can directly allocate the final vector, and calculate all 3 means without having to allocate the subsets.
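The two paths can be imitated in base R. This is only an analogy: the real hybrid path is C++ code inside dplyr, and rowsum() and tabulate() merely stand in for a computation that never materialises per-group subsets at the R level.

```r
# Standard path: materialise one subset per group, call mean() on each,
# then collect the results.
subsets  <- split(iris$Petal.Length, iris$Species)
standard <- vapply(subsets, mean, numeric(1))

# Subset-free path: accumulate per-group sums and counts over the whole
# column, then divide once.
sums   <- rowsum(iris$Petal.Length, iris$Species)[, 1]
counts <- tabulate(iris$Species, nbins = nlevels(iris$Species))
direct <- sums / counts

all.equal(unname(standard), unname(direct))
```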
Flaws in previous version
Previous versions of hybrid evaluation relied on folding to replace parts of the expression with their hybrid result. For example, there are hybrid versions of sum() and n(), so previous versions attempted to use them for:
iris %>%
group_by(Species) %>%
summarise(Petal.Length = sum(Petal.Length) / n())
#> # A tibble: 3 x 2
#> Species Petal.Length
#> <fct> <dbl>
#> 1 setosa 1.46
#> 2 versicolor 4.26
#> 3 virginica 5.55
The gain of replacing parts of the expression with the result of the hybrid versions was minimal, and we had to rely on brittle heuristics to try to respect standard R evaluation semantics.
New implementation
The new hybrid system is stricter and falls back to standard R evaluation when the expression is not entirely recognized.
The hybrid_call() function (subject to change) can be used to test if an expression would be handled by hybrid or standard evaluation:
iris %>% hybrid_call(mean(Sepal.Length))
#> <hybrid evaluation>
#> call : base::mean(Sepal.Length)
#> C++ class : dplyr::hybrid::internal::SimpleDispatchImpl<14, false, dplyr::NaturalDataFrame, dplyr::hybrid::internal::MeanImpl>
iris %>% hybrid_call(sum(Sepal.Length) / n())
#> <standard evaluation>
#> call : sum(Sepal.Length)/n()
iris %>% hybrid_call(+mean(Sepal.Length))
#> <standard evaluation>
#> call : +mean(Sepal.Length)
Hybrid is very picky about what it can handle. For example, TRUE and FALSE are fine for na.rm= because they are reserved words that can’t be replaced, but T, F or any expression that would resolve to a scalar logical are not:
iris %>% hybrid_call(mean(Sepal.Length, na.rm = TRUE))
#> <hybrid evaluation>
#> call : base::mean(Sepal.Length, na.rm = TRUE)
#> C++ class : dplyr::hybrid::internal::SimpleDispatchImpl<14, true, dplyr::NaturalDataFrame, dplyr::hybrid::internal::MeanImpl>
iris %>% hybrid_call(mean(Sepal.Length, na.rm = T))
#> <standard evaluation>
#> call : mean(Sepal.Length, na.rm = T)
iris %>% hybrid_call(mean(Sepal.Length, na.rm = 1 == 1))
#> <standard evaluation>
#> call : mean(Sepal.Length, na.rm = 1 == 1)
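The distinction between TRUE and T is plain base R behaviour and can be checked without dplyr. In this sketch, T is rebound inside local() so the change does not leak into your session, while an attempt to assign to TRUE fails.

```r
# T is an ordinary variable, so it can be rebound. Here na.rm = T
# silently means na.rm = FALSE and the NA is not removed.
res <- local({
  T <- FALSE
  mean(c(1, NA), na.rm = T)
})
res

# TRUE is a reserved word: trying to assign to it is an error.
reserved_is_assignable <- tryCatch(
  {
    eval(parse(text = "TRUE <- FALSE"))
    TRUE
  },
  error = function(e) FALSE
)
reserved_is_assignable
```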
The first step of the new hybrid system consists of studying the expression and comparing it to known expression patterns. If we find an exact match, then we have all the information we need, and R is never called to materialize the result.
When there is no match, the expression gets evaluated for each group using R standard evaluation rules in the data mask: a special environment that makes the columns available and uses contextual information for functions such as n() and row_number().
iris %>%
group_by(Species) %>%
summarise(Petal.Length = sum(Petal.Length) / n())
#> # A tibble: 3 x 2
#> Species Petal.Length
#> <fct> <dbl>
#> 1 setosa 1.46
#> 2 versicolor 4.26
#> 3 virginica 5.55
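To make the idea of a data mask concrete, here is a toy version written in base R. It is a hypothetical sketch, not dplyr's implementation: for each group it builds an environment holding that group's columns plus a context-aware n(), and evaluates the expression there.

```r
expr <- quote(sum(Petal.Length) / n())

per_group <- vapply(
  split(iris, iris$Species),
  function(chunk) {
    # Toy mask: the group's columns become variables in a fresh environment.
    mask <- list2env(as.list(chunk), parent = baseenv())
    # n() gets its answer from the context, i.e. the current group.
    mask$n <- function() nrow(chunk)
    eval(expr, mask)
  },
  numeric(1)
)
per_group
```

The values agree with the summarise() output above once rounded for printing.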
Performance
When summarise() or mutate() use expressions that cannot be handled by hybrid evaluation, they call back to R from the C++ internals for each group. This is an expensive operation because the expressions have to be evaluated with extra care. Traditionally it meant wrapping the expression in an R tryCatch() before evaluating, but R 3.5.0 has added unwind protection, which we exposed to Rcpp. Consequently, the cost of evaluating an R expression carefully is lower than before.
We ran a benchmark calculating the means of 10,000 small groups with the release version of dplyr and this release candidate with and without using the unwind protect feature.
Just using the mean() function would not illustrate the feature, because dplyr would use hybrid evaluation and never use callbacks to R. So instead we defined a mean_ function that has the same body as base::mean(). We also compare this to the expression sum(x) / n() because it would have been handled by partial hybrid evaluation in previous versions.
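The exact definition used in the benchmark is not reproduced here. One plausible way to write such a function, assuming "same body" means reusing the S3 dispatch call that base::mean() consists of, is:

```r
# Hypothetical stand-in for the benchmark's mean_(): same body as
# base::mean(), but a different function name, so the hybrid evaluator
# does not recognise it and falls back to standard evaluation.
mean_ <- function(x, ...) UseMethod("mean")

# It dispatches to the usual mean methods and returns the same values.
mean_(c(1, 2, 3))
mean_(c(1, NA, 3), na.rm = TRUE)
```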
This is not a comprehensive benchmark analysis, but on this small example we can read:
- unwind protection has no impact when using the hybrid evaluation; this is not a surprise because the hybrid path does not call back to R.
- hybrid evaluation performs better on the release candidate. This is a direct consequence of the redesign of hybrid evaluation.
- unwind protection gives a performance boost for mean_(). Please note that the x axis is on a log scale.
- unwind protection more than compensates for no longer using partial hybrid evaluation.
nest_join
The nest_join() function is the newest addition to the join family.
band_members %>%
nest_join(band_instruments)
#> Joining, by = "name"
#> # A tibble: 3 x 3
#> name band band_instruments
#> * <chr> <chr> <list>
#> 1 Mick Stones <tibble [0 × 1]>
#> 2 John Beatles <tibble [1 × 1]>
#> 3 Paul Beatles <tibble [1 × 1]>
A nest join of x and y returns all rows and all columns from x, plus an additional column that contains a list of tibbles. Each tibble contains all the rows from y that match that row of x.
When there is no match, the list column is a 0-row tibble with the same column names and types as y.
nest_join() is the most fundamental join since you can recreate the other joins from it:
- inner_join() is a nest_join() plus a tidyr::unnest()
- left_join() is a nest_join() plus a tidyr::unnest() with drop=TRUE
- semi_join() is a nest_join() plus a filter() where you check that every element of data has at least one row.
- anti_join() is a nest_join() plus a filter() where you check every element has zero rows.
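The filtering joins in the list above can be sketched directly. The example passes by and name explicitly (the column name instruments is our choice) and counts the rows of each nested tibble to decide which rows of x to keep.

```r
library(dplyr)

nested <- nest_join(band_members, band_instruments,
                    by = "name", name = "instruments")

# Number of matching rows of y for each row of x.
n_matches <- vapply(nested$instruments, nrow, integer(1))

# semi_join(): keep rows of x with at least one match.
semi <- nested %>% filter(n_matches > 0) %>% select(-instruments)

# anti_join(): keep rows of x with no match.
anti <- nested %>% filter(n_matches == 0) %>% select(-instruments)

semi
anti
```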
Scoped variants
The scoped (or colwise) verbs are the set of verbs with _at, _if and _all suffixes. These verbs apply a certain behaviour (for instance, a mutating or summarising operation) to a given selection of columns. This release of dplyr improves the consistency of the syntax and the behaviour with grouped tibbles.
A purrr-like syntax for passing functions
In dplyr 0.8.0, we have implemented support for functions and purrr-style lambda functions:
iris <- as_tibble(iris) # For concise print method
mutate_if(iris, is.numeric, ~ . - mean(.))
#> # A tibble: 150 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 -0.743 0.443 -2.36 -0.999 setosa
#> 2 -0.943 -0.0573 -2.36 -0.999 setosa
#> 3 -1.14 0.143 -2.46 -0.999 setosa
#> 4 -1.24 0.0427 -2.26 -0.999 setosa
#> 5 -0.843 0.543 -2.36 -0.999 setosa
#> 6 -0.443 0.843 -2.06 -0.799 setosa
#> 7 -1.24 0.343 -2.36 -0.899 setosa
#> 8 -0.843 0.343 -2.26 -0.999 setosa
#> 9 -1.44 -0.157 -2.36 -0.999 setosa
#> 10 -0.943 0.0427 -2.26 -1.10 setosa
#> # … with 140 more rows
And lists of functions and purrr-style lambda functions:
fns <- list(
centered = mean, # Function object
scaled = ~ (. - mean(.)) / sd(.) # Purrr-style lambda
)
mutate_if(iris, is.numeric, fns)
#> # A tibble: 150 x 13
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> # … with 140 more rows, and 8 more variables: Sepal.Length_centered <dbl>,
#> # Sepal.Width_centered <dbl>, Petal.Length_centered <dbl>,
#> # Petal.Width_centered <dbl>, Sepal.Length_scaled <dbl>,
#> # Sepal.Width_scaled <dbl>, Petal.Length_scaled <dbl>,
#> # Petal.Width_scaled <dbl>
This is now the preferred syntax for passing functions to the scoped verbs because it is simpler and consistent with purrr.
Starting with dplyr 0.8.0, the hybrid evaluator recognises and inlines these lambdas, so that native implementations of common algorithms will kick in just as they did with expressions passed with funs().
Consequently, we are soft-deprecating funs(): it will continue to work without any warnings for now, but will eventually start issuing warnings.
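For readers new to the formula notation, a purrr-style lambda is shorthand for a one-argument function whose argument is referred to as . (or .x). The sketch below uses rlang::as_function() to make that conversion visible; the scoped verbs accept the formula directly, so the explicit conversion is for illustration only.

```r
library(rlang)

# The formula ~ . - mean(.) stands for function(x) x - mean(x).
center <- as_function(~ . - mean(.))
center(c(1, 2, 3))

# Parentheses matter when scaling: division binds tighter than subtraction.
scale_ <- as_function(~ (. - mean(.)) / sd(.))
scale_(c(1, 2, 3))
```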
Behaviour with grouped tibbles
We have reviewed the documentation of all scoped variants to make clear how the scoped operations are applied to grouped tibbles. For most of the scoped verbs, the operations also apply on the grouping variables when they are part of the selection. This includes:
- arrange_all(), arrange_at(), and arrange_if()
- distinct_all(), distinct_at(), and distinct_if()
- filter_all(), filter_at(), and filter_if()
- group_by_all(), group_by_at(), and group_by_if()
- select_all(), select_at(), and select_if()
This is not the case for summarising and mutating variants where operations are not applied on grouping variables.
The behaviour depends on whether the selection is implicit (all and if selections) or explicit (at selections).
Grouping variables covered by explicit selections (with summarise_at(), mutate_at(), and transmute_at()) are always an error.
For implicit selections, the grouping variables are always ignored. In this case, the level of verbosity depends on the kind of operation:
- Summarising operations (summarise_all() and summarise_if()) ignore grouping variables silently because it is obvious that operations are not applied on grouping variables.
- On the other hand, it isn’t as obvious in the case of mutating operations (mutate_all(), mutate_if(), transmute_all(), and transmute_if()). For this reason, they issue a message indicating which grouping variables are ignored.
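A minimal sketch of the implicit-selection case for summarising verbs: the grouping variable cyl is carried through as a key and is not itself averaged. The column subset is arbitrary and only keeps the output small.

```r
library(dplyr)

out <- mtcars %>%
  select(cyl, mpg, hp) %>%
  group_by(cyl) %>%
  summarise_all(mean)

# One row per group; cyl identifies the group, mpg and hp are group means.
out
```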
In order to make it easier to explicitly remove the grouping columns from an _at selection, we have introduced a new selection helper, group_cols(). Just like last_col() matches the last column of a tibble, group_cols() matches all grouping columns:
mtcars %>%
group_by(cyl) %>%
select(group_cols())
#> # A tibble: 32 x 1
#> # Groups: cyl [3]
#> cyl
#> * <dbl>
#> 1 6
#> 2 6
#> 3 4
#> 4 6
#> 5 8
#> 6 6
#> 7 8
#> 8 4
#> 9 4
#> 10 6
#> # … with 22 more rows
This new helper is mostly intended for selection in scoped variants:
mtcars %>%
group_by(cyl) %>%
mutate_at(
vars(starts_with("c")),
~ . - mean(.)
)
#> Error: Column `cyl` can't be modified because it's a grouping variable
The helper makes it easy to explicitly remove the grouping variables:
mtcars %>%
group_by(cyl) %>%
mutate_at(
vars(starts_with("c"), -group_cols()),
~ . - mean(.)
)
#> # A tibble: 32 x 11
#> # Groups: cyl [3]
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 0.571
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 0.571
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 -0.545
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 -2.43
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 -1.5
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 -2.43
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 0.5
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 0.455
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 0.455
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 0.571
#> # … with 22 more rows