Today we are going to look at one of the major new features in
dplyr 1.1.0, per-operation grouping with
.by
/by
. Per-operation grouping is an experimental alternative to
group_by()
which is only active within a single dplyr verb. This is another of the new dplyr features that was inspired by
data.table, this time by their own grouping syntax with by
.
To see the other blog posts in this series, head here.
You can install it from CRAN with:
install.packages("dplyr")
Persistent grouping with group_by()
In dplyr, grouping radically affects the computation of the verb that you use it with. Since the very beginning of dplyr, you’ve been able to perform grouped operations by modifying your data frame with
group_by()
. This grouping is persistent, meaning that it typically sticks around in some form for more than one operation. As an example, take a look at this transactions
dataset which tracks revenue brought in from various transactions across multiple companies. If we wanted to add a column for the total yearly revenue per company, we might do:
transactions <- tibble(
company = c("A", "A", "A", "B", "B", "B"),
year = c(2019, 2019, 2020, 2021, 2023, 2023),
revenue = c(20, 50, 4, 10, 12, 18)
)
transactions
#> # A tibble: 6 × 3
#> company year revenue
#> <chr> <dbl> <dbl>
#> 1 A 2019 20
#> 2 A 2019 50
#> 3 A 2020 4
#> 4 B 2021 10
#> 5 B 2023 12
#> 6 B 2023 18
transactions |>
group_by(company, year) |>
mutate(total = sum(revenue))
#> # A tibble: 6 × 4
#> # Groups: company, year [4]
#> company year revenue total
#> <chr> <dbl> <dbl> <dbl>
#> 1 A 2019 20 70
#> 2 A 2019 50 70
#> 3 A 2020 4 4
#> 4 B 2021 10 10
#> 5 B 2023 12 30
#> 6 B 2023 18 30
Notice that the result is still grouped by both company
and year
. This is useful if you need to follow up with additional grouped operations (with the exact same grouping columns), but many people follow this
mutate()
with an
ungroup()
.
If we only need the totals, we could also use
summarise()
, which peels off 1 layer of grouping by default:
transactions |>
group_by(company, year) |>
summarise(total = sum(revenue))
#> `summarise()` has grouped output by 'company'. You can override using the
#> `.groups` argument.
#> # A tibble: 4 × 3
#> # Groups: company [2]
#> company year total
#> <chr> <dbl> <dbl>
#> 1 A 2019 70
#> 2 A 2020 4
#> 3 B 2021 10
#> 4 B 2023 30
Here the grouping of the output isn’t exactly the same as the input, but we still consider this persistent grouping because some of the groups outlive the verb they were used with.
Per-operation grouping with .by
/by
In dplyr 1.1.0, we’ve added an alternative to
group_by()
known as
.by
that introduces the idea of per-operation grouping:
transactions |>
mutate(total = sum(revenue), .by = c(company, year))
#> # A tibble: 6 × 4
#> company year revenue total
#> <chr> <dbl> <dbl> <dbl>
#> 1 A 2019 20 70
#> 2 A 2019 50 70
#> 3 A 2020 4 4
#> 4 B 2021 10 10
#> 5 B 2023 12 30
#> 6 B 2023 18 30
transactions |>
summarise(total = sum(revenue), .by = c(company, year))
#> # A tibble: 4 × 3
#> company year total
#> <chr> <dbl> <dbl>
#> 1 A 2019 70
#> 2 A 2020 4
#> 3 B 2021 10
#> 4 B 2023 30
There are a few things about .by
worth noting:
-
The result is always ungrouped, regardless of the number of grouping columns. With
.by
, you never need to remember to callungroup()
. -
We used tidyselect to group by multiple columns.
-
summarise()
didn’t emit a message about regrouping.
One of the things we like about .by
is that it allows you to place the grouping specification alongside the code that uses it, rather than in a separate
group_by()
line. This idea was inspired by data.table’s grouping syntax, which looks like:
transactions[, .(total = sum(revenue)), by = .(company, year)]
To see a complete list of dplyr verbs that support .by
, look
here.
.by
or by
?
As you use per-operation grouping in dplyr, you’ll likely notice that some verbs use .by
and others use by
, for example:
transactions |>
slice_max(revenue, n = 2, by = company)
#> # A tibble: 4 × 3
#> company year revenue
#> <chr> <dbl> <dbl>
#> 1 A 2019 50
#> 2 A 2019 20
#> 3 B 2023 18
#> 4 B 2023 12
This is a technical difference resulting from the fact that some verbs consistently use a .
prefix for their arguments, and others don’t (see our design notes on the
dot prefix for more details). Most dplyr verbs use .by
, and we’ve tried to ensure that the cases that are most likely to result in typos instead generate an informative error:
# Uses `by` to be consistent with `n` and `prop`
transactions |>
slice_max(revenue, n = 2, .by = company)
#> Error in `slice_max()`:
#> ! Can't specify an argument named `.by` in this verb.
#> ℹ Did you mean to use `by` instead?
# Uses `.by` to be consistent with `.preserve`
transactions |>
slice(revenue, by = company)
#> Error in `slice()`:
#> ! Can't specify an argument named `by` in this verb.
#> ℹ Did you mean to use `.by` instead?
Translating from group_by()
You shouldn’t feel pressured to translate existing code using
group_by()
to use .by
instead.
group_by()
won’t ever disappear, and is not currently being superseded.
That said, if you do want to start using .by
, there are a few differences from
group_by()
to be aware of.
-
.by
always returns an ungrouped data frame. This is one of the main reasons to use.by
, but is worth keeping in mind if you have existing code that takes advantage of persistent grouping fromgroup_by()
. -
.by
uses tidy-selection.group_by()
, on the other hand, works more likemutate()
in that it allows you to create grouping columns on the fly, i.e.df |> group_by(month = floor_date(date, "month"))
. With.by
, you must create your grouping columns ahead of time. An added benefit of.by
's usage of tidy-selection is that you can supply an external character vector of grouping variables using.by = all_of(groups_vec)
. -
.by
doesn’t sort grouping keys.group_by()
always sorts keys in ascending order, which affects the results of verbs likesummarise()
.
The last point might seem strange, but consider what happens if we preferred our transactions data in order by descending year so that the most recent transactions are at the top.
transactions2 <- transactions |>
arrange(company, desc(year))
transactions2
#> # A tibble: 6 × 3
#> company year revenue
#> <chr> <dbl> <dbl>
#> 1 A 2020 4
#> 2 A 2019 20
#> 3 A 2019 50
#> 4 B 2023 12
#> 5 B 2023 18
#> 6 B 2021 10
# Note that `group_by()` re-ordered
transactions2 |>
group_by(company, year) |>
summarise(total = sum(revenue), .groups = "drop")
#> # A tibble: 4 × 3
#> company year total
#> <chr> <dbl> <dbl>
#> 1 A 2019 70
#> 2 A 2020 4
#> 3 B 2021 10
#> 4 B 2023 30
# But `.by` used whatever order was already there
transactions2 |>
summarise(total = sum(revenue), .by = c(company, year))
#> # A tibble: 4 × 3
#> company year total
#> <chr> <dbl> <dbl>
#> 1 A 2020 4
#> 2 A 2019 70
#> 3 B 2023 30
#> 4 B 2021 10
Notice that .by
doesn’t re-sort the grouping keys. Instead, the previous call to
arrange()
is “respected” in the summary (this is also useful in combination with the new .locale
argument to
arrange()
).
We expect that most code won’t depend on the ordering of these group keys, but it is worth keeping in mind if you are switching to .by
. If you did rely on sorted group keys, you currently need to explicitly call
arrange()
either before or after the call to summarise(.by =)
. In a future release, we may add
an argument to control this.
nest(.by = )
The idea behind .by
turns out to be useful in contexts outside of dplyr. In
tidyr 1.3.0,
nest()
gained a .by
argument, allowing you to specify the columns you want to nest by rather than the columns that appear in the nested results, which often makes for more natural calls to
nest()
.
# Specify what to nest by
transactions |>
nest(.by = company)
#> # A tibble: 2 × 2
#> company data
#> <chr> <list>
#> 1 A <tibble [3 × 2]>
#> 2 B <tibble [3 × 2]>
# Specify what to nest
transactions |>
nest(data = !company)
#> # A tibble: 2 × 2
#> company data
#> <chr> <list>
#> 1 A <tibble [3 × 2]>
#> 2 B <tibble [3 × 2]>
# Specify both, allowing you to drop `year` along the way
transactions |>
nest(data = revenue, .by = company)
#> # A tibble: 2 × 2
#> company data
#> <chr> <list>
#> 1 A <tibble [3 × 1]>
#> 2 B <tibble [3 × 1]>
We currently have 3 different nesting variants in the tidyverse:
tidyr::nest()
,
dplyr::group_nest()
, and
dplyr::nest_by()
. Because the tidyr variant is now the most flexible of all of these, and because
unnest()
also lives in tidyr, we are likely to deprecate the two experimental dplyr options in the future.