We’re happy to announce the release of dplyr 1.0.4, featuring two new functions, if_all() and if_any(), and improved performance for across().
You can install it from CRAN with:
install.packages("dplyr")
You can see a full list of changes in the release notes.
if_any() and if_all()
The across() function, introduced in dplyr 1.0.0, is proving to be a successful addition to dplyr. In case you missed it, across() lets you conveniently express a set of actions to be performed across a tidy selection of columns. across() is very useful within summarise() and mutate(), but it’s hard to use with filter() because it is not clear how the results would be combined into one logical vector. To fill that gap, we’re introducing two new functions, if_all() and if_any(). Let’s dive straight into an example:
library(dplyr, warn.conflicts = FALSE)
library(palmerpenguins)
big <- function(x) {
  x > mean(x, na.rm = TRUE)
}
# keep rows if all the selected columns are "big"
penguins %>%
  filter(if_all(contains("bill"), big))
#> # A tibble: 61 x 8
#> species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Adelie Torge… 46 21.5 194 4200
#> 2 Adelie Dream 44.1 19.7 196 4400
#> 3 Adelie Torge… 45.8 18.9 197 4150
#> 4 Adelie Biscoe 45.6 20.3 191 4600
#> 5 Adelie Torge… 44.1 18 210 4000
#> 6 Gentoo Biscoe 44.4 17.3 219 5250
#> 7 Gentoo Biscoe 50.8 17.3 228 5600
#> 8 Chinst… Dream 46.5 17.9 192 3500
#> 9 Chinst… Dream 50 19.5 196 3900
#> 10 Chinst… Dream 51.3 19.2 193 3650
#> # … with 51 more rows, and 2 more variables: sex <fct>, year <int>
# keep rows where at least one of the columns is "big"
penguins %>%
  filter(if_any(contains("bill"), big))
#> # A tibble: 296 x 8
#> species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Adelie Torge… 39.1 18.7 181 3750
#> 2 Adelie Torge… 39.5 17.4 186 3800
#> 3 Adelie Torge… 40.3 18 195 3250
#> 4 Adelie Torge… 36.7 19.3 193 3450
#> 5 Adelie Torge… 39.3 20.6 190 3650
#> 6 Adelie Torge… 38.9 17.8 181 3625
#> 7 Adelie Torge… 39.2 19.6 195 4675
#> 8 Adelie Torge… 34.1 18.1 193 3475
#> 9 Adelie Torge… 42 20.2 190 4250
#> 10 Adelie Torge… 37.8 17.3 180 3700
#> # … with 286 more rows, and 2 more variables: sex <fct>, year <int>
Both functions operate similarly to across() but go the extra mile of aggregating the results: if_all() indicates whether all the results are true, and if_any() whether at least one is true.
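To make the aggregation concrete, here is a minimal sketch on a small made-up tibble. The data and the helper gt3() are hypothetical and not part of the release. For columns without missing values, the results should match what you would get by combining the per-column tests by hand with & or |.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical toy data, invented for illustration
df <- tibble(
  id = 1:4,
  x1 = c(1, 5, 2, 6),
  x2 = c(4, 6, 1, 1)
)

# Hypothetical helper: TRUE where a value is greater than 3
gt3 <- function(x) x > 3

# if_all() keeps rows where every selected column passes the test
all_rows   <- df %>% filter(if_all(starts_with("x"), gt3))
all_manual <- df %>% filter(gt3(x1) & gt3(x2))

# if_any() keeps rows where at least one selected column passes the test
any_rows   <- df %>% filter(if_any(starts_with("x"), gt3))
any_manual <- df %>% filter(gt3(x1) | gt3(x2))
```

The advantage over the hand-written version is that the tidy selection scales to any number of columns without spelling each one out.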
Although if_all() and if_any() were designed with filter() in mind, we later discovered that they can also be useful within mutate() and summarise():
penguins %>%
  filter(!is.na(bill_length_mm)) %>%
  mutate(
    category = case_when(
      if_all(contains("bill"), big) ~ "both big",
      if_any(contains("bill"), big) ~ "one big",
      TRUE ~ "small"
    )
  ) %>%
  count(category)
#> # A tibble: 3 x 2
#> category n
#> * <chr> <int>
#> 1 both big 61
#> 2 one big 235
#> 3 small 46
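As a smaller sketch of the same idea, if_all() and if_any() can also be used in mutate() to store the row-wise result as a logical column. The tibble and the helper high() below are hypothetical and only serve as an illustration.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical toy data, invented for illustration
scores <- tibble(
  id = 1:4,
  score_a = c(1, 5, 2, 6),
  score_b = c(4, 6, 1, 1)
)

# Hypothetical helper: TRUE where a value is greater than 3
high <- function(x) x > 3

# Store the aggregated results as logical flag columns
flagged <- scores %>%
  mutate(
    all_high = if_all(starts_with("score"), high),
    any_high = if_any(starts_with("score"), high)
  )
```

Every row flagged by all_high is also flagged by any_high, which is why the case_when() call above tests if_all() before if_any(): the first matching condition wins.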
Faster across()
One of the main motivations for across() was eliminating the need for every verb to have an _at, _if, and _all variant. Unfortunately, this came with a performance cost. In this release, we have redesigned across() to eliminate that performance penalty in many cases. In the following example, the old and new approaches now take roughly the same amount of time.
library(vroom)
mun2014 <- vroom(
  "https://data.regardscitoyens.org/elections/2014_municipales/MN14_Bvot_T1_01-49.txt",
  col_select = -c('X4','X9','X10','X11'), col_types = list(), col_names = FALSE,
  locale = locale(encoding = "WINDOWS-1252"), altrep = FALSE
)
bench::workout({
  a <- mun2014 %>% group_by_if(is.character)
  b <- a %>% summarise_if(is.numeric, sum)
})
#> # A tibble: 2 x 3
#> exprs process real
#> <bch:expr> <bch:tm> <bch:tm>
#> 1 a <- mun2014 %>% group_by_if(is.character) 151ms 151ms
#> 2 b <- a %>% summarise_if(is.numeric, sum) 847ms 848ms
bench::workout({
  c <- mun2014 %>% group_by(across(where(is.character)))
  d <- c %>% summarise(across(where(is.numeric), sum))
})
#> `summarise()` has grouped output by 'X2', 'X3', 'X5'. You can override using the `.groups` argument.
#> # A tibble: 2 x 3
#> exprs process real
#> <bch:expr> <bch:tm> <bch:tm>
#> 1 c <- mun2014 %>% group_by(across(where(is.character))) 179ms 179ms
#> 2 d <- c %>% summarise(across(where(is.numeric), sum)) 776ms 777ms
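If you want to try the comparison without downloading the election data, the following sketch uses a synthetic tibble instead. The data, column names, and sizes are made up for illustration. It only checks that the old and new spellings produce the same summary; timings will vary by machine, and each pipeline could be wrapped in system.time() or a benchmarking tool to measure them.

```r
library(dplyr, warn.conflicts = FALSE)

set.seed(1)

# Hypothetical synthetic data: two character grouping columns, two numeric columns
n <- 10000
df <- tibble(
  g1 = sample(letters, n, replace = TRUE),
  g2 = sample(LETTERS[1:5], n, replace = TRUE),
  v1 = runif(n),
  v2 = rnorm(n)
)

# Old approach: scoped _if variants
old <- df %>%
  group_by_if(is.character) %>%
  summarise_if(is.numeric, sum)

# New approach: across() with where()
new <- df %>%
  group_by(across(where(is.character))) %>%
  summarise(across(where(is.numeric), sum), .groups = "drop_last")

# Compare the two results as plain data frames
same_result <- isTRUE(all.equal(as.data.frame(old), as.data.frame(new)))
```

Setting .groups = "drop_last" makes the default grouping of the output explicit and silences the informational message shown above.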
Acknowledgements
Merci to all contributors of code, issues and documentation to this release:
@abalter, @cuixueqin, @eggrandio, @everetr, @hadley, @hjohns12, @iago-pssjd, @jahonamir, @krlmlr, @lionel-, @lotard, @luispfonseca, @mbcann01, @mutahiwachira, @Robinlovelace, @romainfrancois, @rpruim, @shahronak47, @shangguandong1996, @sylvainDaubree, @tomazweiss, @vhorvath, @wasdoff, and @Yunuuuu.