dplyr 1.2.0 is out now! This large release of dplyr comes with two sets of exciting features:
- filter_out(), the missing complement to filter(), and accompanying when_any() and when_all() helpers.
- recode_values(), replace_values(), and replace_when(), three new functions that join case_when() to create a cohesive family of powerful tools for recoding and replacing.
Both of these sets of features are backed by successful tidyups, the tidyverse's community-facing proposal process (filtering, recoding). We really enjoyed having the community weigh in on these features!
You can install dplyr 1.2.0 from CRAN with:
install.packages("dplyr")

You can see a full list of changes in the release notes.
Expanding the filter() family
filter() has been a core dplyr verb since the very beginning, but over the years we’ve isolated a few key issues with it:
- The name filter() is ambiguous: are you keeping or dropping rows? That is, are you filtering for rows or filtering out rows?
- filter() is optimized for the case of keeping rows, but you are just as likely to try to use it for dropping rows. Using filter() to drop rows quickly forces you to confront complex boolean logic and explicitly handle missing values, which is difficult to teach, error prone to write, and hard to understand when you come back to it in the future.
- filter() combines comma separated conditions with & because this covers the majority of cases. But if you'd like to combine conditions with |, you have to introduce parentheses around your conditions and combine them into one large condition separated by |, reducing readability.
In the next few sections, we’ll motivate these issues and discuss how some new features in dplyr can simplify things dramatically!
Filtering…out!
Take a look at this patients data. Our task with this data is:
Filter out rows where the patient is deceased and the year was before 2012.
patients <- tibble(
name = c("Anne", "Mark", "Sarah", "Davis", "Max", "Derek", "Tina"),
deceased = c(FALSE, TRUE, NA, TRUE, NA, FALSE, TRUE),
date = c(2005, 2010, NA, 2020, 2010, NA, NA)
)
patients
#> # A tibble: 7 × 3
#> name deceased date
#> <chr> <lgl> <dbl>
#> 1 Anne FALSE 2005
#> 2 Mark TRUE 2010
#> 3 Sarah NA NA
#> 4 Davis TRUE 2020
#> 5 Max NA 2010
#> 6 Derek FALSE NA
#> 7 Tina TRUE NA
With filter(), you'd probably start by translating "patient is deceased and the year was before 2012" into deceased & date < 2012, and then inverting that with !(<expression>) to drop rows:
patients |>
filter(!(deceased & date < 2012))
#> # A tibble: 3 × 3
#> name deceased date
#> <chr> <lgl> <dbl>
#> 1 Anne FALSE 2005
#> 2 Davis TRUE 2020
#> 3 Derek FALSE NA
That seems to have worked. Let's use an anti_join() to check which rows have been dropped from patients:
# These rows were dropped
anti_join(
patients,
patients |> filter(!(deceased & date < 2012)),
join_by(name, deceased, date)
)
#> # A tibble: 4 × 3
#> name deceased date
#> <chr> <lgl> <dbl>
#> 1 Mark TRUE 2010
#> 2 Sarah NA NA
#> 3 Max NA 2010
#> 4 Tina TRUE NA
This is subtly wrong! We only wanted to drop rows where we know that the patient was deceased before 2012. If a missing value is present, we don't want to drop that row because we aren't sure about the condition. In this case, we were hoping to only drop Mark! It seems like filter() has silently dropped more rows than we intended.
Here's what a technically correct filter() call might look like:
patients |>
filter(
!((deceased & !is.na(deceased)) &
(date < 2012 & !is.na(date)))
)
#> # A tibble: 6 × 3
#> name deceased date
#> <chr> <lgl> <dbl>
#> 1 Anne FALSE 2005
#> 2 Sarah NA NA
#> 3 Davis TRUE 2020
#> 4 Max NA 2010
#> 5 Derek FALSE NA
#> 6 Tina TRUE NA
That’s horrible! You’ll likely look back on this in a year wondering what you were even trying to do here.
This phenomenon is rather confusing, but it stems from the fact that filter() is designed around the idea that you're going to tell it which rows to keep. With that design in mind, dropping NAs makes sense: if you don't know that you want to keep a row (because an NA is ambiguous), then you probably don't want to keep it.
This works well until you try to use filter() as a way to filter out rows, at which point this behavior works against you. Most people then (reasonably!) reach for & !is.na() and end up with the mess from above.
We took a close look at many examples like this one, and eventually realized that the core issue is:
- filter() is designed around supplying which rows to keep
- We are missing a verb designed around supplying which rows to drop
filter_out() fills that gap:
patients |>
filter_out(deceased, date < 2012)
#> # A tibble: 6 × 3
#> name deceased date
#> <chr> <lgl> <dbl>
#> 1 Anne FALSE 2005
#> 2 Sarah NA NA
#> 3 Davis TRUE 2020
#> 4 Max NA 2010
#> 5 Derek FALSE NA
#> 6 Tina TRUE NA
Just like filter(), filter_out() treats NA values as FALSE. The difference is that filter_out() expects that you are going to tell it which rows to drop (rather than which rows to keep), so the default behavior of treating NA like FALSE works with you rather than against you. It's also much easier to understand when you look back on it a year from now!
In general, our advice is that if you find yourself using "negative" operators like != or !, or reaching for the !is.na() pattern to manually handle missing values, try reaching for filter_out() instead.
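To make the NA semantics concrete, here's a minimal sketch of how we understand filter_out() to relate to filter(). The second expression is our own translation of the described behavior, not an official equivalence:

```r
library(dplyr)

patients <- tibble(
  name = c("Anne", "Mark", "Max"),
  deceased = c(FALSE, TRUE, NA),
  date = c(2005, 2010, 2010)
)

# A row is dropped only when `deceased & date < 2012` is known to be
# TRUE; an NA condition means "not sure", so the row is kept
patients |>
  filter_out(deceased, date < 2012)

# Our hand-written filter() translation: keep rows where the combined
# condition is FALSE or unknown (NA)
patients |>
  filter(!(deceased & date < 2012) | is.na(deceased & date < 2012))
```

Note that FALSE & NA evaluates to FALSE in R, so is.na() on the combined condition only flags rows that are genuinely ambiguous.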
Personally, I've always been pretty jealous of Stata here because it has both keep if and drop if, allowing you to write drop if deceased & date < 2012. In my first job, I translated a bunch of Stata code over to R and still remember being frustrated by NA handling every time I had to translate a drop if to a filter(). With filter_out(), it feels like I can finally let go of a grudge I've held for the past 6 years 🙂.
Combining with OR rather than AND
So far, we've talked a lot about dropping rows, but dplyr 1.2.0 also has a new feature to help with keeping rows using conditions combined with |: when_any().
Our goal here is:
Filter for rows where "US" or "CA" has a score between 200 and 300, or rows where "PR" or "RU" has a score between 100 and 200.
countries <- tibble(
name = c("US", "CA", "PR", "RU", "US", NA, "CA", "PR"),
score = c(200, 100, 150, NA, 50, 100, 300, 250)
)
countries
#> # A tibble: 8 × 2
#> name score
#> <chr> <dbl>
#> 1 US 200
#> 2 CA 100
#> 3 PR 150
#> 4 RU NA
#> 5 US 50
#> 6 NA 100
#> 7 CA 300
#> 8 PR 250
Here's a filter() solution. Note how we lose the ability to specify comma separated conditions, and in the process we've introduced three operators, &, |, and (), decreasing readability and increasing the mental gymnastics required to understand it:
countries |>
filter(
(name %in% c("US", "CA") & between(score, 200, 300)) |
(name %in% c("PR", "RU") & between(score, 100, 200))
)
#> # A tibble: 3 × 2
#> name score
#> <chr> <dbl>
#> 1 US 200
#> 2 PR 150
#> 3 CA 300
With when_any(), you specify comma separated conditions like you're used to, but they get combined with | rather than &. This reduces the operators involved down to just &, and it remains very readable:
countries |>
filter(when_any(
name %in% c("US", "CA") & between(score, 200, 300),
name %in% c("PR", "RU") & between(score, 100, 200)
))
#> # A tibble: 3 × 2
#> name score
#> <chr> <dbl>
#> 1 US 200
#> 2 PR 150
#> 3 CA 300
when_any() and its counterpart when_all() aren't restricted to filter(). They are normal vector functions that can be used anywhere. And if you're a package author, you might be interested in vctrs::vec_pany() and vctrs::vec_pall(), the underlying low dependency functions that power the dplyr variants.
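when_all() isn't demonstrated above. Assuming it mirrors when_any() but combines its comma separated conditions with & instead of |, here's a sketch of using it outside of filter(), in a mutate():

```r
library(dplyr)

countries <- tibble(
  name = c("US", NA, "CA"),
  score = c(200, 100, 300)
)

# Assumed semantics: TRUE only when every condition is TRUE for that row
countries |>
  mutate(
    complete = when_all(
      !is.na(name),
      !is.na(score)
    )
  )
```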
Reaching recoding nirvana
Over the years, we've experimented with various ways of recoding columns and replacing values within them, including recode(), case_when(), case_match(), if_else(), na_if(), coalesce(), and tidyr::replace_na().
Despite all of our improvements, it's felt like there have always been holes in our solutions. Most recently, this came to the forefront in a post about recoding using a lookup table, which is almost impossible to do with case_when(), and had people resorting to confusing solutions using the superseded dplyr::recode() combined with !!! to splice in a lookup table.
After seeing this, we took a step back and were finally able to isolate the issues with our current solutions. The result of our analysis is three new functions that join case_when() to form a powerful recoding and replacing family.
It’ll be helpful to define exactly what we mean by recoding vs replacing:
Recoding a column creates an entirely new column using values from an existing column. The new column may have a different type from the original column.
Replacing values within a column partially updates an existing column with new values. The result has the same type as the original column.
The family of functions can be summarized by the following table:
|  | Recoding | Replacing |
|---|---|---|
| Match with conditions | case_when() | replace_when() |
| Match with values | recode_values() | replace_values() |
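A minimal sketch contrasting the two columns of this table, using the function signatures shown later in this post:

```r
library(dplyr)

x <- c("a", "b", "c")

# Recoding: builds an entirely new vector, here changing the type
# from character to integer
recode_values(x, "a" ~ 1L, "b" ~ 2L, "c" ~ 3L)

# Replacing: partially updates `x`, so the result keeps its character
# type ("b" and "c" pass through untouched)
replace_values(x, "a" ~ "A")
```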
We’ve written a new vignette that expands on all of these from first principles, and in the next few sections we’ll look at some examples.
recode_values()
The goal of the post from above was to recode a numeric column of Likert scale scores into their string counterparts.
First, some example data:

likert <- tibble(
  score = c(1, 2, 3, 4, 5, 2, 3, 1, 4)
)

We could certainly try case_when():
likert |>
mutate(
category = case_when(
score == 1 ~ "Strongly disagree",
score == 2 ~ "Disagree",
score == 3 ~ "Neutral",
score == 4 ~ "Agree",
score == 5 ~ "Strongly agree"
)
)
#> # A tibble: 9 × 2
#> score category
#> <dbl> <chr>
#> 1 1 Strongly disagree
#> 2 2 Disagree
#> 3 3 Neutral
#> 4 4 Agree
#> 5 5 Strongly agree
#> 6 2 Disagree
#> 7 3 Neutral
#> 8 1 Strongly disagree
#> 9 4 Agree
But score == is repeated so many times! When you find yourself using == in this way, recognize that what you're really doing is matching on the values of a single column. In cases like these, you'll want to switch to recode_values(). Rather than taking logical vectors, recode_values() takes values on the left-hand side to match against a single input that you'll provide as the first argument.
likert |>
mutate(
category = score |>
recode_values(
1 ~ "Strongly disagree",
2 ~ "Disagree",
3 ~ "Neutral",
4 ~ "Agree",
5 ~ "Strongly agree"
)
)
#> # A tibble: 9 × 2
#> score category
#> <dbl> <chr>
#> 1 1 Strongly disagree
#> 2 2 Disagree
#> 3 3 Neutral
#> 4 4 Agree
#> 5 5 Strongly agree
#> 6 2 Disagree
#> 7 3 Neutral
#> 8 1 Strongly disagree
#> 9 4 Agree
This removes all of the repetition, allowing you to focus on the mapping. And it should feel pretty familiar! This is the same formula interface as case_when().
If you squint, the mapping should look roughly like a lookup table between the numeric value and the Likert encoding. One of the novel features of recode_values() is that it has an alternate interface that allows you to make this lookup table more explicit. Using a tribble(), we can extract the lookup table into its own standalone data frame:
lookup <- tribble(
~from , ~to ,
1 , "Strongly disagree" ,
2 , "Disagree" ,
3 , "Neutral" ,
4 , "Agree" ,
5 , "Strongly agree" ,
)

We can then use the alternative from and to arguments of recode_values(), rather than supplying formulas, to specify how the values should be recoded:
likert |>
mutate(category = recode_values(score, from = lookup$from, to = lookup$to))
#> # A tibble: 9 × 2
#> score category
#> <dbl> <chr>
#> 1 1 Strongly disagree
#> 2 2 Disagree
#> 3 3 Neutral
#> 4 4 Agree
#> 5 5 Strongly agree
#> 6 2 Disagree
#> 7 3 Neutral
#> 8 1 Strongly disagree
#> 9 4 Agree
Lifting the lookup table out to the top of the file is particularly nice when you have a long pipe chain. The details of the mapping get some room to breathe, and in the pipe chain you can focus on the actual data manipulations.
It's also very common for your lookup table to live in a CSV file that you read in separately. In that case, you can replace the tribble() call with:
lookup <- readr::read_csv("lookup.csv")

Then everything else works the same.
Unmatched cases
If you are confident that you've captured every case during the recoding process, you can supply unmatched = "error" as an alternative to providing a default. recode_values() will error if that assertion doesn't hold. This is great for defensive programming!
# Notice the `0` that we don't have a mapping for!
likert <- tibble(
score = c(0, 1, 2, 2, 4, 5, 2, 3, 1, 4)
)
likert |>
mutate(
score = score |>
recode_values(
from = lookup$from,
to = lookup$to,
unmatched = "error"
)
)
#> Error in `mutate()`:
#> ℹ In argument: `score = recode_values(score, from = lookup$from, to = lookup$to, unmatched = "error")`.
#> Caused by error in `recode_values()`:
#> ! Each location must be matched.
#> ✖ Location 1 is unmatched.
Note that missing values must be explicitly handled when setting unmatched = "error", even if that’s just setting NA ~ NA, otherwise they will trigger the unmatched error. This forces you to explicitly opt in to expecting missing values.
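For example, a sketch of opting in to missing values while still guarding against anything else unexpected:

```r
library(dplyr)

likert <- tibble(score = c(1, 2, NA, 3))

likert |>
  mutate(
    category = recode_values(
      score,
      1 ~ "Strongly disagree",
      2 ~ "Disagree",
      3 ~ "Neutral",
      4 ~ "Agree",
      5 ~ "Strongly agree",
      NA ~ NA,              # explicitly opt in to missing values
      unmatched = "error"   # anything else (like a stray 0) errors
    )
  )
```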
Similar to recode_values(), case_when() has also gained the .unmatched argument.
replace_values()
Out of all of the new things introduced in dplyr 1.2.0, I think I'm most excited about replace_values().
While recode_values() is great for creating an entirely new column (possibly with a new type), if you just need to replace a few rows of an existing column, then replace_values() is the best tool for the job!
Imagine we’d like to collapse some, but not all, of these school names into common buckets:
schools <- tibble(
name = c(
"UNC",
"Chapel Hill",
NA,
"Duke",
"Duke University",
"UNC",
"NC State",
"ECU"
)
)

We could use case_when() or recode_values():
schools |>
mutate(
name = case_when(
name %in% c("UNC", "Chapel Hill") ~ "UNC Chapel Hill",
name %in% c("Duke", "Duke University") ~ "Duke",
.default = name
)
)
#> # A tibble: 8 × 1
#> name
#> <chr>
#> 1 UNC Chapel Hill
#> 2 UNC Chapel Hill
#> 3 NA
#> 4 Duke
#> 5 Duke
#> 6 UNC Chapel Hill
#> 7 NC State
#> 8 ECU
schools |>
mutate(
name = recode_values(
name,
c("UNC", "Chapel Hill") ~ "UNC Chapel Hill",
c("Duke", "Duke University") ~ "Duke",
default = name
)
)
#> # A tibble: 8 × 1
#> name
#> <chr>
#> 1 UNC Chapel Hill
#> 2 UNC Chapel Hill
#> 3 NA
#> 4 Duke
#> 5 Duke
#> 6 UNC Chapel Hill
#> 7 NC State
#> 8 ECU
But this "partial update" operation is so common that it really deserves its own name, one that doesn't require you to specify a default and is type stable on the input. For that, we have replace_values():
schools |>
mutate(
name = name |>
replace_values(
c("UNC", "Chapel Hill") ~ "UNC Chapel Hill",
c("Duke", "Duke University") ~ "Duke"
)
)
#> # A tibble: 8 × 1
#> name
#> <chr>
#> 1 UNC Chapel Hill
#> 2 UNC Chapel Hill
#> 3 NA
#> 4 Duke
#> 5 Duke
#> 6 UNC Chapel Hill
#> 7 NC State
#> 8 ECU
Notice how pipe friendly replace_values() is! The first input is your "primary" input, and you can expect the output to have the same type and size as that input.
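Here's a sketch of what we understand type stability to imply. The second call reflects our expectation based on the description above, not verified behavior:

```r
library(dplyr)

scores <- c(10, 20, 999)

# Type stable: numeric in, numeric out
replace_values(scores, 999 ~ NA)

# Our expectation (an assumption, not documented above): a replacement
# value that can't be cast to the input's type should error rather than
# silently change the column's type
try(replace_values(scores, 999 ~ "missing"))
```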
Like recode_values(), replace_values() has an alternative from and to API that works well with lookup tables and allows you to move your mapping out of the pipe chain:
lookup <- tribble(
~from , ~to ,
"UNC" , "UNC Chapel Hill" ,
"Chapel Hill" , "UNC Chapel Hill" ,
"Duke" , "Duke" ,
"Duke University" , "Duke" ,
)
schools |>
mutate(name = replace_values(name, from = lookup$from, to = lookup$to))
#> # A tibble: 8 × 1
#> name
#> <chr>
#> 1 UNC Chapel Hill
#> 2 UNC Chapel Hill
#> 3 NA
#> 4 Duke
#> 5 Duke
#> 6 UNC Chapel Hill
#> 7 NC State
#> 8 ECU
An extremely neat feature of the from and to API is that they also take lists of vectors that describe the mapping. This has been designed to work elegantly with the fact that tribble() can create list columns, allowing you to further collapse this lookup table:
# Condensed lookup table with a `many:1` mapping per row
lookup <- tribble(
~from , ~to ,
c("UNC", "Chapel Hill") , "UNC Chapel Hill" ,
c("Duke", "Duke University") , "Duke" ,
)
# Note that `from` is a list column
lookup
#> # A tibble: 2 × 2
#> from to
#> <list> <chr>
#> 1 <chr [2]> UNC Chapel Hill
#> 2 <chr [2]> Duke
lookup$from
#> [[1]]
#> [1] "UNC" "Chapel Hill"
#>
#> [[2]]
#> [1] "Duke" "Duke University"
# Works the same as before
schools |>
mutate(name = replace_values(name, from = lookup$from, to = lookup$to))
#> # A tibble: 8 × 1
#> name
#> <chr>
#> 1 UNC Chapel Hill
#> 2 UNC Chapel Hill
#> 3 NA
#> 4 Duke
#> 5 Duke
#> 6 UNC Chapel Hill
#> 7 NC State
#> 8 ECU
The formula interface of replace_values() is a bit of a Swiss Army knife for all manner of scenarios where you might have previously reached for dplyr::coalesce(), dplyr::na_if(), or tidyr::replace_na():
state <- c("NC", "NY", "CA", NA, "NY", "Unknown", NA)
# Replace missing values with a constant
replace_values(state, NA ~ "Unknown")
#> [1] "NC" "NY" "CA" "Unknown" "NY" "Unknown" "Unknown"
# Replace missing values with the corresponding value from another column
region <- c("South", "North", "West", "East", "North", "Unknown", "West")
replace_values(state, NA ~ region)
#> [1] "NC" "NY" "CA" "East" "NY" "Unknown" "West"
# Replace problematic values with a missing value
replace_values(state, "Unknown" ~ NA)
#> [1] "NC" "NY" "CA" NA "NY" NA NA
# Standardize multiple issues at once
replace_values(state, c(NA, "Unknown") ~ "<missing>")
#> [1] "NC" "NY" "CA" "<missing>" "NY" "<missing>"
#> [7] "<missing>"
We also think it better expresses intent than if_else() or case_when() when performing a partial update:
# - Type stable on `x`
# - Intent of "partially updating" `state` is clear
# - Pipe friendly
state |> replace_values(NA ~ "Unknown")
# Historically this has been "the way" to do a partial update,
# but it's odd that the "primary" input is at the end!
if_else(is.na(state), "Unknown", state)
case_when(is.na(state) ~ "Unknown", .default = state)

If you're a package author, you'll probably also be interested in vctrs::vec_recode_values() and vctrs::vec_replace_values(), which are low dependency functions that power the dplyr variants.
What about case_match()?
We've soft-deprecated case_match() in favor of recode_values(), which is a drop in replacement. case_match() was an incremental step towards this recoding family, but:
- It has a pretty confusing name compared with recode_values().
- It lacks a way to work with lookup tables, like from and to.
- It lacks a replacement variant, like replace_values().
Rather than keeping case_match() around indefinitely, we've decided to initiate the process of its removal, since it was only introduced in dplyr 1.1.0.
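Migrating is typically just a rename. Note that we're assuming case_match()'s .default argument becomes default, matching the recode_values() example earlier in this post:

```r
library(dplyr)

x <- c("a", "b", "z")

# Before (soft-deprecated):
case_match(x, "a" ~ "apple", "b" ~ "banana", .default = "other")

# After (described as a drop in replacement):
recode_values(x, "a" ~ "apple", "b" ~ "banana", default = "other")
```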
Deprecations
dplyr 1.2.0 advances the lifecycle stage of many deprecated functions. These deprecations have been in the works for many years now, due to our slow and very deliberate deprecation process via the lifecycle package. We’ll cover the highlights, and you can find the full list here.
For any packages that we broke via these deprecations, we provided a pull request (or at least an issue, for complex cases) and some advance warning. We semi-automated some of this process using Claude Code, which you can read about here.
- All underscored verbs have moved from deprecated to defunct, such as mutate_() and arrange_(). These have been deprecated since dplyr 0.7.0 back in 2017 (yes, 2017!!). Use the non-underscored versions; see vignette("programming") for details.
- mutate_each() and summarise_each() have moved from deprecated to defunct. These were also deprecated in dplyr 0.7.0. Use mutate() and summarise() with across() instead.
- Returning more or less than 1 row per group in summarise() has moved from deprecated to defunct. This was deprecated in dplyr 1.1.0 in 2023 after we realized that this was an unsafe feature for summarise(), which you should always expect to return exactly 1 row per group. reframe() is a drop in replacement when you need this.
- In case_when(), supplying all size 1 LHS inputs along with a size >1 RHS input is now soft-deprecated. This is an improper usage of case_when() that should instead be a series of if statements, like:

  # Scalars!
  code <- 1L
  flavor <- "vanilla"

  # Improper usage:
  case_when(
    code == 1L && flavor == "chocolate" ~ x,
    code == 1L && flavor == "vanilla" ~ y,
    code == 2L && flavor == "vanilla" ~ z,
    .default = default
  )

  # Recommended:
  if (code == 1L && flavor == "chocolate") {
    x
  } else if (code == 1L && flavor == "vanilla") {
    y
  } else if (code == 2L && flavor == "vanilla") {
    z
  } else {
    default
  }

  The recycling behavior that allows this style of case_when() to work is unsafe, and can result in silent bugs that we'd like to guard against with an error in the future. See this issue for context.
- The dplyr.legacy_locale global option is soft-deprecated. If you used this to affect the ordering of arrange(), use arrange(.locale =) instead. If you used this to affect the ordering of group_by() |> summarise(), follow up with an additional call to arrange(.locale =) instead.
- if_else() no longer allows condition to be a logical array. It must be a logical vector with no dim attribute.
- We've removed a number of previously defunct functions, shrinking the footprint of dplyr's API: id(), failwith(), select_vars() and select_vars_(), rename_vars() and rename_vars_(), select_var(), current_vars(), bench_tbls(), compare_tbls(), compare_tbls2(), eval_tbls(), and eval_tbls2(), and location() and changes().
Acknowledgements
We'd like to thank all 177 contributors who helped in some way, whether it was filing issues or contributing code and documentation: @abalter, @abichat, @adupaix, @AlexBainton, @alexmcsw, @AltfunsMA, @AmeliaMN, @antdurrant, @AnthonyEbert, @apalacio9502, @apeterson91, @arnaudgallou, @awpsoras, @bakaburg1, @barnabasharris, @BHII-KSC, @bholtemeyer, @billdenney, @bounlu, @brendensm, @bridroberts1, @brookslogan, @catalamarti, @cboettig, @cbrnr, @ccani007, @charliejhadley, @ChrisHIV, @ChristianRohde, @cobac, @conig, @const-ae, @Copilot, @d-morrison, @DanChaltiel, @daniel-simeone, @DanielBraddock, @david-romano, @davidrsch, @davidss101, @DavisVaughan, @dcaud, @deschen1, @DesiQuintans, @devster31, @dkutner, @dmuenz, @ds-jim, @eitsupi, @EmilHvitfeldt, @etiennebacher, @eutwt, @EvertonTLima, @ferreirafm, @gaborcsardi, @GabryS3, @Gastonia02, @GBarnsley, @gevro, @ggrothendieck, @GischD, @gks281263, @gracehartley, @graphdr, @hadley, @heliconone, @Hzanib, @ilovemane, @ja-ortiz-uniandes, @jack-davison, @james-kilgour, @JamesHWade, @jaymicro, @JBrandenburg02, @jc-usda, @jennybc, @jeroenjanssens, @jestover, @jl5000, @jmbarbone, @john-b-edwards, @jordanmross, @joshua-theisen, @jrwinget, @juliaapolonio, @jxu, @KaiAragaki, @kiki830621, @KittJonathan, @kleinerChemiker, @kletts, @krlmlr, @ks8997, @kylebutts, @larsentom, @latot, @lboller-pwbm, @lionel-, @Longfei2, @lschneiderbauer, @LukasTang, @lukebandy, @maciekbanas, @maelle, @marcuslehr, @Mark-AP, @markwestcott34, @maskegger, @matiasandina, @matthewjnield, @mbcann01, @Meghansaha, @metanoid, @MichaelChirico, @MikeJohnPage, @MilesMcBain, @mine-cetinkaya-rundel, @MohsenSoltanifar, @moodymudskipper, @Moohan, @mp8, @mpsturbo, @mr-c, @muschellij2, @musvaage, @Mzhuk7, @nalimilan, @nathanhaigh, @nirguk, @nmercadeb, @olivermagnanimous, @olivroy, @orgadish, @pangchaoran, @paschatz, @prubin73, @PStaus, @psychelzh, @py9mrg, @Raesu, @randyzwitch, @Raoul-Kima, @ReedMerrill, @RodDalBen, @RodrigoZepeda, @rossholmberg, @RoyalTS, @ryandward,
@sbanville-delfi, @ScientiaFelis, @shirdekel, @slager, @sschooler, @steffen-stell, @szimmer, @TheClownBongo, @thomasjwood, @TimTaylor, @tlyons253, @tomalrussell, @tomwagstaff-opml, @torfason, @Tyrrx, @Unaimend, @VisruthSK, @vorpalvorpal, @walkerjameschris, @wbvguo, @wbzyl, @wkumler, @yaboody, @yjunechoe, @ynsec37, @ywhcuhk, @ZHBHSMILE, @zhjx19, and @ZIBOWANGKANGYU.