We’re pleased to announce the release of tidyr 1.3.0. tidyr provides a set of tools for transforming data frames to and from tidy data, where each variable is a column and each observation is a row. Tidy data is a convention for matching the semantics and structure of your data that makes using the rest of the tidyverse (and many other R packages) much easier.
You can install it from CRAN with:
install.packages("tidyr")
This post highlights the biggest changes in this release:
- A new family of separate_*() functions supersede separate() and extract() and come with useful debugging features.
- unnest_wider() and unnest_longer() gain a bundle of useful improvements.
- pivot_longer() gets a new cols_vary argument.
- nest(.by) provides a new (and hopefully final) way to create nested datasets.
You should also notice generally improved errors with this release: we check function arguments more aggressively, and take care to always report the name of the function that you called, not some internal helper. As usual, you can find a full set of changes in the release notes.
separate_*() family of functions
The biggest feature of this release is a new, experimental family of functions for separating string columns:
| | Make columns | Make rows |
|---|---|---|
| Separate with delimiter | separate_wider_delim() | separate_longer_delim() |
| Separate by position | separate_wider_position() | separate_longer_position() |
| Separate with regular expression | separate_wider_regex() | |
These functions collectively supersede extract(), separate(), and separate_rows() because they have more consistent names and arguments, have better performance (thanks to stringr), and provide a new approach for handling problems. For reference, the superseded functions fill the same grid like this:
| | Make columns | Make rows |
|---|---|---|
| Separate with delimiter | separate(sep = string) | separate_rows() |
| Separate by position | separate(sep = integer vector) | N/A |
| Separate with regular expression | extract() | |
Here I’ll focus on the wider functions because they generally present the most interesting challenges. Let’s start by grabbing some census data with the tidycensus package:
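library(tidyr)
library(dplyr, warn.conflicts = FALSE)
# Note: tidycensus generally needs a Census API key to be configured first.
vt_census <- tidycensus::get_decennial(
  geography = "block",
  state = "VT",
  county = "Washington",
  variables = "P1_001N",
  year = 2020
)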
#> Getting data from the 2020 decennial Census
#> Using the PL 94-171 Redistricting Data summary file
#> Note: 2020 decennial Census data use differential privacy, a technique that
#> introduces errors into data to preserve respondent confidentiality.
#> ℹ Small counts should be interpreted with caution.
#> ℹ See https://www.census.gov/library/fact-sheets/2021/protecting-the-confidentiality-of-the-2020-census-redistricting-data.html for additional guidance.
#> This message is displayed once per session.
vt_census
#> # A tibble: 2,150 × 4
#> GEOID NAME variable value
#> <chr> <chr> <chr> <dbl>
#> 1 500239555021014 Block 1014, Block Group 1, Census Tract 9555.… P1_001N 21
#> 2 500239555021015 Block 1015, Block Group 1, Census Tract 9555.… P1_001N 19
#> 3 500239555021016 Block 1016, Block Group 1, Census Tract 9555.… P1_001N 0
#> 4 500239555021017 Block 1017, Block Group 1, Census Tract 9555.… P1_001N 0
#> 5 500239555021018 Block 1018, Block Group 1, Census Tract 9555.… P1_001N 43
#> 6 500239555021019 Block 1019, Block Group 1, Census Tract 9555.… P1_001N 68
#> 7 500239555021020 Block 1020, Block Group 1, Census Tract 9555.… P1_001N 30
#> 8 500239555021021 Block 1021, Block Group 1, Census Tract 9555.… P1_001N 0
#> 9 500239555021022 Block 1022, Block Group 1, Census Tract 9555.… P1_001N 18
#> 10 500239555021023 Block 1023, Block Group 1, Census Tract 9555.… P1_001N 93
#> # … with 2,140 more rows
The GEOID column is made up of four components: a 2-digit state identifier, a 3-digit county identifier, a 6-digit tract identifier, and a 4-digit block identifier. We can use separate_wider_position() to extract these into their own variables:
vt_census |>
  select(GEOID) |>
  separate_wider_position(
    GEOID,
    widths = c(state = 2, county = 3, tract = 6, block = 4)
  )
#> # A tibble: 2,150 × 4
#> state county tract block
#> <chr> <chr> <chr> <chr>
#> 1 50 023 955502 1014
#> 2 50 023 955502 1015
#> 3 50 023 955502 1016
#> 4 50 023 955502 1017
#> 5 50 023 955502 1018
#> 6 50 023 955502 1019
#> 7 50 023 955502 1020
#> 8 50 023 955502 1021
#> 9 50 023 955502 1022
#> 10 50 023 955502 1023
#> # … with 2,140 more rows
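As a small aside, widths does not need every component to be named. The sketch below uses a made-up dates tibble (an assumption for illustration, not part of the census data) and relies on the documented behaviour that unnamed widths are matched but left out of the output:

```r
library(tidyr)

# Hypothetical data: ISO-style dates stored as fixed-width strings.
dates <- tibble(x = c("2023-01-15", "2022-12-31"))

parts <- dates |>
  separate_wider_position(
    x,
    # The two unnamed widths of 1 consume the "-" separators and are dropped.
    widths = c(year = 4, 1, month = 2, 1, day = 2)
  )
parts
```

This can be handy when a fixed-width field contains padding or separators that you never want as columns.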
The NAME column contains this same information in a text form, with each component separated by a comma. We can use separate_wider_delim() to break up this sort of data into individual variables:
vt_census |>
  select(NAME) |>
  separate_wider_delim(
    NAME,
    delim = ", ",
    names = c("block", "block_group", "tract", "county", "state")
  )
#> # A tibble: 2,150 × 5
#> block block_group tract county state
#> <chr> <chr> <chr> <chr> <chr>
#> 1 Block 1014 Block Group 1 Census Tract 9555.02 Washington County Vermont
#> 2 Block 1015 Block Group 1 Census Tract 9555.02 Washington County Vermont
#> 3 Block 1016 Block Group 1 Census Tract 9555.02 Washington County Vermont
#> 4 Block 1017 Block Group 1 Census Tract 9555.02 Washington County Vermont
#> 5 Block 1018 Block Group 1 Census Tract 9555.02 Washington County Vermont
#> 6 Block 1019 Block Group 1 Census Tract 9555.02 Washington County Vermont
#> 7 Block 1020 Block Group 1 Census Tract 9555.02 Washington County Vermont
#> 8 Block 1021 Block Group 1 Census Tract 9555.02 Washington County Vermont
#> 9 Block 1022 Block Group 1 Census Tract 9555.02 Washington County Vermont
#> 10 Block 1023 Block Group 1 Census Tract 9555.02 Washington County Vermont
#> # … with 2,140 more rows
You’ll notice that each row contains a lot of duplicated information (“Block”, “Block Group”, …). You could certainly use mutate() and string manipulation to clean this up, but there’s a more direct approach that you can use if you’re familiar with regular expressions. The new separate_wider_regex() takes a vector of regular expressions that are matched in order, from left to right. If you name the regular expression, it will appear in the output; otherwise, it will be dropped. I think this leads to a particularly elegant solution to many problems.
vt_census |>
  select(NAME) |>
  separate_wider_regex(
    NAME,
    patterns = c(
      "Block ", block = "\\d+", ", ",
      "Block Group ", block_group = "\\d+", ", ",
      "Census Tract ", tract = "\\d+\\.\\d+", ", ",
      county = "[^,]+", ", ",
      state = ".*"
    )
  )
#> # A tibble: 2,150 × 5
#> block block_group tract county state
#> <chr> <chr> <chr> <chr> <chr>
#> 1 1014 1 9555.02 Washington County Vermont
#> 2 1015 1 9555.02 Washington County Vermont
#> 3 1016 1 9555.02 Washington County Vermont
#> 4 1017 1 9555.02 Washington County Vermont
#> 5 1018 1 9555.02 Washington County Vermont
#> 6 1019 1 9555.02 Washington County Vermont
#> 7 1020 1 9555.02 Washington County Vermont
#> 8 1021 1 9555.02 Washington County Vermont
#> 9 1022 1 9555.02 Washington County Vermont
#> 10 1023 1 9555.02 Washington County Vermont
#> # … with 2,140 more rows
These functions also have a new way to report problems. Let’s start with a very simple example:
df <- tibble(
  id = 1:3,
  x = c("a", "a-b", "a-b-c")
)
df |> separate_wider_delim(x, delim = "-", names = c("x", "y"))
#> Error in `separate_wider_delim()`:
#> ! Expected 2 pieces in each element of `x`.
#> ! 1 value was too short.
#> ℹ Use `too_few = "debug"` to diagnose the problem.
#> ℹ Use `too_few = "align_start"/"align_end"` to silence this message.
#> ! 1 value was too long.
#> ℹ Use `too_many = "debug"` to diagnose the problem.
#> ℹ Use `too_many = "drop"/"merge"` to silence this message.
We’ve requested two columns in the output (x and y), but the first row has only one element and the last row has three elements, so separate_wider_delim() can’t do what we’ve asked. The error lays out your options for resolving the problem using the too_few and too_many arguments. I’d recommend always starting with "debug" to get more information about the problem:
probs <- df |>
  separate_wider_delim(
    x,
    delim = "-",
    names = c("a", "b"),
    too_few = "debug",
    too_many = "debug"
  )
#> Warning: Debug mode activated: adding variables `x_ok`, `x_pieces`, and `x_remainder`.
probs
#> # A tibble: 3 × 7
#> id a b x x_ok x_pieces x_remainder
#> <int> <chr> <chr> <chr> <lgl> <int> <chr>
#> 1 1 a NA a FALSE 1 ""
#> 2 2 a b a-b TRUE 2 ""
#> 3 3 a b a-b-c FALSE 3 "-c"
This adds three new variables: x_ok tells you if x could be separated as you requested, x_pieces tells you the actual number of pieces, and x_remainder shows you anything that remains after the columns you asked for. You can use this information to fix the problems in the input, or you can use the other options to too_few and too_many to tell separate_wider_delim() to fix them for you:
df |>
  separate_wider_delim(
    x,
    delim = "-",
    names = c("a", "b"),
    too_few = "align_start",
    too_many = "drop"
  )
#> # A tibble: 3 × 3
#> id a b
#> <int> <chr> <chr>
#> 1 1 a NA
#> 2 2 a b
#> 3 3 a b
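The error message also names two other options, "align_end" and "merge". As a minimal sketch on the same toy data, this is how they could be combined; the comments describe the behaviour as I understand it from the tidyr documentation:

```r
library(tidyr)

df <- tibble(
  id = 1:3,
  x = c("a", "a-b", "a-b-c")
)

res <- df |>
  separate_wider_delim(
    x,
    delim = "-",
    names = c("a", "b"),
    too_few = "align_end",  # short values are padded with NA on the left
    too_many = "merge"      # extra pieces are folded into the last column
  )
res
```

Here the first row ends up with a missing `a` and `b = "a"`, while the third row keeps the leftover text as `b = "b-c"` instead of discarding it.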
too_few and too_many also work with separate_wider_position(), and too_few works with separate_wider_regex(). The longer variants don’t need these arguments because varying numbers of rows don’t matter in the same way that varying numbers of columns do:
df |> separate_longer_delim(x, delim = "-")
#> # A tibble: 6 × 2
#> id x
#> <int> <chr>
#> 1 1 a
#> 2 2 a
#> 3 2 b
#> 4 3 a
#> 5 3 b
#> 6 3 c
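The positional counterpart works along the same lines. The sketch below uses a hypothetical codes tibble (not from the post) to show separate_longer_position() splitting each string into fixed-width chunks, one row per chunk:

```r
library(tidyr)

# Hypothetical data: strings made of two-character codes.
codes <- tibble(id = 1:2, x = c("abcd", "ef"))

# Each value of x is cut into pieces of width 2; id is repeated as needed.
long <- codes |> separate_longer_position(x, width = 2)
long
```

The first string yields two rows ("ab" and "cd") and the second yields one ("ef"), so no too_few or too_many handling is required.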
These functions are still experimental so we are actively seeking feedback. Please try them out and let us know if you find them useful or if there are other features you’d like to see.
unnest_wider() and unnest_longer() improvements
unnest_longer() and unnest_wider() have both received some quality of life and consistency improvements. Most importantly:
- unnest_wider() now gives a better error when unnesting an unnamed vector:
  df <- tibble(
    id = 1:2,
    x = list(c("a", "b"), c("d", "e", "f"))
  )
  df |> unnest_wider(x)
  #> Error in `unnest_wider()`:
  #> ℹ In column: `x`.
  #> ℹ In row: 1.
  #> Caused by error:
  #> ! Can't unnest elements with missing names.
  #> ℹ Supply `names_sep` to generate automatic names.
  df |> unnest_wider(x, names_sep = "_")
  #> # A tibble: 2 × 4
  #> id x_1 x_2 x_3
  #> <int> <chr> <chr> <chr>
  #> 1 1 a b NA
  #> 2 2 d e f
  And this same behaviour now also applies to partially named vectors.
- unnest_longer() has gained a keep_empty argument like unnest(), and it now treats NULL and empty vectors the same way:
  df <- tibble(
    id = 1:3,
    x = list(NULL, integer(), 1:3)
  )
  df |> unnest_longer(x)
  #> # A tibble: 3 × 2
  #> id x
  #> <int> <int>
  #> 1 3 1
  #> 2 3 2
  #> 3 3 3
  df |> unnest_longer(x, keep_empty = TRUE)
  #> # A tibble: 5 × 2
  #> id x
  #> <int> <int>
  #> 1 1 NA
  #> 2 2 NA
  #> 3 3 1
  #> 4 3 2
  #> 5 3 3
pivot_longer(cols_vary)
By default, pivot_longer() creates its output row-by-row:
df <- tibble(
  x = 1:2,
  y = 3:4,
  z = 5:6
)
df |>
  pivot_longer(
    everything(),
    names_to = "name",
    values_to = "value"
  )
#> # A tibble: 6 × 2
#> name value
#> <chr> <int>
#> 1 x 1
#> 2 y 3
#> 3 z 5
#> 4 x 2
#> 5 y 4
#> 6 z 6
You can now request to create the output column-by-column with cols_vary = "slowest":
df |>
  pivot_longer(
    everything(),
    names_to = "name",
    values_to = "value",
    cols_vary = "slowest"
  )
#> # A tibble: 6 × 2
#> name value
#> <chr> <int>
#> 1 x 1
#> 2 x 2
#> 3 y 3
#> 4 y 4
#> 5 z 5
#> 6 z 6
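To make the naming a little more concrete: as far as I know, the default corresponds to cols_vary = "fastest", meaning the column names cycle fastest in the output. A minimal side-by-side sketch on the same data:

```r
library(tidyr)

df <- tibble(x = 1:2, y = 3:4, z = 5:6)

# "fastest" (the default): output is built row-by-row from the input.
fastest <- df |> pivot_longer(everything(), cols_vary = "fastest")

# "slowest": output is built column-by-column from the input.
slowest <- df |> pivot_longer(everything(), cols_vary = "slowest")

fastest$name  # names cycle: x, y, z, x, y, z
slowest$name  # names are grouped: x, x, y, y, z, z
```

Both results contain the same six rows; only their order differs.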
nest(.by)
A nested data frame is a data frame where one (or more) columns is a list of data frames. Nested data frames are a powerful tool that allows you to turn groups into rows and can facilitate certain types of data manipulation that would be very tricky otherwise. (One place to learn more about them is my 2016 talk “Managing many models with R”.)
Over the years we’ve made a number of attempts at getting the correct interface for nesting, including tidyr::nest(), dplyr::nest_by(), and dplyr::group_nest(). In this version of tidyr we’ve taken one more stab at it by adding a new argument to nest(): .by, inspired by the upcoming dplyr 1.1.0 release. This means that nest() now allows you to specify the variables you want to nest by as an alternative to specifying the variables that appear in the nested data.
# Specify what to nest by
mtcars |>
  nest(.by = cyl)
#> # A tibble: 3 × 2
#> cyl data
#> <dbl> <list>
#> 1 6 <tibble [7 × 10]>
#> 2 4 <tibble [11 × 10]>
#> 3 8 <tibble [14 × 10]>
# Specify what should be nested
mtcars |>
  nest(data = -cyl)
#> # A tibble: 3 × 2
#> cyl data
#> <dbl> <list>
#> 1 6 <tibble [7 × 10]>
#> 2 4 <tibble [11 × 10]>
#> 3 8 <tibble [14 × 10]>
# Specify both (to drop variables)
mtcars |>
  nest(data = mpg:drat, .by = cyl)
#> # A tibble: 3 × 2
#> cyl data
#> <dbl> <list>
#> 1 6 <tibble [7 × 5]>
#> 2 4 <tibble [11 × 5]>
#> 3 8 <tibble [14 × 5]>
If this function is all we hope it to be, we’re likely to supersede dplyr::nest_by() and dplyr::group_nest() in the future. This has the nice property of placing the functions for nesting and unnesting in the same package (tidyr).
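As a rough illustration of nesting and unnesting living side by side, the sketch below round-trips mtcars through nest(.by) and unnest(). The variable names nested and flat are just placeholders chosen for this example:

```r
library(tidyr)

# One row per value of cyl, with the remaining columns in a list-column.
nested <- mtcars |> nest(.by = cyl)

# unnest() expands the list-column back into regular rows and columns.
flat <- nested |> unnest(data)
dim(flat)
```

The result has the same number of rows and columns as mtcars, although the row order and column order follow the grouping rather than the original data.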
Acknowledgements
A big thanks to all 51 contributors who helped make this release possible by writing code and documentation, asking questions, and reporting bugs! @AdrianS85, @ahcyip, @allenbaron, @AnBarbosaBr, @ArthurAndrews, @bart1, @billdenney, @bknakker, @bwiernik, @crissthiandi, @daattali, @DavisVaughan, @dcaud, @DSLituiev, @elgabbas, @fabiangehring, @hadley, @ilikegitlab, @jennybc, @jic007, @Joao-O-Santos, @joeycouse, @jonspring, @kevinushey, @krlmlr, @lionel-, @lotard, @lschneiderbauer, @lucylgao, @markfairbanks, @martina-starc, @MatthieuStigler, @mattnolan001, @mattroumaya, @mdkrause, @mgirlich, @millermc38, @modche, @moodymudskipper, @mspittler, @olivroy, @piokol23, @ppreshant, @ramiromagno, @Rengervn, @rjake, @roohitk, @struckma, @tjmahr, @weirichs, and @wurli.