We’re tickled pink to announce the release of version 0.8.0 of dplyr, the grammar of data manipulation in the tidyverse.
This is a major update that has kept us busy for almost a year. We take the coincidence of a Valentine’s Day release as a sign of continuous ❤️ for dplyr's approach to tidy data manipulation.
Important changes are discussed in detail in the pre-release post. We are grateful to members of the community for their feedback over the last couple of months; it has been tremendously useful in making the release process smoother.
The bulk of the changes are internal, and part of an ongoing effort to make the codebase more robust and less surprising. This is an investment that will continue to pay off for years, and serve as a foundation for more innovations in the future.
For a comprehensive list of changes, please see the NEWS for the 0.8.0 release. The sections below discuss the main changes.
Group hug
Grouping has always been at the center of what dplyr is about. This release expands on the existing group_by() with a set of experimental functions that offer a variety of perspectives on the notion of grouping.
We believe they offer new and unique possibilities, but we welcome community feedback and use cases before we put a 💍 on them. Let’s illustrate them with a subset of the well-known gapminder data.
library(dplyr)

# Oceania rows of gapminder, with years counted from 1952, grouped by country
oceania <- gapminder::gapminder %>%
  filter(continent == "Oceania") %>%
  mutate(yr1952 = year - 1952) %>%
  select(-continent) %>%
  group_by(country)
oceania
#> # A tibble: 24 x 6
#> # Groups: country [2]
#> country year lifeExp pop gdpPercap yr1952
#> <fct> <int> <dbl> <int> <dbl> <dbl>
#> 1 Australia 1952 69.1 8691212 10040. 0
#> 2 Australia 1957 70.3 9712569 10950. 5
#> 3 Australia 1962 70.9 10794968 12217. 10
#> 4 Australia 1967 71.1 11872264 14526. 15
#> 5 Australia 1972 71.9 13177000 16789. 20
#> 6 Australia 1977 73.5 14074100 18334. 25
#> 7 Australia 1982 74.7 15184200 19477. 30
#> 8 Australia 1987 76.3 16257249 21889. 35
#> 9 Australia 1992 77.6 17481977 23425. 40
#> 10 Australia 1997 78.8 18565243 26998. 45
#> # … with 14 more rows
- group_nest() is similar to tidyr::nest(), but focuses on the variables to nest by instead of the nested columns.
oceania %>%
  group_nest()
#> # A tibble: 2 x 2
#> country data
#> <fct> <list>
#> 1 Australia <tibble [12 × 5]>
#> 2 New Zealand <tibble [12 × 5]>
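As a rough sketch of the "variables to nest by" idea, assuming dplyr 0.8.0 or later and using the built-in mtcars data as a stand-in (it is not part of the example above), group_nest() can also be given the grouping variable directly on an ungrouped data frame:

```r
library(dplyr)

# Name the variable to nest *by*; every other column of mtcars
# ends up inside the list-column `data`.
nested <- mtcars %>%
  group_nest(cyl)

nested
names(nested)  # the key column followed by the list-column
```

With tidyr::nest() you would instead describe which columns go into the nested data frames.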
- group_split() is a tidy version of base::split(). In particular, it respects a group_by()-like grouping specification, and refuses to name its result.
oceania %>%
  group_split()
#> [[1]]
#> # A tibble: 12 x 6
#> country year lifeExp pop gdpPercap yr1952
#> <fct> <int> <dbl> <int> <dbl> <dbl>
#> 1 Australia 1952 69.1 8691212 10040. 0
#> 2 Australia 1957 70.3 9712569 10950. 5
#> 3 Australia 1962 70.9 10794968 12217. 10
#> 4 Australia 1967 71.1 11872264 14526. 15
#> 5 Australia 1972 71.9 13177000 16789. 20
#> 6 Australia 1977 73.5 14074100 18334. 25
#> 7 Australia 1982 74.7 15184200 19477. 30
#> 8 Australia 1987 76.3 16257249 21889. 35
#> 9 Australia 1992 77.6 17481977 23425. 40
#> 10 Australia 1997 78.8 18565243 26998. 45
#> 11 Australia 2002 80.4 19546792 30688. 50
#> 12 Australia 2007 81.2 20434176 34435. 55
#>
#> [[2]]
#> # A tibble: 12 x 6
#> country year lifeExp pop gdpPercap yr1952
#> <fct> <int> <dbl> <int> <dbl> <dbl>
#> 1 New Zealand 1952 69.4 1994794 10557. 0
#> 2 New Zealand 1957 70.3 2229407 12247. 5
#> 3 New Zealand 1962 71.2 2488550 13176. 10
#> 4 New Zealand 1967 71.5 2728150 14464. 15
#> 5 New Zealand 1972 71.9 2929100 16046. 20
#> 6 New Zealand 1977 72.2 3164900 16234. 25
#> 7 New Zealand 1982 73.8 3210650 17632. 30
#> 8 New Zealand 1987 74.3 3317166 19007. 35
#> 9 New Zealand 1992 76.3 3437674 18363. 40
#> 10 New Zealand 1997 77.6 3676187 21050. 45
#> 11 New Zealand 2002 79.1 3908037 23190. 50
#> 12 New Zealand 2007 80.2 4115771 25185. 55
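To make the contrast with base::split() concrete, here is a small sketch on the built-in iris data (an illustration chosen here, assuming dplyr 0.8.0 or later). It only compares how the two results are named:

```r
library(dplyr)

# base::split() names each piece after the factor level
pieces_base <- split(iris, iris$Species)

# group_split() takes a group_by()-like specification and returns
# an unnamed list of tibbles
pieces_tidy <- iris %>%
  group_split(Species)

names(pieces_base)
names(pieces_tidy)
```

Because the result is unnamed, the group a piece belongs to has to be read from its columns, or from group_keys() on the grouped data.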
- group_map() and group_walk() offer a way to iterate on groups of a grouped data frame.
oceania %>%
  mutate(yr1952 = year - 1952) %>%
  group_map(~ broom::tidy(stats::lm(lifeExp ~ yr1952, data = .x)))
#> # A tibble: 4 x 6
#> # Groups: country [2]
#> country term estimate std.error statistic p.value
#> <fct> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Australia (Intercept) 68.4 0.337 203. 2.07e-19
#> 2 Australia yr1952 0.228 0.0104 21.9 8.67e-10
#> 3 New Zealand (Intercept) 68.7 0.437 157. 2.66e-18
#> 4 New Zealand yr1952 0.193 0.0135 14.3 5.41e- 8
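group_walk() is the side-effect counterpart: it calls the function on each group and passes the data through. The sketch below, which uses the built-in iris data and a hypothetical output folder under tempdir(), writes one CSV file per group. Note that in dplyr releases after 0.8.0, group_map() returns a list and the tibble-returning behaviour shown above moved to group_modify(), so check the version you have installed.

```r
library(dplyr)

# Hypothetical output folder, created under the session's temp directory
out_dir <- file.path(tempdir(), "iris_by_species")
dir.create(out_dir, showWarnings = FALSE)

iris %>%
  group_by(Species) %>%
  # .x is the data for one group, .y is a one-row tibble holding its key
  group_walk(~ write.csv(
    .x,
    file.path(out_dir, paste0(.y$Species, ".csv")),
    row.names = FALSE
  ))

list.files(out_dir)
```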
- group_data(), group_rows(), and group_keys() expose the grouping information, which has been restructured as a tibble.
oceania %>%
  group_data()
#> # A tibble: 2 x 2
#> country .rows
#> <fct> <list>
#> 1 Australia <int [12]>
#> 2 New Zealand <int [12]>
oceania %>%
  group_keys()
#> # A tibble: 2 x 1
#> country
#> <fct>
#> 1 Australia
#> 2 New Zealand
oceania %>%
  group_rows()
#> [[1]]
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12
#>
#> [[2]]
#> [1] 13 14 15 16 17 18 19 20 21 22 23 24
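The three accessors describe the same structure from different angles. As a minimal sketch on the built-in mtcars data (again a stand-in, assuming dplyr 0.8.0 or later), the per-group sizes can be rebuilt from the keys and the row indices alone:

```r
library(dplyr)

by_cyl <- mtcars %>%
  group_by(cyl)

keys <- group_keys(by_cyl)  # one row per group
rows <- group_rows(by_cyl)  # one integer vector of row indices per group

# group_data() holds both: the key columns plus a `.rows` list-column
gd <- group_data(by_cyl)
names(gd)

# Rebuild the group sizes from the row indices
tibble(cyl = keys$cyl, n = lengths(rows))
```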
- group_by() gains a .drop argument which you can set to FALSE to respect empty groups associated with factors (more on this below).
Give factors some love
The internal grouping algorithm has been redesigned to make it possible to better respect factor levels and empty groups. To limit the disruption, we have not made this the default behaviour. To keep empty groups, you have to set group_by()'s .drop argument to FALSE.
This can make data manipulation more predictable and reliable, because when factors are involved, the groups are based on the levels of the factors, rather than which levels have data points.
Let’s illustrate this with our favourite flowers 💐, and a function, species_count(), that counts the number of each species after a filter(), and structures it as a tibble with one column per species.
species_count <- function(...) {
  iris %>%
    filter(...) %>%
    group_by(Species, .drop = FALSE) %>%
    summarise(n = n()) %>%
    tidyr::spread(Species, n)
}
Because we use .drop = FALSE, we get one column per level of the factor, even when there’s no data associated with a level:
species_count(Petal.Length > 3)
#> # A tibble: 1 x 3
#> setosa versicolor virginica
#> <int> <int> <int>
#> 1 0 49 50
species_count(Petal.Length > 6.5)
#> # A tibble: 1 x 3
#> setosa versicolor virginica
#> <int> <int> <int>
#> 1 0 0 4
species_count(Petal.Length > 42)
#> # A tibble: 1 x 3
#> setosa versicolor virginica
#> <int> <int> <int>
#> 1 0 0 0
Getting a 0 rather than a missing column makes it much easier to combine multiple results:
limits <- seq(0, 8, by = .5)
limits %>%
  purrr::map_dfr(~ species_count(Petal.Length > .x)) %>%
  # label each row with the Petal.Length threshold it was computed for
  mutate(Petal.Length = limits) %>%
  select(Petal.Length, everything())
#> # A tibble: 17 x 4
#> Petal.Length setosa versicolor virginica
#> <dbl> <int> <int> <int>
#> 1 0 50 50 50
#> 2 0.5 50 50 50
#> 3 1 49 50 50
#> 4 1.5 13 50 50
#> 5 2 0 50 50
#> 6 2.5 0 50 50
#> 7 3 0 49 50
#> 8 3.5 0 45 50
#> 9 4 0 34 50
#> 10 4.5 0 14 49
#> 11 5 0 1 41
#> 12 5.5 0 0 25
#> 13 6 0 0 9
#> 14 6.5 0 0 4
#> 15 7 0 0 0
#> 16 7.5 0 0 0
#> 17 8 0 0 0
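For contrast, here is a small sketch of what the default behaviour does with the same kind of filter. The object names are illustrative, and it assumes the default of dropping empty groups on an ungrouped data frame, as described above:

```r
library(dplyr)

big_petals <- iris %>%
  filter(Petal.Length > 6.5)

# Default: only levels that still have rows form a group
dropped <- big_petals %>%
  group_by(Species) %>%
  summarise(n = n())

# .drop = FALSE: every level of the factor forms a group, empty or not
kept <- big_petals %>%
  group_by(Species, .drop = FALSE) %>%
  summarise(n = n())

dropped
kept
```

With the default, the number of rows in the summary depends on the filter, which is what makes combining results across many thresholds awkward.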
Thanks
Thanks to all contributors for this release.
@abouf, @adisarid, @adrfantini, @aetiologicCanada, @afdta, @albertomv83, @alistaire47, @aloes2512, @andresimi, @antaldaniel, @AnthonyEbert, @ArtemSokolov, @AshesITR, @bakaburg1, @batpigandme, @bbachrach, @bbolker, @behrman, @BenjaminLouis, @bifouba, @billdenney, @bnicenboim, @BobMuenchen, @brooke-watson, @CarolineBarret, @cbailiss, @CerebralMastication, @cfhammill, @cfry-propeller, @choisy, @ChrisBeeley, @chrsigg, @clauswilke, @ClaytonJY, @colearendt, @ColinFay, @coolbutuseless, @Copepoda, @cpsievert, @dah33, @damianooldoni, @DanChaltiel, @danyal123, @DavisVaughan, @Demetrio92, @dewoller, @dfalbel, @DiogoFerrari, @dirkschumacher, @dmenne, @dmvianna, @dongzhuoer, @earowang, @echasnovski, @eddelbuettel, @EdwinTh, @eijoac, @elbersb, @Eli-Berkow, @EmilHvitfeldt, @epetrovski, @erblast, @etienne-s, @foundinblank, @FrancoisGuillem, @geotheory, @ggrothendieck, @GoldbergData, @gowerc, @grayskripko, @GrimTrigger88, @grizzthepro64, @hadley, @hafen, @heavywatal, @helix123, @henrikmidtiby, @hpeaker, @htc502, @hughjonesd, @ignacio82, @igoldin2u, @igordot, @ilarischeinin, @Ilia-Kosenkov, @IndrajeetPatil, @ipofanes, @jasonmhoule, @jayhesselberth, @jennybc, @jepusto, @jflynn264, @jialu512, @JiaxiangBU, @jimhester, @jkylearmstrongibx, @jnolis, @JohnMount, @jonkeane, @jonthegeek, @jschelbert, @jsekamane, @jtelleria, @kendonB, @kevinykuo, @krlmlr, @langbe, @ldecicco-USGS, @leungi, @libbieweimer, @lionel-, @liz-is, @lloven, @ltrgoddard, @luccastermans, @maicel1978, @Make42, @MalditoBarbudo, @markdly, @markvanderloo, @mattbk, @maxheld83, @melissakey, @mem48, @mgirlich, @mikmart, @MilesMcBain, @minhsphuc12, @mkoohafkan, @momeara, @moodymudskipper, @move[bot], @nealpsmith, @NightWinkle, @o1iv3r, @PascalKieslich, @petermeissner, @peterzsohar, @philstraforelli, @PMassicotte, @PPICARDO, @privefl, @prokulski, @quartin, @rabutler-usbr, @ramongallego, @randomgambit, @rappster, @rensa, @reshmamena, @richard987, @richierocks, @RickPack, @riship2009, @RobertMyles, @romainfrancois, @rontomer, 
@roumail, @rozsoma, @rundel, @rupesh2017, @s-fleck, @S-UP, @salmansyed0709, @schloerke, @seasmith, @sharlagelfand, @shizidushu, @simon-anasta, @skaltman, @skylarhopkins, @sowla, @statsccpr, @stenhaug, @streamline55, @stuartE9, @stufield, @suzanbaert, @sverchkov, @thackl, @the-knife, @ThiAmm, @thisisnic, @tinyheero, @tmelconian, @tobadia, @tonyelhabr, @torbjorn, @trueNico, @tungmilan, @TylerGrantSmith, @ukkonen, @vincentanutama, @vnijs, @wanfahmi, @waynelapierre, @wch, @wdenton, @wgrundlingh, @wmayner, @wolski, @yiqinfu, @yutannihilation, @Zanidean, @Zedseayou, @zslajchrt, @zx8754, and @zzygyx9119.