We’re thrilled to announce that dtplyr 1.2.0 is now on CRAN. dtplyr gives you the speed of data.table with the syntax of dplyr; you write dplyr (and tidyr) code and dtplyr translates it to the data.table equivalent.
You can install dtplyr from CRAN with:
install.packages("dtplyr")
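If you haven't used dtplyr before, the basic workflow looks roughly like the sketch below. The mtcars pipeline is only an illustration chosen for this example and isn't specific to this release: you wrap a data frame with lazy_dt(), write ordinary dplyr code, and then either inspect the translation with show_query() or collect the result with as_tibble().

```r
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

# Wrap a data frame in a lazy data.table; nothing is computed yet
cars <- lazy_dt(mtcars)

# Ordinary dplyr verbs are recorded and translated to data.table code
by_cyl <- cars %>%
  filter(wt < 5) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))

# Inspect the generated data.table call
by_cyl %>% show_query()

# Force the computation and get a tibble back
result <- as_tibble(by_cyl)
```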
I’ll discuss three major changes in this blog post:
- New authors
- New tidyr translations
- Improvements to join translations
There are also over 20 minor improvements to the quality of translations; you can see a full list in the release notes.
New authors
The biggest news in this release is the addition of three new authors: Mark Fairbanks, Maximilian Girlich, and Ryan Dickerson are now dtplyr authors in recognition of their significant and sustained contributions. In fact, they implemented the bulk of the improvements in this release!
tidyr translations
dtplyr gains translations for many more tidyr verbs, including complete(), drop_na(), expand(), fill(), nest(), pivot_longer(), replace_na(), and separate(). A few examples are shown below:
dt <- lazy_dt(data.frame(x = c(NA, "x.y", "x.z", "y.z")))
dt %>%
  separate(x, c("A", "B"), sep = "\\.", remove = FALSE) %>%
  show_query()
#> copy(`_DT1`)[, `:=`(c("A", "B"), tstrsplit(x, split = "\\."))]

dt <- lazy_dt(data.frame(x = c(1, NA, NA, 2, NA)))
dt %>%
  fill(x) %>%
  show_query()
#> copy(`_DT2`)[, `:=`(x = nafill(x, "locf"))]

dt %>%
  replace_na(list(x = 99)) %>%
  show_query()
#> copy(`_DT2`)[, `:=`(x = fcoalesce(x, 99))]

dt <- lazy_dt(relig_income)
dt %>%
  pivot_longer(!religion, names_to = "income", values_to = "count") %>%
  show_query()
#> melt(`_DT3`, measure.vars = c("<$10k", "$10-20k", "$20-30k",
#> "$30-40k", "$40-50k", "$50-75k", "$75-100k", "$100-150k", ">150k",
#> "Don't know/refused"), variable.name = "income", value.name = "count",
#> variable.factor = FALSE)
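The examples above only print the translation. To actually run one of the new tidyr verbs, you collect the result as usual. Here's a minimal sketch with drop_na(); the toy data frame and the column names g and x are made up for illustration.

```r
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

# Hypothetical toy data with one missing value in x
dt <- lazy_dt(data.frame(g = c("a", "a", "b"), x = c(1, NA, 3)))

# drop_na() is recorded lazily, like any other translated verb
no_missing <- dt %>% drop_na(x)

# Look at the data.table code that dtplyr generated
no_missing %>% show_query()

# Collect the result; the row with the missing x is gone
res <- as_tibble(no_missing)
```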
Improvements to joins
The join functions have been overhauled: inner_join(), left_join(), and right_join() now all produce a call to [, rather than to merge():
dt1 <- lazy_dt(data.frame(x = 1:3))
dt2 <- lazy_dt(data.frame(x = 2:3, y = c("a", "b")))
dt1 %>% inner_join(dt2, by = "x") %>% show_query()
#> `_DT4`[`_DT5`, on = .(x), nomatch = NULL, allow.cartesian = TRUE]
dt1 %>% left_join(dt2, by = "x") %>% show_query()
#> `_DT5`[`_DT4`, on = .(x), allow.cartesian = TRUE]
dt2 %>% right_join(dt1, by = "x") %>% show_query()
#> `_DT5`[`_DT4`, on = .(x), allow.cartesian = TRUE]
This can make the translation a little longer for simple joins, but it greatly simplifies the underlying code. This simplification has made it easier to more closely match dplyr behaviour for column order, handling named by specifications, Cartesian joins with by = character(), and managing duplicated variable names.
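To see how a named by specification plays out, here's a small sketch in which the key is called x on the left and z on the right. The tables and column names are invented for this example, and the exact translation printed by show_query() may differ between dtplyr versions, so no output is shown.

```r
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

# Hypothetical tables whose key columns have different names
dt_left <- lazy_dt(data.frame(x = 1:3))
dt_right <- lazy_dt(data.frame(z = 2:3, y = c("a", "b")))

# A named `by` matches x in the left table to z in the right table
joined <- dt_left %>% left_join(dt_right, by = c("x" = "z"))

# Inspect the generated data.table code
joined %>% show_query()

# Collect the result; as in dplyr, every row of the left table is kept
res <- as_tibble(joined)
```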
Acknowledgements
As always, tidyverse packages wouldn’t be possible without the community, so a big thanks goes out to all 35 folks who helped to make this release a reality: @akr-source, @batpigandme, @bguillod, @cgoo4, @chenx2018, @D-Se, @eutwt, @hadley, @jatherrien, @jdmoralva, @jennybc, @jtlandis, @kmishra9, @lutzgruber, @lutzgruber-quantco, @markfairbanks, @mgirlich, @mrcaseb, @nassuphis, @nigeljmckernan, @NZambranoc, @PMassicotte, @psads-git, @quid-agis, @romainfrancois, @roni-fultheim, @samlipworth, @sanjmeh, @sbashevkin, @StatsGary, @torema-ed, @verajosemanuel, @Waldi73, @wurli, and @yiugn.