We’re pleased to announce the release of dbplyr 2.0.0. dbplyr is a database backend for dplyr that allows you to use a remote database as if it was a collection of local data frames: you write ordinary dplyr code and dbplyr translates it to SQL for you.
You can install it from CRAN with:
install.packages("dbplyr")
This blog post covers the major improvements in this version:
-
dplyr 1.0.0 compatibility so you can now use
across()
,relocate()
,rename_with()
, and more. -
Major improvements to SQL translation.
-
A snazzy new logo from Allison Horst.
-
An improved extension system.
Please see the release notes for a full list of changes.
dplyr 1.0.0 compatibility
dbplyr now supports all relevant features added in dplyr 1.0.0:
-
across()
is now translated into individual SQL statements.lf <- lazy_frame(g = 1, a = 1, b = 2, c = 3) lf %>% group_by(g) %>% summarise(across(everything(), mean, na.rm = TRUE)) #> <SQL> #> SELECT `g`, AVG(`g`) AS `g`, AVG(`a`) AS `a`, AVG(`b`) AS `b`, AVG(`c`) AS `c` #> FROM `df` #> GROUP BY `g`
-
rename()
andselect()
support dplyr tidyselect syntax, apart from predicate functions which can’t easily work on computed queries. You can now userename_with()
to programmatically rename columns.lf <- lazy_frame(x1 = 1, x2 = 2, x3 = 3, y1 = 4, y2 = 3) lf %>% select(starts_with("x") & !"x3") #> <SQL> #> SELECT `x1`, `x2` #> FROM `df` lf %>% select(ends_with("2") | ends_with("3")) #> <SQL> #> SELECT `x2`, `y2`, `x3` #> FROM `df` lf %>% rename_with(toupper) #> <SQL> #> SELECT `x1` AS `X1`, `x2` AS `X2`, `x3` AS `X3`, `y1` AS `Y1`, `y2` AS `Y2` #> FROM `df`
-
relocate()
makes it easy to move columns around:lf <- lazy_frame(x1 = 1, x2 = 2, y1 = 4, y2 = 3) lf %>% relocate(starts_with("y")) #> <SQL> #> SELECT `y1`, `y2`, `x1`, `x2` #> FROM `df`
-
slice_min()
,slice_max()
, andslice_sample()
are now supported, andslice_head()
andslice_tail()
throw informative error messages (since they don’t make sense for databases).lf <- lazy_frame(g = rep(1:2, 5), x = 1:10) lf %>% group_by(g) %>% slice_min(x, prop = 0.5) #> <SQL> #> SELECT `g`, `x` #> FROM (SELECT `g`, `x`, CUME_DIST() OVER (PARTITION BY `g` ORDER BY `x`) AS `q01` #> FROM `df`) `q01` #> WHERE (`q01` <= 0.5) lf %>% group_by(g) %>% slice_sample(x, n = 10, with_ties = TRUE) #> <SQL> #> SELECT `g`, `x` #> FROM (SELECT `g`, `x`, ROW_NUMBER() OVER (PARTITION BY `g` ORDER BY random()) AS `q01` #> FROM `df`) `q01` #> WHERE (`q01` <= 10)
Note that these slices are translated into window functions, and because you can’t use a window function directly inside a
WHERE
clause, they must be wrapped in a subquery.
SQL translation
The dbplyr documentation now does a much better job of providing the details of its SQL translation. Each backend and each major verb has a documentation page giving the basics of the translation. This will hopefully make it much easier to learn what is and isn’t supported by dbplyr. Visit https://dbplyr.tidyverse.org/reference/index.html to see the new docs.
There are also many improvements to SQL generation. Here are a few of the most important:
-
Join functions gain an
na_matches
argument that allows you to control whether or notNA
(NULL
) values match otherNA
values. The default is"never"
, which is the usual behaviour in databases. You can setna_matches = "na"
to match R’s usual join behaviour.df1 <- tibble(x = c(1, 2, NA)) df2 <- tibble(x = c(NA, 1), y = 1:2) df1 %>% inner_join(df2, by = "x") #> # A tibble: 2 x 2 #> x y #> <dbl> <int> #> 1 1 2 #> 2 NA 1 db1 <- memdb_frame(x = c(1, 2, NA)) db2 <- memdb_frame(x = c(NA, 1), y = 1:2) db1 %>% inner_join(db2, by = "x") #> # Source: lazy query [?? x 2] #> # Database: sqlite 3.34.1 [:memory:] #> x y #> <dbl> <int> #> 1 1 2 db1 %>% inner_join(db2, by = "x", na_matches = "na") #> # Source: lazy query [?? x 2] #> # Database: sqlite 3.34.1 [:memory:] #> x y #> <dbl> <int> #> 1 NA 1 #> 2 1 2
This translation is powered by the new
sql_expr_matches()
generic, because every database seems to have a slightly different way to express this idea. Learn more at https://modern-sql.com/feature/is-distinct-from.db1 %>% inner_join(db2, by = "x") %>% show_query() #> <SQL> #> SELECT `LHS`.`x` AS `x`, `y` #> FROM `dbplyr_001` AS `LHS` #> INNER JOIN `dbplyr_002` AS `RHS` #> ON (`LHS`.`x` = `RHS`.`x`) db1 %>% inner_join(db2, by = "x", na_matches = "na") %>% show_query() #> <SQL> #> SELECT `LHS`.`x` AS `x`, `y` #> FROM `dbplyr_001` AS `LHS` #> INNER JOIN `dbplyr_002` AS `RHS` #> ON (`LHS`.`x` IS `RHS`.`x`)
-
Subqueries no longer include an
ORDER BY
clause. This is not part of the formal SQL specification so it has very limited support across databases. Now such queries generate a warning suggesting that you move yourarrange()
call later in the pipeline.lf <- lazy_frame(g = rep(1:2, each = 5), x = sample(1:10)) lf %>% group_by(g) %>% summarise(n = n()) %>% arrange(desc(n)) %>% filter(n > 1) #> Warning: ORDER BY is ignored in subqueries without LIMIT #> ℹ Do you need to move arrange() later in the pipeline or use window_order() instead? #> <SQL> #> SELECT * #> FROM (SELECT `g`, COUNT(*) AS `n` #> FROM `df` #> GROUP BY `g`) `q01` #> WHERE (`n` > 1.0)
As the warning suggests, there’s one exception:
ORDER BY
is still generated if aLIMIT
is present. Across databases, this tends to change which rows are returned, but not necessarily their order. -
dbplyr includes built-in backends for Redshift (which only differs from PostgreSQL in a few places) and SAP HANA. These require the development versions of RPostgres and odbc respectively.
lf <- lazy_frame(x = "a", y = "b", con = simulate_redshift()) lf %>% mutate(z = paste0(x, y)) #> <SQL> #> SELECT `x`, `y`, `x` || `y` AS `z` #> FROM `df`
There are a number of minor changes that affect the translation of individual functions. Here are a few of the most important:
-
All backends now translate
n()
tocount(*)
and support::
lf <- lazy_frame(x = 1:10) lf %>% summarise(n = dplyr::n()) #> <SQL> #> SELECT COUNT(*) AS `n` #> FROM `df`
-
PostgreSQL gets translations for lubridate period functions:
lf <- lazy_frame(x = Sys.Date(), con = simulate_postgres()) lf %>% mutate(year = x + years(1)) #> <SQL> #> SELECT `x`, `x` + CAST('1 years' AS INTERVAL) AS `year` #> FROM `df`
-
Oracle assumes version 12c is available so we can use a simpler translation for
head()
that works in more places:lf <- lazy_frame(x = 1, con = simulate_oracle()) lf %>% head(5) #> <SQL> #> SELECT * #> FROM (`df`) #> FETCH FIRST 5 ROWS ONLY
New logo
Thanks to the artistic talents of Allison Horst, dbplyr has a beautiful new logo:
Extensibility
Finally, dbplyr introduces a number of new generics to help tease apart the currently overly complicated relationship with dplyr. This should make creating new backends much easier, but does require some changes from existing backends. These changes should be invisible to the end user and will play out slowly over the next 12 months. See
vignette("backend-2", package = "dbplyr")
for details.
Acknowledgements
A big thanks to everyone who helped with this release by reporting bugs, discussing issues, and contributing code: @abalter, @adhi-r, @adithya604, @admoseremic, @ahmed-alhindawi, @alexfun, @alexkyllo, @alistaire47, @batpigandme, @BenCarlsen, @bengowan, @bersbersbers, @bertrandh, @bkkkk, @boshek, @bradenkinard, @cderv, @CerebralMastication, @chris-billingham, @cmichaud92, @cole-johanson, @copernican, @daattali, @Daveyr, @davidchall, @DavidPatShuiFong, @dereksonderegger, @dfrankow, @dkulp2, @dpprdan, @dsen6644, @DSLituiev, @EarlGlynn, @edgararuiz, @edoardomichielon, @elbamos, @ericemc3, @fahadshery, @fh-jgutman, @ftoresh, @GrayAlex49, @gregleleu, @hadley, @halpo, @hannes101, @hansvancalster, @hrbrmstr, @huelf, @iangow, @ianmcook, @jakeybob, @Janlow, @jarodmeng, @javierluraschi, @jerisalan, @jessekps, @jimhester, @jkylearmstrong, @jmerone, @jonkeane, @kmishra9, @kohleth, @kondofersky, @krlmlr, @lionel-, @lorenzwalthert, @LukasWallrich, @lukerobert, @lymanmark, @machow, @martin-a-wade, @mgirlich, @MikeJohnPage, @millerh1, @mkirzon, @moodymudskipper, @mskyttner, @Naareman, @natbprice, @okhoma, @OssiLehtinen, @PauloJhonny, @r2evans, @ramnathv, @returnString, @rjpat, @rlh1994, @robchallen, @roboton, @romainfrancois, @rundel, @saadaslam, @samssann, @samstiyer, @schradj, @sheepworrier, @shosaco, @shyams80, @stiberger, @stvrd, @tarunn90, @tedmoorman, @tgvaughan, @tonyk7440, @trevorcampbell, @TuomoNieminen, @tvedebrink, @vadimus202, @vnijs, @wangyuchen, @yitao-li, and @ZahraEconomist.