We’re chuffed to announce the release of dbplyr 2.4.0. dbplyr is a database backend for dplyr that allows you to use a remote database as if it was a collection of local data frames: you write ordinary dplyr code and dbplyr translates it to SQL for you.
You can install it from CRAN with:
install.packages("dbplyr")
This blog post will highlight some of the most important new features: eliminating subqueries when using multiple unions in a row, getting more control on the generated SQL, and a handful of new translations. As usual, release comes with a large number of improvements to translations for individual backends; see the full list in the release notes
SQL optimisation
dbplyr now produces fewer subqueries when combining tables with
union()
and
union_all()
resulting in shorter, more readable, and, in some cases, faster SQL.
lf1 <- lazy_frame(x = 1, y = "a", .name = "lf1")
lf2 <- lazy_frame(x = 1, y = "b", .name = "lf2")
lf3 <- lazy_frame(x = 1, z = "c", .name = "lf3")
lf1 |>
union(lf2) |>
union(lf3)
#> <SQL>
#> SELECT `lf1`.*, NULL AS `z`
#> FROM `lf1`
#>
#> UNION
#>
#> SELECT `lf2`.*, NULL AS `z`
#> FROM `lf2`
#>
#> UNION
#>
#> SELECT `x`, NULL AS `y`, `z`
#> FROM `lf3`
(As usual in these blog posts, I’m using
lazy_frame()
to focus on the SQL generation, without having to set up a dummy database.)
Similarly, a semi/anti_join()
on a filtered table now avoids a subquery:
lf1 |>
semi_join(lf3 |> filter(z == "c"), join_by(x))
#> <SQL>
#> SELECT `lf1`.*
#> FROM `lf1`
#> WHERE EXISTS (
#> SELECT 1 FROM `lf3`
#> WHERE (`lf1`.`x` = `lf3`.`x`) AND (`lf3`.`z` = 'c')
#> )
SQL generation
The new argument sql_options
for
show_query()
and
remote_query()
gives you more control on the generated SQL.
-
By default dbplyr uses
*
to select all columns of a table, but withuse_star = FALSE
all columns are selected explicitly:lf3 <- lazy_frame(x = 1, y = 2, z = 3, .name = "lf3") lf3 |> mutate(a = 4) #> <SQL> #> SELECT `lf3`.*, 4.0 AS `a` #> FROM `lf3` lf3 |> mutate(a = 4) |> show_query(sql_options = sql_options(use_star = FALSE)) #> <SQL> #> SELECT `x`, `y`, `z`, 4.0 AS `a` #> FROM `lf3`
-
If you prefer common table expressions (CTE) over subqueries use
cte = TRUE
:nested_query <- lf3 |> mutate(z = z + 1) |> left_join(lf2, by = join_by(x, y)) nested_query #> <SQL> #> SELECT `LHS`.* #> FROM ( #> SELECT `x`, `y`, `z` + 1.0 AS `z` #> FROM `lf3` #> ) AS `LHS` #> LEFT JOIN `lf2` #> ON (`LHS`.`x` = `lf2`.`x` AND `LHS`.`y` = `lf2`.`y`) nested_query |> show_query(sql_options = sql_options(cte = TRUE)) #> <SQL> #> WITH `q01` AS ( #> SELECT `x`, `y`, `z` + 1.0 AS `z` #> FROM `lf3` #> ) #> SELECT `LHS`.* #> FROM `q01` AS `LHS` #> LEFT JOIN `lf2` #> ON (`LHS`.`x` = `lf2`.`x` AND `LHS`.`y` = `lf2`.`y`)
-
And if you want that all columns in a join are qualified with the table name and not only the ambiguous ones use
qualify_all_columns = TRUE
:qualify_columns <- lf2 |> left_join(lf3, by = join_by(x, y)) qualify_columns #> <SQL> #> SELECT `lf2`.*, `z` #> FROM `lf2` #> LEFT JOIN `lf3` #> ON (`lf2`.`x` = `lf3`.`x` AND `lf2`.`y` = `lf3`.`y`) qualify_columns |> show_query(sql_options = sql_options(qualify_all_columns = TRUE)) #> <SQL> #> SELECT `lf2`.*, `lf3`.`z` AS `z` #> FROM `lf2` #> LEFT JOIN `lf3` #> ON (`lf2`.`x` = `lf3`.`x` AND `lf2`.`y` = `lf3`.`y`)
New translations
str_detect()
, str_starts()
and str_ends()
with fixed patterns are translated to INSTR()
:
lf1 |>
filter(
stringr::str_detect(x, stringr::fixed("abc")),
stringr::str_starts(x, stringr::fixed("a"))
)
#> <SQL>
#> SELECT `lf1`.*
#> FROM `lf1`
#> WHERE (INSTR(`x`, 'abc') > 0) AND (INSTR(`x`, 'a') = 1)
And
nzchar()
and
runif()
are now translated to their SQL equivalents:
lf1 |>
filter(nzchar(x)) |>
mutate(z = runif())
#> <SQL>
#> SELECT `lf1`.*, RANDOM() AS `z`
#> FROM `lf1`
#> WHERE (((`x` IS NULL) OR `x` != ''))
Acknowledgements
The vast majority of this release (particularly the SQL optimisations) are from Maximilian Girlich; thanks so much for continued work on this package! And a big thanks go to the 84 other folks who helped out by filing issues and contributing code: @abalter, @ablack3, @andreassoteriadesmoj, @apalacio9502, @avsdev-cw, @bairdj, @bastistician, @brownj31, @But2ene, @carlganz, @catalamarti, @CEH-SLU, @chriscardillo, @DavisVaughan, @DaZaM82, @donour, @edgararuiz, @eduardszoecs, @eipi10, @ejneer, @erikvona, @fh-afrachioni, @fh-mthomson, @gui-salome, @hadley, @halpo, @homer3018, @iangow, @jdlom, @jennal-datacenter, @JeremyPasco, @jiemakel, @jingydz, @johnbaums, @joshseiv, @jrandall, @khkk378, @kmishra9, @kongdd, @krlmlr, @krprasangdas, @KRRLP-PL, @lentinj, @lgaborini, @lhabegger, @lorenzolightsgdwarf, @lschneiderbauer, @marianschmidt, @matthewjnield, @mgirlich, @MichaelChirico, @misea, @mjbroerman, @moodymudskipper, @multimeric, @nannerhammix, @nikolasharing, @nviets, @nviraj, @oobd, @pboesu, @pepijn-devries, @rbcavanaugh, @rcepka, @robertkck, @samssann, @SayfSaid, @scottporter, @shearerpmm, @srikanthtist, @stemangiola, @stephenashton-dhsc, @stevepowell99, @TBlackmore, @thomashulst, @thothal, @tilo-aok, @tisseuil, @tonyk7440, @TSchiefer, @Tsemharb, @tuge98, @vadim-cherepanov, and @wdenton.