I’m very excited to announce that bigrquery 1.0.0 is now on CRAN. This package makes it easy to work with data stored in Google BigQuery, a hosted database for big data. The bigrquery package provides three levels of abstraction on top of BigQuery:
- The low-level API provides thin wrappers over the underlying REST API. In this version, all the low-level functions start with bq_, and mostly have the form bq_noun_verb(). This level of abstraction is most appropriate if you’re familiar with the REST API and you want to do something not supported in the higher-level APIs.
- The DBI interface wraps the low-level API and makes working with BigQuery like working with any other database system. This is the most convenient layer if you want to execute SQL queries in BigQuery or upload smaller amounts (i.e. <100 MB) of data.
- The dplyr interface lets you treat BigQuery tables as if they are in-memory data frames. This is the most convenient layer if you don’t want to write SQL, but instead want dbplyr to write it for you.
Install it with:
install.packages("bigrquery")
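To give a feel for how the three levels relate, here is a minimal sketch that computes the same row count at each level. It assumes a basedata.mtcars table like the one in my test project, and uses bq_project_query() from the new low-level API described later in this post:
library(bigrquery)
con <- DBI::dbConnect(bigquery(), project = bq_test_project())

# DBI: write the SQL yourself
DBI::dbGetQuery(con, "SELECT count(*) AS n FROM basedata.mtcars")

# dplyr: let dbplyr generate the SQL for you
dplyr::count(dplyr::tbl(con, "basedata.mtcars"))

# Low-level: run the query as a job, then download the result
tb <- bq_project_query(bq_test_project(), "SELECT count(*) AS n FROM basedata.mtcars")
bq_table_download(tb)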
Four big changes in this version of bigrquery are described in detail below:
- Support for repeated and nested fields.
- Easier cross-dataset queries.
- Greatly improved download speeds.
- A new low-level API.
There are also many smaller improvements and bug fixes, as described in the release notes.
Nested and repeated fields
One of the neatest things about BigQuery is that it supports nested and repeated fields, which are also called structs (or records) and arrays. bigrquery now supports those types of fields, reading them into list-columns:
- Repeated values become list-columns containing vectors.
- Nested values become list-columns containing named lists.
- Repeated nested values become list-columns containing data frames.
The following code illustrates the output for the two most important types: an array, and an array of structs:
library(bigrquery)
con <- DBI::dbConnect(bigquery(), project = bq_test_project())
sql <- "SELECT
[1, 2, 3] as list,
[STRUCT(1 as a, 'a' as b), STRUCT(2, 'b'), STRUCT(3, 'c')] as df
"
out <- DBI::dbGetQuery(con, sql)
#> Auto-refreshing stale OAuth token.
out
#> # A tibble: 1 x 2
#> list df
#> <list> <list>
#> 1 <int [3]> <tibble [3 × 2]>
out$list[[1]]
#> [1] 1 2 3
out$df[[1]]
#> # A tibble: 3 x 2
#> a b
#> <int> <chr>
#> 1 1 a
#> 2 2 b
#> 3 3 c
Note that results are now returned as tibbles, not data frames, because the base print method does not handle list-columns well. If for some reason you do need a data frame, use as.data.frame() to convert back.
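For example, a trivial sketch of that round trip, using the out tibble from above:
out_df <- as.data.frame(out)
# out_df is a plain data frame, but the list-columns are still there
out_df$df[[1]]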
This work has made me think a bunch about list-columns and df-columns (which might be a better fit for non-repeated nested fields). The germination of these ideas is likely to have widespread (if subtle) influence throughout the tidyverse, with initial impacts most likely to be felt in tidyr.
Cross-dataset queries
In the previous version of bigrquery it was difficult to perform queries across datasets because dbConnect() forced you to specify a dataset and only allowed you to reference tables within that dataset. Now the dataset is optional (the only required argument to DBI::dbConnect() is a project to bill) and both the DBI and dplyr interfaces accept qualified table names: either dataset.table or project.dataset.table.
The following example demonstrates with my test project, which contains a basedata dataset containing the mtcars table:
library(bigrquery)
con <- DBI::dbConnect(bigquery(), project = bq_test_project())
mtcars1 <- DBI::dbReadTable(con, "basedata.mtcars")
head(mtcars1)
#> # A tibble: 6 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 24.4 4. 147. 62. 3.69 3.19 20.0 1. 0. 4. 2.
#> 2 21.5 4. 120. 97. 3.70 2.46 20.0 1. 0. 3. 1.
#> 3 18.1 6. 225. 105. 2.76 3.46 20.2 1. 0. 3. 1.
#> 4 21.4 6. 258. 110. 3.08 3.22 19.4 1. 0. 3. 1.
#> 5 15.2 8. 276. 180. 3.07 3.78 18.0 0. 0. 3. 3.
#> 6 17.3 8. 276. 180. 3.07 3.73 17.6 0. 0. 3. 3.
mtcars2 <- dplyr::tbl(con, "basedata.mtcars")
head(mtcars2)
#> # Source: lazy query [?? x 11]
#> # Database: BigQueryConnection
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 24.4 4. 147. 62. 3.69 3.19 20.0 1. 0. 4. 2.
#> 2 21.5 4. 120. 97. 3.70 2.46 20.0 1. 0. 3. 1.
#> 3 18.1 6. 225. 105. 2.76 3.46 20.2 1. 0. 3. 1.
#> 4 21.4 6. 258. 110. 3.08 3.22 19.4 1. 0. 3. 1.
#> 5 15.2 8. 276. 180. 3.07 3.78 18.0 0. 0. 3. 3.
#> 6 17.3 8. 276. 180. 3.07 3.73 17.6 0. 0. 3. 3.
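Qualified names make genuine cross-dataset queries straightforward too. Here is a sketch of a join across datasets; the second dataset and its lookup table are made up purely for illustration, so substitute names that exist in your project:
# Hypothetical dataset and table; illustrative only
gears <- dplyr::tbl(con, "otherdata.gear_labels")
cars  <- dplyr::tbl(con, "basedata.mtcars")

# dbplyr generates a single SQL query joining the two qualified tables
dplyr::inner_join(cars, gears, by = "gear")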
Improved download speeds
The system for downloading data from BigQuery into R has been rewritten from the ground up to considerably improve performance:
- By default, data is downloaded from BigQuery in pages of 10,000 rows. Previously, bigrquery downloaded then parsed each page. Now, bigrquery downloads all pages, then parses all pages. This means that you’ll now see two progress bars: one for downloading JSON from BigQuery and one for parsing that JSON into a data frame.
- Because all pages are downloaded in a single pass, we can now download in parallel, using up to 6 simultaneous connections by default. This generally doesn’t result in a six-fold speed up, but should at least double download speed.
- The parsing code has been rewritten in C++. This eliminates several expensive intermediate computations, and means that bigrquery no longer requires readr.
All up, I can now download the first million rows of publicdata.samples.natality in about a minute, about 8x faster than the previous version. This data frame takes up 170 MB of space in BigQuery and 140 MB of memory in R, so a minute to download doesn’t seem unreasonable. The bottleneck for loading BigQuery data is now parsing BigQuery’s JSON format, which is difficult to optimise further because I’m already using the fastest C++ JSON parser, RapidJson. If this is still too slow (because you download a lot of data), see ?bq_table_download for an alternative approach.
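If you want to tune the trade-offs yourself, the paging and parallelism are exposed as arguments. The sketch below shows roughly how that looks; check ?bq_table_download for the exact argument names and defaults:
tb <- bq_table("publicdata", "samples", "natality")

# Bigger pages mean fewer HTTP requests; max_connections caps the parallel downloads
df <- bq_table_download(
  tb,
  max_results = 1e6,
  page_size = 2e4,
  max_connections = 6
)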
Low-level API
The low-level API has been completely overhauled to make it easier to use. The primary motivation was to make bigrquery development more enjoyable for me, but it should also be helpful to you when you need to go outside of the features provided by the higher-level DBI and dplyr interfaces.
- Consistent naming scheme: All API functions now have the form bq_object_verb(), e.g. bq_table_create() or bq_dataset_delete().
- S3 classes: bq_table(), bq_dataset(), bq_job(), bq_field() and bq_fields() constructors create S3 objects corresponding to important BigQuery objects. These are paired with as_ coercion functions and used throughout the new API (see the sketch below for how the pieces fit together).
- Easier local testing: New bq_test_project() and bq_test_dataset() make it easier to run bigrquery tests locally. To run the tests yourself, you need to create a BigQuery project, and then follow the instructions in ?bq_test_project.
- More efficient data transfer: The new API makes extensive use of the fields query parameter, ensuring that functions only download data that they actually use.
- Tighter GCS connections: New bq_table_load() loads data from a Google Cloud Storage URI, pairing with bq_table_save() which saves data to a GCS URI.
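To make the naming scheme and the S3 classes concrete, here is a small sketch that round-trips a data frame through a scratch table. The dataset and table names are purely illustrative, so substitute ones you own:
library(bigrquery)

# A bq_table object is just project + dataset + table
tb <- bq_table(bq_test_project(), "basedata", "mtcars_copy")

if (bq_table_exists(tb)) bq_table_delete(tb)

# as_bq_fields() derives a schema from the data frame
bq_table_create(tb, fields = as_bq_fields(mtcars))
bq_table_upload(tb, mtcars)

df <- bq_table_download(tb)
bq_table_delete(tb)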
The old API has been soft-deprecated - it will continue to work, but no further development will occur (including bug fixes). It will be formally deprecated in the next version, and then removed in the version after that.
Acknowledgements
A big thanks goes out to all 41 users who contributed issues, pull requests, and comments since the last release: @alex-danilin, @aschwartzSGI, @aumdavis, @barnettjacob, @batpigandme, @blakeyc, @bulam, @byapparov, @c3212218, @craigcitro, @czeildi, @dan-booth, @edgararuiz, @EricGoldsmith, @everron, @haavardw, @hadley, @hidekoji, @inkrement, @j450h1, @jarodmeng, @jennybc, @JHowix, @krlmlr, @ldanai, @leggitta, @michaelquinn32, @mpancia, @ned2, @pcejrowski, @Praxiteles, @ras44, @realAkhmed, @robincrlee, @selesnow, @TerryZhangHL, @VictArt, @vivshume, @zippeurfou, @zkostitsyn, and @Zsedo