bigrquery 1.0.0

  dbplyr, bigrquery, databases

  Hadley Wickham

I’m very excited to announce that bigrquery 1.0.0 is now on CRAN. This package makes it easy to work with data stored in Google BigQuery, a hosted database for big data. The bigrquery package provides three levels of abstraction on top of BigQuery:

  • The low-level API provides thin wrappers over the underlying REST API. In this version, all the low-level functions start with bq_, and mostly have the form bq_noun_verb(). This level of abstraction is most appropriate if you’re familiar with the REST API and you want to do something not supported in the higher-level APIs.

  • The DBI interface wraps the low-level API and makes working with BigQuery like working with any other database system. This is the most convenient layer if you want to execute SQL queries in BigQuery or upload smaller amounts (i.e. <100 MB) of data.

  • The dplyr interface lets you treat BigQuery tables as if they are in-memory data frames. This is the most convenient layer if you don’t want to write SQL, but instead want dbplyr to write it for you.

Install it with:

install.packages("bigrquery")

Four big changes in this version of bigrquery are described in detail below:

  • Support for repeated and nested fields.
  • Easier cross-dataset queries.
  • Greatly improved download speeds.
  • A new low-level API.

There are also many smaller improvements and bug fixes, as described in the release notes.

Nested and repeated fields

One of the neatest things about BigQuery is that it supports nested and repeated fields, which are also called structs (or records) and arrays. bigrquery now supports those types of fields, reading them into list-columns:

  • Repeated values become list-columns containing vectors.
  • Nested values become list-columns containing named lists.
  • Repeated nested values become list-columns containing data frames.

The following code illustrates the output for the two most important types: an array, and an array of structs:

library(bigrquery)

con <- DBI::dbConnect(bigquery(), project = bq_test_project())
sql <- "SELECT 
  [1, 2, 3] as list,
  [STRUCT(1 as a, 'a' as b), STRUCT(2, 'b'), STRUCT(3, 'c')] as df
"
out <- DBI::dbGetQuery(con, sql)
#> Auto-refreshing stale OAuth token.
out
#> # A tibble: 1 x 2
#>   list      df              
#>   <list>    <list>          
#> 1 <int [3]> <tibble [3 × 2]>

out$list[[1]]
#> [1] 1 2 3

out$df[[1]]
#> # A tibble: 3 x 2
#>       a b    
#>   <int> <chr>
#> 1     1 a    
#> 2     2 b    
#> 3     3 c

Note that results are now returned as tibbles, not data frames, because the base print method does not handle list columns well. If for some reason you do need a data frame, use as.data.frame() to convert back.
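
For example, continuing with the query above:

# Converting back to a base data frame keeps the list-columns;
# only the class (and hence the print method) changes.
out_df <- as.data.frame(out)
class(out_df)
#> [1] "data.frame"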

This work has made me think a bunch about list-columns and df-columns (which might be a better fit for non-repeated nested fields). The germination of these ideas is likely to have widespread (if subtle) influence throughout the tidyverse, with initial impacts most likely to be felt in tidyr.

Cross-dataset queries

In the previous version of bigrquery it was difficult to perform queries across datasets, because dbConnect() forced you to specify a dataset and only allowed you to reference tables within that dataset. Now the dataset is optional (the only required argument to DBI::dbConnect() is the project to bill), and both the DBI and dplyr interfaces accept qualified table names: either dataset.table or project.dataset.table.

The following example demonstrates this with my test project, which contains a basedata dataset that holds the mtcars table:

library(bigrquery)
con <- DBI::dbConnect(bigquery(), project = bq_test_project())

mtcars1 <- DBI::dbReadTable(con, "basedata.mtcars")
head(mtcars1)
#> # A tibble: 6 x 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  24.4    4.  147.   62.  3.69  3.19  20.0    1.    0.    4.    2.
#> 2  21.5    4.  120.   97.  3.70  2.46  20.0    1.    0.    3.    1.
#> 3  18.1    6.  225.  105.  2.76  3.46  20.2    1.    0.    3.    1.
#> 4  21.4    6.  258.  110.  3.08  3.22  19.4    1.    0.    3.    1.
#> 5  15.2    8.  276.  180.  3.07  3.78  18.0    0.    0.    3.    3.
#> 6  17.3    8.  276.  180.  3.07  3.73  17.6    0.    0.    3.    3.

mtcars2 <- dplyr::tbl(con, "basedata.mtcars")
head(mtcars2)
#> # Source:   lazy query [?? x 11]
#> # Database: BigQueryConnection
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  24.4    4.  147.   62.  3.69  3.19  20.0    1.    0.    4.    2.
#> 2  21.5    4.  120.   97.  3.70  2.46  20.0    1.    0.    3.    1.
#> 3  18.1    6.  225.  105.  2.76  3.46  20.2    1.    0.    3.    1.
#> 4  21.4    6.  258.  110.  3.08  3.22  19.4    1.    0.    3.    1.
#> 5  15.2    8.  276.  180.  3.07  3.78  18.0    0.    0.    3.    3.
#> 6  17.3    8.  276.  180.  3.07  3.73  17.6    0.    0.    3.    3.
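
Fully qualified project.dataset.table names work the same way, which makes it easy to reach tables in other projects. For example, to reference one of the public sample tables from the same connection (assuming your billing project can access the public datasets):

# A project.dataset.table reference to a table in another project
natality <- dplyr::tbl(con, "publicdata.samples.natality")
head(natality)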

Improved download speeds

The system for downloading data from BigQuery into R has been rewritten from the ground up to considerably improve performance:

  • By default, data is downloaded from BigQuery in pages of 10,000 rows. Previously, bigrquery downloaded then parsed each page. Now, bigrquery downloads all pages, then parses all pages. This means that you’ll now see two progress bars: one for downloading JSON from BigQuery and one for parsing that JSON into a data frame.

  • Because all pages are downloaded in a single pass, we can now download in parallel, using up to 6 simultaneous connections by default. This generally doesn’t result in a six-fold speed up, but should at least double download speed.

  • The parsing code has been rewritten in C++. This eliminates several expensive intermediate computations, and means that bigrquery no longer requires readr.

All up, I can now download the first million rows of publicdata.samples.natality in about a minute, about 8x faster than the previous version. This data frame takes up 170 MB of space in BigQuery and 140 MB of memory in R, so a minute to download doesn’t seem unreasonable. The bottleneck for loading BigQuery data is now parsing BigQuery’s JSON format, which is difficult to optimise further because I’m already using the fastest C++ JSON parser, RapidJSON. If this is still too slow (because you download a lot of data), see ?bq_table_download for an alternative approach.
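
As a rough sketch of that alternative, bq_table_download() operates on a bq_table object directly. The page_size and max_connections arguments shown here are assumptions based on the defaults described above (10,000-row pages, up to 6 connections), so check ?bq_table_download for the exact interface:

# Download the first million rows of the natality table, bypassing DBI
natality <- bq_table("publicdata", "samples", "natality")
df <- bq_table_download(
  natality,
  max_results = 1e6,    # stop after the first million rows
  page_size = 1e4,      # 10,000-row pages (the default described above)
  max_connections = 6   # parallel connections (the default described above)
)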

Low-level API

The low-level API has been completely overhauled to make it easier to use. The primary motivation was to make bigrquery development more enjoyable for me, but it should also be helpful to you when you need to go outside of the features provided by the higher-level DBI and dplyr interfaces.

  • Consistent naming scheme: All API functions now have the form bq_object_verb(), e.g. bq_table_create() or bq_dataset_delete().

  • S3 classes: bq_table(), bq_dataset(), bq_job(), bq_field() and bq_fields() constructors create S3 objects corresponding to important BigQuery objects. These are paired with as_ coercion functions and used throughout the new API.

  • Easier local testing: New bq_test_project() and bq_test_dataset() make it easier to run bigrquery tests locally. To run the tests yourself, you need to create a BigQuery project, and then follow the instructions in ?bq_test_project.

  • More efficient data transfer: The new API makes extensive use of the fields query parameter, ensuring that functions only download data that they actually use.

  • Tighter GCS connections: New bq_table_load() loads data from a Google Cloud Storage URI, pairing with bq_table_save() which saves data to a GCS URI.
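
To give a feel for how these pieces fit together, here is a hedged sketch. bq_test_dataset(), bq_table(), and bq_table_save() are mentioned above; the other bq_table_*() calls follow the same bq_object_verb() scheme but aren't covered in this post, and "gs://my-bucket" is a placeholder, so treat the whole block as illustrative:

library(bigrquery)

# Create a disposable dataset in the test project
ds <- bq_test_dataset()

# Build a bq_table object and upload a local data frame to it
tb <- bq_table(ds, "mtcars")
bq_table_upload(tb, mtcars)
bq_table_exists(tb)

# Export the table to Google Cloud Storage, then clean up
bq_table_save(tb, "gs://my-bucket/mtcars-*.csv")
bq_table_delete(tb)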

The old API has been soft-deprecated: it will continue to work, but no further development will occur (including bug fixes). It will be formally deprecated in the next version, and then removed in the version after that.