nanoparquet 0.4.0


  Gábor Csárdi

We’re thrilled to announce the release of nanoparquet 0.4.0. nanoparquet is an R package that reads and writes Parquet files.

You can install it from CRAN with:

install.packages("nanoparquet")

This blog post shows the most important new features of nanoparquet 0.4.0. You can see a full list of changes in the release notes.

Brand new read_parquet()

nanoparquet 0.4.0 comes with a completely rewritten Parquet reader. The new version has an architecture that is easier to embed into R, and it facilitates fantastic new features, like append_parquet() and the new col_select argument. (More to come!) The new reader is also much faster; see the “Benchmarks” section below.

Read a subset of columns

read_parquet() now has a new argument called col_select, which lets you read a subset of the columns from the Parquet file. Unlike with row-oriented file formats like CSV, this means that the reader never needs to touch the columns that are not needed. The time required for reading a subset of columns is independent of how many more columns the Parquet file might have!

You can use either column indices or column names to specify the columns to read. Here is an example with column names; an index-based version follows the output below.

This example uses the nycflights13::flights data set.
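
If you want to follow along, first write the data out to flights.parquet with write_parquet(), which is covered in more detail later in this post:

write_parquet(nycflights13::flights, "flights.parquet")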

read_parquet(
  "flights.parquet",
  col_select = c("dep_time", "arr_time", "carrier")
)
#> # A data frame: 336,776 × 3
#>    dep_time arr_time carrier
#>       <int>    <int> <chr>  
#>  1      517      830 UA     
#>  2      533      850 UA     
#>  3      542      923 AA     
#>  4      544     1004 B6     
#>  5      554      812 DL     
#>  6      554      740 UA     
#>  7      555      913 B6     
#>  8      557      709 EV     
#>  9      557      838 B6     
#> 10      558      753 AA     
#> # ℹ 336,766 more rows
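
Column indices work as well. As a sketch, assuming the column order shown by read_parquet_schema() below (dep_time, arr_time and carrier are the 4th, 7th and 10th columns of the file), this reads the same three columns:

read_parquet(
  "flights.parquet",
  col_select = c(4, 7, 10)
)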

Use read_parquet_schema() if you want to see the structure of the Parquet file first:

read_parquet_schema("flights.parquet")
#> # A data frame: 20 × 12
#>    file_name       name  r_type type  type_length repetition_type converted_type
#>    <chr>           <chr> <chr>  <chr>       <int> <chr>           <chr>         
#>  1 flights.parquet sche… NA     NA             NA NA              NA            
#>  2 flights.parquet year  integ… INT32          NA REQUIRED        INT_32        
#>  3 flights.parquet month integ… INT32          NA REQUIRED        INT_32        
#>  4 flights.parquet day   integ… INT32          NA REQUIRED        INT_32        
#>  5 flights.parquet dep_… integ… INT32          NA OPTIONAL        INT_32        
#>  6 flights.parquet sche… integ… INT32          NA REQUIRED        INT_32        
#>  7 flights.parquet dep_… double DOUB…          NA OPTIONAL        NA            
#>  8 flights.parquet arr_… integ… INT32          NA OPTIONAL        INT_32        
#>  9 flights.parquet sche… integ… INT32          NA REQUIRED        INT_32        
#> 10 flights.parquet arr_… double DOUB…          NA OPTIONAL        NA            
#> 11 flights.parquet carr… chara… BYTE…          NA REQUIRED        UTF8          
#> 12 flights.parquet flig… integ… INT32          NA REQUIRED        INT_32        
#> 13 flights.parquet tail… chara… BYTE…          NA OPTIONAL        UTF8          
#> 14 flights.parquet orig… chara… BYTE…          NA REQUIRED        UTF8          
#> 15 flights.parquet dest  chara… BYTE…          NA REQUIRED        UTF8          
#> 16 flights.parquet air_… double DOUB…          NA OPTIONAL        NA            
#> 17 flights.parquet dist… double DOUB…          NA REQUIRED        NA            
#> 18 flights.parquet hour  double DOUB…          NA REQUIRED        NA            
#> 19 flights.parquet minu… double DOUB…          NA REQUIRED        NA            
#> 20 flights.parquet time… POSIX… INT64          NA REQUIRED        TIMESTAMP_MIC…
#> # ℹ 5 more variables: logical_type <I<list>>, num_children <int>, scale <int>,
#> #   precision <int>, field_id <int>

The output of read_parquet_schema() also shows you the R type that nanoparquet will use for each column.

Appending to Parquet files

The new append_parquet() function makes it easy to append new data to a Parquet file, without first reading the whole file into memory. The schema of the file and the schema of the new data must match, of course. Let’s merge nycflights13::flights and nycflights23::flights:

file.copy("flights.parquet", "allflights.parquet", overwrite = TRUE)
#> [1] TRUE
append_parquet(nycflights23::flights, "allflights.parquet")

read_parquet_info() returns the most basic information about a Parquet file:

read_parquet_info("flights.parquet")
#> # A data frame: 1 × 7
#>   file_name       num_cols num_rows num_row_groups file_size parquet_version
#>   <chr>              <int>    <dbl>          <int>     <dbl>           <int>
#> 1 flights.parquet       19   336776              1   5687737               1
#> # ℹ 1 more variable: created_by <chr>
read_parquet_info("allflights.parquet")
#> # A data frame: 1 × 7
#>   file_name          num_cols num_rows num_row_groups file_size parquet_version
#>   <chr>                 <int>    <dbl>          <int>     <dbl>           <int>
#> 1 allflights.parquet       19   772128              1  13490997               1
#> # ℹ 1 more variable: created_by <chr>

Note that you should probably still create a backup copy of the original file when using append_parquet(). If the appending process is interrupted, e.g. by a power failure, you might end up with an incomplete and invalid Parquet file.
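
One simple pattern for this (a sketch, not a built-in nanoparquet feature) is to keep the backup copy around until the append has completed:

# keep a backup copy until the append succeeds
file.copy("allflights.parquet", "allflights.parquet.bak", overwrite = TRUE)
append_parquet(nycflights23::flights, "allflights.parquet")
# the append completed without error, so the backup can go
unlink("allflights.parquet.bak")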

Schemas and type conversions

In nanoparquet 0.4.0 write_parquet() takes a schema argument that can customize the R to Parquet type mappings. For example, by default write_parquet() writes an R character vector as the STRING Parquet type. If you’d like to write a certain character column as an ENUM type¹ instead, you’ll need to specify that in schema:

write_parquet(
  nycflights13::flights,
  "newflights.parquet",
  schema = parquet_schema(carrier = "ENUM")
)
read_parquet_schema("newflights.parquet")
#> # A data frame: 20 × 12
#>    file_name       name  r_type type  type_length repetition_type converted_type
#>    <chr>           <chr> <chr>  <chr>       <int> <chr>           <chr>         
#>  1 newflights.par… sche… NA     NA             NA NA              NA            
#>  2 newflights.par… year  integ… INT32          NA REQUIRED        INT_32        
#>  3 newflights.par… month integ… INT32          NA REQUIRED        INT_32        
#>  4 newflights.par… day   integ… INT32          NA REQUIRED        INT_32        
#>  5 newflights.par… dep_… integ… INT32          NA OPTIONAL        INT_32        
#>  6 newflights.par… sche… integ… INT32          NA REQUIRED        INT_32        
#>  7 newflights.par… dep_… double DOUB…          NA OPTIONAL        NA            
#>  8 newflights.par… arr_… integ… INT32          NA OPTIONAL        INT_32        
#>  9 newflights.par… sche… integ… INT32          NA REQUIRED        INT_32        
#> 10 newflights.par… arr_… double DOUB…          NA OPTIONAL        NA            
#> 11 newflights.par… carr… chara… BYTE…          NA REQUIRED        ENUM          
#> 12 newflights.par… flig… integ… INT32          NA REQUIRED        INT_32        
#> 13 newflights.par… tail… chara… BYTE…          NA OPTIONAL        UTF8          
#> 14 newflights.par… orig… chara… BYTE…          NA REQUIRED        UTF8          
#> 15 newflights.par… dest  chara… BYTE…          NA REQUIRED        UTF8          
#> 16 newflights.par… air_… double DOUB…          NA OPTIONAL        NA            
#> 17 newflights.par… dist… double DOUB…          NA REQUIRED        NA            
#> 18 newflights.par… hour  double DOUB…          NA REQUIRED        NA            
#> 19 newflights.par… minu… double DOUB…          NA REQUIRED        NA            
#> 20 newflights.par… time… POSIX… INT64          NA REQUIRED        TIMESTAMP_MIC…
#> # ℹ 5 more variables: logical_type <I<list>>, num_children <int>, scale <int>,
#> #   precision <int>, field_id <int>

Here we wrote the carrier column as ENUM, and left the other columns to use the default type mappings.

See the ?nanoparquet-types manual page for the possible type mappings (lots of new ones!) and also for the default ones.
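
You can also map several columns in one schema. As a sketch, assuming the INT64 mapping for integer columns listed on that manual page (flights64.parquet is just an illustrative file name):

write_parquet(
  nycflights13::flights,
  "flights64.parquet",
  schema = parquet_schema(
    carrier = "ENUM",  # character column as ENUM instead of STRING
    flight = "INT64"   # integer column as INT64 instead of INT32
  )
)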

Encodings

It is now also possible to customize the encoding of each column in write_parquet(), via the encoding argument. By default write_parquet() tries to choose a good encoding based on the type and the values of a column. For example, it checks a small sample for repeated values to decide whether it is worth using a dictionary encoding (RLE_DICTIONARY).

If write_parquet() gets it wrong, use the encoding argument to force an encoding. The following forces the PLAIN encoding for all columns. This encoding is very fast to write, but it creates a larger file. You can also specify different encodings for different columns; see the write_parquet() manual page and the sketch after the example below.

write_parquet(
  nycflights13::flights,
  "plainflights.parquet",
  encoding = "PLAIN"
)
file.size("flights.parquet")
#> [1] 5687737
file.size("plainflights.parquet")
#> [1] 11954574
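
To set encodings per column, pass a named character vector. A sketch, assuming the named form documented on the write_parquet() manual page (mixedflights.parquet is just an illustrative file name):

write_parquet(
  nycflights13::flights,
  "mixedflights.parquet",
  # force dictionary encoding for carrier, let nanoparquet pick the rest
  encoding = c(carrier = "RLE_DICTIONARY")
)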

See more about the implemented encodings and how the defaults are selected in the parquet-encodings manual page.

API changes

Some nanoparquet functions have new, better names in nanoparquet 0.4.0. In particular, all functions that read from a Parquet file have a read_parquet prefix now. The old functions still work, with a warning.
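
For example, parquet_info() is now called read_parquet_info():

# old name, still works, but warns
parquet_info("flights.parquet")
# new name
read_parquet_info("flights.parquet")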

Also, the parquet_schema() function is now for creating a new Parquet schema from scratch, and not for inferring a schema from a data frame (use infer_parquet_schema()) or for reading the schema from a Parquet file (use read_parquet_schema()). parquet_schema() falls back to the old behaviour when called with a file name, with a warning, so this is not a breaking change (yet), and old code still works.
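
The three schema-related functions, side by side (using the file from above):

# create a schema from scratch
parquet_schema(carrier = "ENUM")
# infer a schema from a data frame
infer_parquet_schema(nycflights13::flights)
# read the schema of a Parquet file
read_parquet_schema("flights.parquet")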

See the complete list of API changes in the Changelog.

Benchmarks

We are very excited about the performance of the new Parquet reader, and the Parquet writer was always quite speedy, so we ran a simple benchmark.

We compared nanoparquet to the Parquet implementations in Apache Arrow and DuckDB, and also to CSV readers and writers, on a real data set, with samples of 330 thousand, 6.7 million and 67.4 million rows (40 MB, 800 MB and 8 GB in memory). On these data sets nanoparquet is indeed very competitive with both Arrow and DuckDB.

You can see the full results on the website.

Other changes

For a list of other important changes in nanoparquet 0.4.0, please see the release notes.

Acknowledgements

A big thank you to all contributors: @alvarocombo, @D3SL, @gaborcsardi, and @RealTYPICAL.


1. A Parquet ENUM type is very similar to a factor in R.