We’re thrilled to announce the release of nanoparquet 0.4.0. nanoparquet is an R package that reads and writes Parquet files.
You can install it from CRAN with:
install.packages("nanoparquet")
This blog post shows the most important new features of nanoparquet 0.4.0. You can see the full list of changes in the release notes.
Brand new read_parquet()
nanoparquet 0.4.0 comes with a completely rewritten Parquet reader. The new version has an architecture that is easier to embed into R, and it enables fantastic new features, like append_parquet() and the new col_select argument. (More to come!) The new reader is also much faster; see the "Benchmarks" section below.
Read a subset of columns
read_parquet() now has a new argument called col_select that lets you read a subset of the columns from a Parquet file. Unlike with row-oriented file formats like CSV, the reader never needs to touch the columns that are not needed: the time required for reading a subset of columns is independent of how many other columns the Parquet file has!
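The examples below read from a file called flights.parquet, which is not created in this post itself. One way to create it from nycflights13::flights (assuming the nanoparquet and nycflights13 packages are installed) is:

library(nanoparquet)
# write the data set out once, so there is a file to read back
write_parquet(nycflights13::flights, "flights.parquet")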
You can either use column indices or column names to specify the columns to read. Here is an example.
This is the nycflights13::flights data set:
read_parquet(
"flights.parquet",
col_select = c("dep_time", "arr_time", "carrier")
)
#> # A data frame: 336,776 × 3
#> dep_time arr_time carrier
#> <int> <int> <chr>
#> 1 517 830 UA
#> 2 533 850 UA
#> 3 542 923 AA
#> 4 544 1004 B6
#> 5 554 812 DL
#> 6 554 740 UA
#> 7 555 913 B6
#> 8 557 709 EV
#> 9 557 838 B6
#> 10 558 753 AA
#> # ℹ 336,766 more rows
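You can select the same columns by position instead of name. As you can check in the schema output below, dep_time, arr_time and carrier are columns 4, 7 and 10 of this file:

# same result as above, using column indices
read_parquet(
  "flights.parquet",
  col_select = c(4, 7, 10)
)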
Use read_parquet_schema() if you want to see the structure of the Parquet file first:
read_parquet_schema("flights.parquet")
#> # A data frame: 20 × 12
#> file_name name r_type type type_length repetition_type converted_type
#> <chr> <chr> <chr> <chr> <int> <chr> <chr>
#> 1 flights.parquet sche… NA NA NA NA NA
#> 2 flights.parquet year integ… INT32 NA REQUIRED INT_32
#> 3 flights.parquet month integ… INT32 NA REQUIRED INT_32
#> 4 flights.parquet day integ… INT32 NA REQUIRED INT_32
#> 5 flights.parquet dep_… integ… INT32 NA OPTIONAL INT_32
#> 6 flights.parquet sche… integ… INT32 NA REQUIRED INT_32
#> 7 flights.parquet dep_… double DOUB… NA OPTIONAL NA
#> 8 flights.parquet arr_… integ… INT32 NA OPTIONAL INT_32
#> 9 flights.parquet sche… integ… INT32 NA REQUIRED INT_32
#> 10 flights.parquet arr_… double DOUB… NA OPTIONAL NA
#> 11 flights.parquet carr… chara… BYTE… NA REQUIRED UTF8
#> 12 flights.parquet flig… integ… INT32 NA REQUIRED INT_32
#> 13 flights.parquet tail… chara… BYTE… NA OPTIONAL UTF8
#> 14 flights.parquet orig… chara… BYTE… NA REQUIRED UTF8
#> 15 flights.parquet dest chara… BYTE… NA REQUIRED UTF8
#> 16 flights.parquet air_… double DOUB… NA OPTIONAL NA
#> 17 flights.parquet dist… double DOUB… NA REQUIRED NA
#> 18 flights.parquet hour double DOUB… NA REQUIRED NA
#> 19 flights.parquet minu… double DOUB… NA REQUIRED NA
#> 20 flights.parquet time… POSIX… INT64 NA REQUIRED TIMESTAMP_MIC…
#> # ℹ 5 more variables: logical_type <I<list>>, num_children <int>, scale <int>,
#> # precision <int>, field_id <int>
The output of read_parquet_schema() also shows you the R type that nanoparquet will use for each column.
Appending to Parquet files
The new append_parquet() function makes it easy to append new data to a Parquet file, without first reading the whole file into memory. Of course, the schema of the file and the schema of the new data must match. Let's merge nycflights13::flights and nycflights23::flights:
file.copy("flights.parquet", "allflights.parquet", overwrite = TRUE)
#> [1] TRUE
append_parquet(nycflights23::flights, "allflights.parquet")
read_parquet_info() returns the most basic information about a Parquet file:
read_parquet_info("flights.parquet")
#> # A data frame: 1 × 7
#> file_name num_cols num_rows num_row_groups file_size parquet_version
#> <chr> <int> <dbl> <int> <dbl> <int>
#> 1 flights.parquet 19 336776 1 5687737 1
#> # ℹ 1 more variable: created_by <chr>
read_parquet_info("allflights.parquet")
#> # A data frame: 1 × 7
#> file_name num_cols num_rows num_row_groups file_size parquet_version
#> <chr> <int> <dbl> <int> <dbl> <int>
#> 1 allflights.parquet 19 772128 1 13490997 1
#> # ℹ 1 more variable: created_by <chr>
Note that you should probably still create a backup copy of the original file when using append_parquet(). If the appending process is interrupted, e.g. by a power failure, you might end up with an incomplete and invalid Parquet file.
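One way to do this is a small helper that backs up the file before appending and restores it if the append fails with an R error. (safe_append() below is a hypothetical sketch, not part of nanoparquet; it cannot protect against a power failure, in which case you still have to restore from the backup copy manually.)

safe_append <- function(df, path) {
  backup <- paste0(path, ".bak")
  # keep a copy of the original file around while appending
  file.copy(path, backup, overwrite = TRUE)
  tryCatch(
    append_parquet(df, path),
    error = function(e) {
      # appending failed, restore the original file
      file.copy(backup, path, overwrite = TRUE)
      stop(e)
    }
  )
  unlink(backup)
}
safe_append(nycflights23::flights, "allflights.parquet")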
Schemas and type conversions
In nanoparquet 0.4.0, write_parquet() takes a schema argument that can customize the R to Parquet type mappings. For example, by default write_parquet() writes an R character vector as the STRING Parquet type. If you'd like to write a certain character column as an ENUM type¹ instead, you'll need to specify that in schema:
write_parquet(
nycflights13::flights,
"newflights.parquet",
schema = parquet_schema(carrier = "ENUM")
)
read_parquet_schema("newflights.parquet")
#> # A data frame: 20 × 12
#> file_name name r_type type type_length repetition_type converted_type
#> <chr> <chr> <chr> <chr> <int> <chr> <chr>
#> 1 newflights.par… sche… NA NA NA NA NA
#> 2 newflights.par… year integ… INT32 NA REQUIRED INT_32
#> 3 newflights.par… month integ… INT32 NA REQUIRED INT_32
#> 4 newflights.par… day integ… INT32 NA REQUIRED INT_32
#> 5 newflights.par… dep_… integ… INT32 NA OPTIONAL INT_32
#> 6 newflights.par… sche… integ… INT32 NA REQUIRED INT_32
#> 7 newflights.par… dep_… double DOUB… NA OPTIONAL NA
#> 8 newflights.par… arr_… integ… INT32 NA OPTIONAL INT_32
#> 9 newflights.par… sche… integ… INT32 NA REQUIRED INT_32
#> 10 newflights.par… arr_… double DOUB… NA OPTIONAL NA
#> 11 newflights.par… carr… chara… BYTE… NA REQUIRED ENUM
#> 12 newflights.par… flig… integ… INT32 NA REQUIRED INT_32
#> 13 newflights.par… tail… chara… BYTE… NA OPTIONAL UTF8
#> 14 newflights.par… orig… chara… BYTE… NA REQUIRED UTF8
#> 15 newflights.par… dest chara… BYTE… NA REQUIRED UTF8
#> 16 newflights.par… air_… double DOUB… NA OPTIONAL NA
#> 17 newflights.par… dist… double DOUB… NA REQUIRED NA
#> 18 newflights.par… hour double DOUB… NA REQUIRED NA
#> 19 newflights.par… minu… double DOUB… NA REQUIRED NA
#> 20 newflights.par… time… POSIX… INT64 NA REQUIRED TIMESTAMP_MIC…
#> # ℹ 5 more variables: logical_type <I<list>>, num_children <int>, scale <int>,
#> # precision <int>, field_id <int>
Here we wrote the carrier column as ENUM, and left the other columns to use the default type mappings.
See the ?nanoparquet-types manual page for the possible type mappings (lots of new ones!) and also for the defaults.
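schema can map several columns at once. For example, to write all three character columns that encode carriers and airports as ENUM (enumflights.parquet is a hypothetical file name):

write_parquet(
  nycflights13::flights,
  "enumflights.parquet",
  schema = parquet_schema(
    carrier = "ENUM",
    origin = "ENUM",
    dest = "ENUM"
  )
)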
Encodings
It is now also possible to customize the encoding of each column in write_parquet(), via the encoding argument. By default, write_parquet() tries to choose a good encoding based on the type and the values of a column. For example, it checks a small sample for repeated values to decide whether it is worth using a dictionary encoding (RLE_DICTIONARY).
If write_parquet() gets it wrong, use the encoding argument to force an encoding. The following forces the PLAIN encoding for all columns. This encoding is very fast to write, but it creates a larger file. You can also specify different encodings for different columns; see the write_parquet() manual page.
write_parquet(
nycflights13::flights,
"plainflights.parquet",
encoding = "PLAIN"
)
file.size("flights.parquet")
#> [1] 5687737
file.size("plainflights.parquet")
#> [1] 11954574
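As a sketch of per-column encodings: the write_parquet() manual page documents a named form of the encoding argument, where names select columns; here we assume that an unnamed element applies to the remaining columns (check that page for the exact semantics):

write_parquet(
  nycflights13::flights,
  "mixedflights.parquet",
  # dictionary-encode carrier, write everything else as PLAIN
  encoding = c(carrier = "RLE_DICTIONARY", "PLAIN")
)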
See more about the implemented encodings, and how the defaults are selected, in the parquet-encodings manual page.
API changes
Some nanoparquet functions have new, better names in nanoparquet 0.4.0. In particular, all functions that read from a Parquet file now have a read_parquet prefix. The old functions still work, with a warning.
Also, the parquet_schema() function is now for creating a new Parquet schema from scratch, and not for inferring a schema from a data frame (use infer_parquet_schema()) or for reading the schema from a Parquet file (use read_parquet_schema()). parquet_schema() falls back to the old behaviour when called with a file name, with a warning, so this is not a breaking change (yet), and old code still works.
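To summarize, the three schema-related functions are now (flights.parquet as above):

# create a Parquet schema from scratch
parquet_schema(carrier = "ENUM")
# infer the schema that write_parquet() would use for a data frame
infer_parquet_schema(nycflights13::flights)
# read the schema of an existing Parquet file
read_parquet_schema("flights.parquet")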
See the complete list of API changes in the Changelog.
Benchmarks
We are very excited about the performance of the new Parquet reader, and the Parquet writer was always quite speedy, so we ran a simple benchmark.
We compared nanoparquet to the Parquet implementations in Apache Arrow and DuckDB, and also to CSV readers and writers, on a real data set, with samples of 330k, 6.7 million and 67.4 million rows (40 MB, 800 MB and 8 GB in memory). For these data, nanoparquet is indeed very competitive with both Arrow and DuckDB.
You can see the full results on the website.
Other changes
Other important changes in nanoparquet 0.4.0 include:
- write_parquet() can now write multiple row groups. By default it puts at most 10 million rows into a single row group. (This is subject to change.) See the parquet_options() manual page (https://nanoparquet.r-lib.org/references/parquet_options.html) on how to change the default, and the sketch after this list.
- write_parquet() now writes minimum and maximum statistics (by default) for most Parquet types. See the parquet_options() manual page on how to turn this off, which will probably make the writer faster.
- write_parquet() can now write version 2 data pages. The default is still version 1, but it might change in the future.
- New compression_level option to select the compression level manually.
- read_parquet() can now read from an R connection.
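Here is a sketch that combines some of these options. The option and argument names below follow the 0.4.0 release notes; treat them as assumptions and check the parquet_options() manual page for the exact interface:

write_parquet(
  nycflights13::flights,
  "tunedflights.parquet",
  compression = "gzip",
  options = parquet_options(
    compression_level = 9,           # maximum gzip compression
    num_rows_per_row_group = 100000, # smaller row groups
    write_data_page_version = 2,     # version 2 data pages
    write_minmax_values = FALSE      # skip min/max statistics
  )
)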
Acknowledgements
A big thank you to all contributors of this release: @alvarocombo, @D3SL, @gaborcsardi, and @RealTYPICAL.
¹ A Parquet ENUM type is very similar to a factor in R.