nanoparquet 0.3.0


  Gábor Csárdi

We’re extremely pleased to announce the release of nanoparquet 0.3.0. nanoparquet is a new R package that reads Parquet files into data frames, and writes data frames to Parquet files.

You can install it from CRAN with:

install.packages("nanoparquet")

This blog post will cover the features and limitations of nanoparquet, and also our future plans.

What is Parquet?

Parquet is a file format for storing data on disk. It is specifically designed for large data sets, read-heavy workloads and data analysis. The most important features of Parquet are:

  • Columnar. Data is stored column-wise, so whole columns (or large chunks of columns) are easy to read quickly. Columnar storage allows better compression, fast operations on a subset of columns, and easy ways of removing columns or adding new columns to a data file.

  • Binary. A Parquet file is not a text file. Each Parquet data type is stored in a well-defined binary storage format, leaving no ambiguity about how fields are parsed.

  • Rich types. Parquet supports a small set of low level data types with well specified storage formats and encodings. On top of the low level types, it implements several higher level logical types, like UTF-8 strings, timestamps, JSON strings, ENUM types (factors), etc.

  • Well supported. At this point Parquet is well supported across modern languages like R, Python, Rust, Java, Go, etc. In particular, Apache Arrow handles Parquet files very well, and has bindings to many languages. DuckDB is a very portable tool that reads and writes Parquet files, or even opens a set of Parquet files as a database.

  • Performant. Parquet columns may use various encodings and compression to ensure that the data files are kept as small as possible. When running an analytical query on a subset of the data, the Parquet format makes it easy to skip the columns and/or rows that are irrelevant.

  • Concurrency built in. A Parquet file can be divided into row groups. Parquet readers can read, uncompress and decode row groups in parallel. Parquet writers can encode and compress row groups in parallel. Even a single column chunk may be divided into multiple pages, which can be (un)compressed, encoded and decoded in parallel.

  • Missing values. Support for missing values is built into the Parquet format.

Why did we create nanoparquet?

Although Parquet is well supported by modern languages, the complexity of the Parquet format often outweighs its benefits for smaller data sets. The tools that support Parquet are typically geared towards larger, out-of-memory data sets, so there is a perception that Parquet is only for big data. These tools often take a long time to compile or install, and can feel too heavy for in-memory data analysis.

With nanoparquet, we wanted to have a smaller tool that has no dependencies and is easy to install. Our goal is to facilitate the adoption of Parquet for smaller data sets, especially for teams that share data between multiple environments, e.g. R, Python, Java, etc.

nanoparquet Features

These are some of the nanoparquet features that we are most excited about.

  • Lightweight. nanoparquet has no package or system dependencies other than a C++11 compiler. It compiles in about 30 seconds into an R package that is less than a megabyte in size.

  • Reads many Parquet files. nanoparquet::read_parquet() supports reading most Parquet files. In particular, it supports all Parquet encodings and, at the time of writing, three compression codecs: Snappy, Gzip and Zstd. Make sure you read “Limitations” below.

  • Writes many R data types. nanoparquet::write_parquet() supports writing most R data frames. In particular, missing values are handled properly, factor columns are kept as factors, and temporal types are encoded correctly. Make sure you read “Limitations” below.

  • Type mappings. nanoparquet has a well defined set of type mapping rules. Use the parquet_column_types() function to see how read_parquet() and write_parquet() map Parquet and R types for a file or a data frame.

  • Metadata queries. nanoparquet has a number of functions that allow you to query the metadata and schema without reading in the full dataset; see the short list below.
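
These are the metadata helpers, all of which are demonstrated in the examples below. Here file stands for the path to a Parquet file; parquet_column_types() also accepts a data frame.

parquet_info(file)          # short, one row summary of a Parquet file
parquet_schema(file)        # the full schema, including internal columns
parquet_metadata(file)      # all metadata: schema, row groups, column chunks
parquet_column_types(file)  # how Parquet and R types are mapped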

Examples

Reading a Parquet file

The nanoparquet R package contains an example Parquet file. We are going to use it to demonstrate how the package works.

If the pillar package is loaded, then nanoparquet data frames are pretty-printed.

library(nanoparquet)
library(pillar)
udf <- system.file("extdata/userdata1.parquet", package = "nanoparquet")

Before actually reading the file, let’s look up some metadata about it, and also how its columns will be mapped to R types:

parquet_info(udf)
#> # A data frame: 1 × 7
#>   file_name           num_cols num_rows num_row_groups file_size parquet_version
#>   <chr>                  <int>    <dbl>          <int>     <dbl>           <int>
#> 1 /Users/gaborcsardi…       13     1000              1     73217               1
#> # ℹ 1 more variable: created_by <chr>
parquet_column_types(udf)
#> # A data frame: 13 × 6
#>    file_name        name  type  r_type repetition_type logical_type             
#>  * <chr>            <chr> <chr> <chr>  <chr>           <I<list>>                
#>  1 /Users/gaborcsa… regi… INT64 POSIX… REQUIRED        <TIMESTAMP(TRUE, micros)>
#>  2 /Users/gaborcsa… id    INT32 integ… REQUIRED        <INT(32, TRUE)>          
#>  3 /Users/gaborcsa… firs… BYTE… chara… OPTIONAL        <STRING>                 
#>  4 /Users/gaborcsa… last… BYTE… chara… REQUIRED        <STRING>                 
#>  5 /Users/gaborcsa… email BYTE… chara… OPTIONAL        <STRING>                 
#>  6 /Users/gaborcsa… gend… BYTE… factor OPTIONAL        <STRING>                 
#>  7 /Users/gaborcsa… ip_a… BYTE… chara… REQUIRED        <STRING>                 
#>  8 /Users/gaborcsa… cc    BYTE… chara… OPTIONAL        <STRING>                 
#>  9 /Users/gaborcsa… coun… BYTE… chara… REQUIRED        <STRING>                 
#> 10 /Users/gaborcsa… birt… INT32 Date   OPTIONAL        <DATE>                   
#> 11 /Users/gaborcsa… sala… DOUB… double OPTIONAL        <NULL>                   
#> 12 /Users/gaborcsa… title BYTE… chara… OPTIONAL        <STRING>                 
#> 13 /Users/gaborcsa… comm… BYTE… chara… OPTIONAL        <STRING>

For every Parquet column we see its low level Parquet data type in the type column, e.g. INT64 or BYTE_ARRAY. r_type is the R type that read_parquet() will create for that column. If repetition_type is REQUIRED, then that column cannot contain missing values; OPTIONAL columns may have missing values. logical_type is the higher level Parquet data type.

E.g. the first column is a UTC timestamp (UTC because of the TRUE), in microseconds. It is stored as a 64 bit integer in the file, and read_parquet() will convert it to a POSIXct object.

To actually read the file into a data frame, call read_parquet():

ud1 <- read_parquet(udf)
ud1
#> # A data frame: 1,000 × 13
#>    registration           id first_name last_name email  gender ip_address cc   
#>    <dttm>              <int> <chr>      <chr>     <chr>  <fct>  <chr>      <chr>
#>  1 2016-02-03 07:55:29     1 Amanda     Jordan    ajord… Female 1.197.201… 6759…
#>  2 2016-02-03 17:04:03     2 Albert     Freeman   afree… Male   218.111.1… NA   
#>  3 2016-02-03 01:09:31     3 Evelyn     Morgan    emorg… Female 7.161.136… 6767…
#>  4 2016-02-03 00:36:21     4 Denise     Riley     drile… Female 140.35.10… 3576…
#>  5 2016-02-03 05:05:31     5 Carlos     Burns     cburn… NA     169.113.2… 5602…
#>  6 2016-02-03 07:22:34     6 Kathryn    White     kwhit… Female 195.131.8… 3583…
#>  7 2016-02-03 08:33:08     7 Samuel     Holmes    sholm… Male   232.234.8… 3582…
#>  8 2016-02-03 06:47:06     8 Harry      Howell    hhowe… Male   91.235.51… NA   
#>  9 2016-02-03 03:52:53     9 Jose       Foster    jfost… Male   132.31.53… NA   
#> 10 2016-02-03 18:29:47    10 Emily      Stewart   estew… Female 143.28.25… 3574…
#> # ℹ 990 more rows
#> # ℹ 5 more variables: country <chr>, birthdate <date>, salary <dbl>,
#> #   title <chr>, comments <chr>
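
As a quick check of the type mapping above, we can confirm that the registration column was indeed read as a timestamp. This is a small sketch; per the TIMESTAMP(TRUE, micros) logical type we expect a POSIXct vector, displayed in UTC.

# registration should be a POSIXct (date-time) vector, in UTC
class(ud1$registration)
attr(ud1$registration, "tzone")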

Writing a Parquet file

To show write_parquet(), we’ll use the flights data in the nycflights13 package:

library(nycflights13)
flights
#> # A tibble: 336,776 × 19
#>     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#>  1  2013     1     1      517            515         2      830            819
#>  2  2013     1     1      533            529         4      850            830
#>  3  2013     1     1      542            540         2      923            850
#>  4  2013     1     1      544            545        -1     1004           1022
#>  5  2013     1     1      554            600        -6      812            837
#>  6  2013     1     1      554            558        -4      740            728
#>  7  2013     1     1      555            600        -5      913            854
#>  8  2013     1     1      557            600        -3      709            723
#>  9  2013     1     1      557            600        -3      838            846
#> 10  2013     1     1      558            600        -2      753            745
#> # ℹ 336,766 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> #   hour <dbl>, minute <dbl>, time_hour <dttm>

First we check how columns of flights will be mapped to Parquet types:

parquet_column_types(flights)
#> # A data frame: 19 × 6
#>    file_name name         type  r_type repetition_type logical_type             
#>    <chr>     <chr>        <chr> <chr>  <chr>           <I<list>>                
#>  1 NA        year         INT32 integ… REQUIRED        <INT(32, TRUE)>          
#>  2 NA        month        INT32 integ… REQUIRED        <INT(32, TRUE)>          
#>  3 NA        day          INT32 integ… REQUIRED        <INT(32, TRUE)>          
#>  4 NA        dep_time     INT32 integ… OPTIONAL        <INT(32, TRUE)>          
#>  5 NA        sched_dep_t… INT32 integ… REQUIRED        <INT(32, TRUE)>          
#>  6 NA        dep_delay    DOUB… double OPTIONAL        <NULL>                   
#>  7 NA        arr_time     INT32 integ… OPTIONAL        <INT(32, TRUE)>          
#>  8 NA        sched_arr_t… INT32 integ… REQUIRED        <INT(32, TRUE)>          
#>  9 NA        arr_delay    DOUB… double OPTIONAL        <NULL>                   
#> 10 NA        carrier      BYTE… chara… REQUIRED        <STRING>                 
#> 11 NA        flight       INT32 integ… REQUIRED        <INT(32, TRUE)>          
#> 12 NA        tailnum      BYTE… chara… OPTIONAL        <STRING>                 
#> 13 NA        origin       BYTE… chara… REQUIRED        <STRING>                 
#> 14 NA        dest         BYTE… chara… REQUIRED        <STRING>                 
#> 15 NA        air_time     DOUB… double OPTIONAL        <NULL>                   
#> 16 NA        distance     DOUB… double REQUIRED        <NULL>                   
#> 17 NA        hour         DOUB… double REQUIRED        <NULL>                   
#> 18 NA        minute       DOUB… double REQUIRED        <NULL>                   
#> 19 NA        time_hour    INT64 POSIX… REQUIRED        <TIMESTAMP(TRUE, micros)>

This looks fine, so we go ahead and write out the file. By default it will be Snappy-compressed, and many columns will be dictionary encoded.

write_parquet(flights, "flights.parquet")
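
As a quick sanity check, here is a sketch of reading the file back and comparing its shape to the original data frame:

# Round trip: read back the file we just wrote and compare its shape
flights2 <- read_parquet("flights.parquet")
identical(dim(flights2), dim(flights))
identical(names(flights2), names(flights))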

Parquet metadata

Use parquet_schema() to see the schema of a Parquet file. The schema also includes “internal” Parquet columns: every Parquet file is a tree, where columns may be children of an “internal” column. nanoparquet currently only supports flat files, which consist of a single internal root column; all other columns are leaf columns and children of the root:

parquet_schema("flights.parquet")
#> # A data frame: 20 × 11
#>    file_name       name         type  type_length repetition_type converted_type
#>    <chr>           <chr>        <chr>       <int> <chr>           <chr>         
#>  1 flights.parquet schema       NA             NA NA              NA            
#>  2 flights.parquet year         INT32          NA REQUIRED        INT_32        
#>  3 flights.parquet month        INT32          NA REQUIRED        INT_32        
#>  4 flights.parquet day          INT32          NA REQUIRED        INT_32        
#>  5 flights.parquet dep_time     INT32          NA OPTIONAL        INT_32        
#>  6 flights.parquet sched_dep_t… INT32          NA REQUIRED        INT_32        
#>  7 flights.parquet dep_delay    DOUB…          NA OPTIONAL        NA            
#>  8 flights.parquet arr_time     INT32          NA OPTIONAL        INT_32        
#>  9 flights.parquet sched_arr_t… INT32          NA REQUIRED        INT_32        
#> 10 flights.parquet arr_delay    DOUB…          NA OPTIONAL        NA            
#> 11 flights.parquet carrier      BYTE…          NA REQUIRED        UTF8          
#> 12 flights.parquet flight       INT32          NA REQUIRED        INT_32        
#> 13 flights.parquet tailnum      BYTE…          NA OPTIONAL        UTF8          
#> 14 flights.parquet origin       BYTE…          NA REQUIRED        UTF8          
#> 15 flights.parquet dest         BYTE…          NA REQUIRED        UTF8          
#> 16 flights.parquet air_time     DOUB…          NA OPTIONAL        NA            
#> 17 flights.parquet distance     DOUB…          NA REQUIRED        NA            
#> 18 flights.parquet hour         DOUB…          NA REQUIRED        NA            
#> 19 flights.parquet minute       DOUB…          NA REQUIRED        NA            
#> 20 flights.parquet time_hour    INT64          NA REQUIRED        TIMESTAMP_MIC…
#> # ℹ 5 more variables: logical_type <I<list>>, num_children <int>, scale <int>,
#> #   precision <int>, field_id <int>

To see more information about a Parquet file, use parquet_metadata():

parquet_metadata("flights.parquet")
#> $file_meta_data
#> # A data frame: 1 × 5
#>   file_name       version num_rows key_value_metadata created_by                
#>   <chr>             <int>    <dbl> <I<list>>          <chr>                     
#> 1 flights.parquet       1   336776 <tbl [1 × 2]>      https://github.com/gaborc…
#> 
#> $schema
#> # A data frame: 20 × 11
#>    file_name       name         type  type_length repetition_type converted_type
#>    <chr>           <chr>        <chr>       <int> <chr>           <chr>         
#>  1 flights.parquet schema       NA             NA NA              NA            
#>  2 flights.parquet year         INT32          NA REQUIRED        INT_32        
#>  3 flights.parquet month        INT32          NA REQUIRED        INT_32        
#>  4 flights.parquet day          INT32          NA REQUIRED        INT_32        
#>  5 flights.parquet dep_time     INT32          NA OPTIONAL        INT_32        
#>  6 flights.parquet sched_dep_t… INT32          NA REQUIRED        INT_32        
#>  7 flights.parquet dep_delay    DOUB…          NA OPTIONAL        NA            
#>  8 flights.parquet arr_time     INT32          NA OPTIONAL        INT_32        
#>  9 flights.parquet sched_arr_t… INT32          NA REQUIRED        INT_32        
#> 10 flights.parquet arr_delay    DOUB…          NA OPTIONAL        NA            
#> 11 flights.parquet carrier      BYTE…          NA REQUIRED        UTF8          
#> 12 flights.parquet flight       INT32          NA REQUIRED        INT_32        
#> 13 flights.parquet tailnum      BYTE…          NA OPTIONAL        UTF8          
#> 14 flights.parquet origin       BYTE…          NA REQUIRED        UTF8          
#> 15 flights.parquet dest         BYTE…          NA REQUIRED        UTF8          
#> 16 flights.parquet air_time     DOUB…          NA OPTIONAL        NA            
#> 17 flights.parquet distance     DOUB…          NA REQUIRED        NA            
#> 18 flights.parquet hour         DOUB…          NA REQUIRED        NA            
#> 19 flights.parquet minute       DOUB…          NA REQUIRED        NA            
#> 20 flights.parquet time_hour    INT64          NA REQUIRED        TIMESTAMP_MIC…
#> # ℹ 5 more variables: logical_type <I<list>>, num_children <int>, scale <int>,
#> #   precision <int>, field_id <int>
#> 
#> $row_groups
#> # A data frame: 1 × 7
#>   file_name        id total_byte_size num_rows file_offset total_compressed_size
#>   <chr>         <int>           <dbl>    <dbl>       <dbl>                 <dbl>
#> 1 flights.parq…     0         5732430   336776          NA                    NA
#> # ℹ 1 more variable: ordinal <int>
#> 
#> $column_chunks
#> # A data frame: 19 × 19
#>    file_name       row_group column file_path file_offset offset_index_offset
#>    <chr>               <int>  <int> <chr>           <dbl>               <dbl>
#>  1 flights.parquet         0      0 NA                 23                  NA
#>  2 flights.parquet         0      1 NA                111                  NA
#>  3 flights.parquet         0      2 NA                323                  NA
#>  4 flights.parquet         0      3 NA               6738                  NA
#>  5 flights.parquet         0      4 NA             468008                  NA
#>  6 flights.parquet         0      5 NA             893557                  NA
#>  7 flights.parquet         0      6 NA            1312660                  NA
#>  8 flights.parquet         0      7 NA            1771896                  NA
#>  9 flights.parquet         0      8 NA            2237931                  NA
#> 10 flights.parquet         0      9 NA            2653250                  NA
#> 11 flights.parquet         0     10 NA            2847249                  NA
#> 12 flights.parquet         0     11 NA            3374563                  NA
#> 13 flights.parquet         0     12 NA            3877832                  NA
#> 14 flights.parquet         0     13 NA            3966418                  NA
#> 15 flights.parquet         0     14 NA            4264662                  NA
#> 16 flights.parquet         0     15 NA            4639410                  NA
#> 17 flights.parquet         0     16 NA            4976781                  NA
#> 18 flights.parquet         0     17 NA            5120936                  NA
#> 19 flights.parquet         0     18 NA            5427022                  NA
#> # ℹ 13 more variables: offset_index_length <int>, column_index_offset <dbl>,
#> #   column_index_length <int>, type <chr>, encodings <I<list>>,
#> #   path_in_schema <I<list>>, codec <chr>, num_values <dbl>,
#> #   total_uncompressed_size <dbl>, total_compressed_size <dbl>,
#> #   data_page_offset <dbl>, index_page_offset <dbl>,
#> #   dictionary_page_offset <dbl>

The output includes the schema, as above, but also data about the row groups (write_parquet() currently always writes a single row group) and the column chunks. There is one column chunk per column in each row group.

The column chunk information also tells you whether a column chunk is dictionary encoded, which encodings it uses, its size, etc.

cc <- parquet_metadata("flights.parquet")$column_chunks
cc[, c("column", "encodings", "dictionary_page_offset")]
#> # A data frame: 19 × 3
#>    column encodings dictionary_page_offset
#>     <int> <I<list>>                  <dbl>
#>  1      0 <chr [3]>                      4
#>  2      1 <chr [3]>                     48
#>  3      2 <chr [3]>                    181
#>  4      3 <chr [3]>                   1445
#>  5      4 <chr [3]>                 463903
#>  6      5 <chr [3]>                 891412
#>  7      6 <chr [3]>                1306995
#>  8      7 <chr [3]>                1767223
#>  9      8 <chr [3]>                2235594
#> 10      9 <chr [3]>                2653154
#> 11     10 <chr [3]>                2831850
#> 12     11 <chr [3]>                3352496
#> 13     12 <chr [3]>                3877796
#> 14     13 <chr [3]>                3965856
#> 15     14 <chr [3]>                4262597
#> 16     15 <chr [3]>                4638461
#> 17     16 <chr [3]>                4976675
#> 18     17 <chr [3]>                5120660
#> 19     18 <chr [3]>                5379476
cc[["encodings"]][1:3]
#> [[1]]
#> [1] "PLAIN"          "RLE"            "RLE_DICTIONARY"
#> 
#> [[2]]
#> [1] "PLAIN"          "RLE"            "RLE_DICTIONARY"
#> 
#> [[3]]
#> [1] "PLAIN"          "RLE"            "RLE_DICTIONARY"

Limitations

nanoparquet 0.3.0 has a number of limitations.

  • Only flat tables. read_parquet() can only read flat tables, i.e. Parquet files without nested columns. (Technically, all Parquet files are nested, and nanoparquet supports exactly one level of nesting: a single meta column that contains all other columns.) Similarly, write_parquet() will not write list columns; see the sketch after this list for one possible workaround.

  • Unsupported Parquet types. read_parquet() currently reads some Parquet types, e.g. FLOAT16 and INTERVAL, as raw vectors in a list column. See the manual for details.

  • No encryption. Encrypted Parquet files are not supported.

  • Missing compression codecs. LZO, BROTLI and LZ4 compression is not yet supported.

  • No statistics. nanoparquet does not read or write statistics, e.g. minimum and maximum values, from or to Parquet files.

  • No checksums. nanoparquet does not check or write checksums currently.

  • No Bloom filters. nanoparquet does not currently support reading or writing Bloom filters from or to Parquet files.

  • May be slow for large files. Being single-threaded and not fully optimized, nanoparquet is probably not well suited for large data sets. It should be fine for a couple of gigabytes of data, and may be fine as long as all the data fits into memory comfortably.

  • Single row group. write_parquet() always creates a single row group, which is not optimal for large files.

  • Automatic encoding. It is currently not possible to choose encodings in write_parquet() manually.
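
If your data frame has list columns, one possible workaround, sketched here with a hypothetical data frame df, is to drop them before writing:

# write_parquet() does not support nested (list) columns, so drop them first.
# `df` is a hypothetical data frame that may contain list columns.
is_list_col <- vapply(df, is.list, logical(1))
write_parquet(df[!is_list_col], "flat.parquet")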

We are planning to address these limitations, while keeping nanoparquet as lean as possible. In particular, if you find a Parquet file that nanoparquet cannot read, please report an issue in our issue tracker!

Other tools for Parquet files

If you run into some of these limitations, chances are you are dealing with a larger data set, and you will probably benefit from using tools geared towards larger Parquet files. Luckily, you have several options.

In R

Apache Arrow

You can usually install the arrow package from CRAN. Note, however, that some CRAN builds are suboptimal at the time of writing, e.g. the macOS builds lack Parquet support, so it is best to install arrow from R-universe on these platforms.

Call arrow::read_parquet() to read Parquet files, and arrow::write_parquet() to write them. You can also use arrow::open_dataset() to open (one or more) Parquet files and perform queries on them without loading all data into memory.
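
For example, here is a sketch of querying the flights.parquet file we wrote above without loading all of it into memory. It assumes an arrow build with Parquet support and the dplyr package:

library(arrow)
library(dplyr)
flights_ds <- open_dataset("flights.parquet")
# Only the selected columns and the matching rows are read into memory
flights_ds |>
  filter(month == 1, dep_delay > 60) |>
  select(carrier, flight, dep_delay) |>
  collect()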

DuckDB

DuckDB is an excellent tool that handles Parquet files seamlessly. You can install the duckdb R package from CRAN.

To read a Parquet file into an R data frame with DuckDB, call

df <- duckdb:::sql("FROM 'file.parquet'")

Alternatively, you can open (one or more) Parquet files and query them as a DuckDB database, potentially without reading all data into memory at once.
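
For example, here is a sketch that aggregates the flights.parquet file we created above directly with SQL:

con <- DBI::dbConnect(duckdb::duckdb())
# DuckDB scans the Parquet file; only the carrier column is needed here
DBI::dbGetQuery(
  con,
  "SELECT carrier, count(*) AS n FROM 'flights.parquet' GROUP BY carrier"
)
DBI::dbDisconnect(con, shutdown = TRUE)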

Here is an example that shows how to put an R data frame into a (temporary) DuckDB database, and how to export it to Parquet:

drv <- duckdb::duckdb()
con <- DBI::dbConnect(drv)
on.exit(DBI::dbDisconnect(con), add = TRUE)
DBI::dbWriteTable(con, "mtcars", mtcars)

DBI::dbExecute(con, DBI::sqlInterpolate(con,
  "COPY mtcars TO ?filename (FORMAT 'parquet', COMPRESSION 'snappy')",
  filename = 'mtcars.parquet'
))

In Python

There are at least three good options to handle Parquet files in Python. Just like for R, the first two are Apache Arrow and DuckDB. You can also try the fastparquet Python package for a potentially lighter solution.

Acknowledgements

nanoparquet would not exist without the work of Hannes Mühleisen on miniparquet, which had similar goals but is now discontinued. nanoparquet is a fork of miniparquet.

nanoparquet also contains code and test Parquet files from DuckDB, Apache Parquet, Apache Arrow and Apache Thrift, as well as libraries from Google, Facebook, etc. See the COPYRIGHTS file in the repository for the full details.