We’re thrilled to announce the release of readr 2.0.0!
The readr package makes it easy to get rectangular data out of comma-separated (csv), tab-separated (tsv) or fixed-width (fwf) files and into R. It is designed to flexibly parse many types of data found in the wild, while still cleanly failing when data unexpectedly changes.
The easiest way to install the latest version from CRAN is to install the whole tidyverse.
install.packages("tidyverse")
Alternatively, install just readr from CRAN:
install.packages("readr")
This blog post will show off the most important changes to the package, including built-in support for reading multiple files at once, lazy reading, and automatic guessing of delimiters, among other changes.
You can see a full list of changes in the readr release notes and vroom release notes.
readr 2nd edition
readr 2.0.0 is a major release of readr and introduces a new 2nd edition parsing and writing engine implemented via the vroom package. This engine takes advantage of lazy reading, multi-threading and performance characteristics of modern SSD drives to significantly improve the performance of reading and writing compared to the 1st edition engine.
We have done our best to ensure that the two editions parse csv files as similarly as possible, but in case there are differences that affect your code, you can use the `with_edition()` or `local_edition()` functions to temporarily change the edition of readr for a section of code:

- `with_edition(1, read_csv("my_file.csv"))` will read `my_file.csv` with the 1st edition of readr.
- `readr::local_edition(1)` placed at the top of your function or script will use the 1st edition for the rest of the function or script.
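As a minimal sketch of both approaches (using `I()` to supply literal data inline rather than a file):

```r
library(readr)

# Parse a single call with the 1st edition engine
old <- with_edition(1, read_csv(I("a,b\n1,2")))

# Or switch the rest of the current scope to the 1st edition
local_edition(1)
also_old <- read_csv(I("a,b\n1,2"))
```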
We will continue to support the 1st edition for a number of releases, but our goal is to ensure that the 2nd edition is uniformly better than the 1st edition so we plan to eventually deprecate and then remove the 1st edition code.
Reading multiple files at once
The 2nd edition has built-in support for reading sets of files with the same columns into one output table in a single command. Just pass the filenames to be read in the same vector to the reading function.
First we generate some files to read by splitting the nycflights dataset by airline.
library(nycflights13)
purrr::iwalk(
  split(flights, flights$carrier),
  ~ vroom::vroom_write(.x, glue::glue("/tmp/flights_{.y}.tsv"), delim = "\t")
)
Then we can efficiently read them into one tibble by passing the filenames directly to readr.
If the filenames contain data, such as the date when the sample was collected, use the `id` argument to include the paths as a column in the data. You will likely have to post-process the paths to keep only the relevant portion for your use case.
library(dplyr)
files <- fs::dir_ls(path = "/tmp", glob = "*flights*tsv")
files
#> /tmp/flights_9E.tsv /tmp/flights_AA.tsv /tmp/flights_AS.tsv
#> /tmp/flights_B6.tsv /tmp/flights_DL.tsv /tmp/flights_EV.tsv
#> /tmp/flights_F9.tsv /tmp/flights_FL.tsv /tmp/flights_HA.tsv
#> /tmp/flights_MQ.tsv /tmp/flights_OO.tsv /tmp/flights_UA.tsv
#> /tmp/flights_US.tsv /tmp/flights_VX.tsv /tmp/flights_WN.tsv
#> /tmp/flights_YV.tsv
readr::read_tsv(files, id = "path")
#> Rows: 336776 Columns: 20
#> ── Column specification ───────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (4): carrier, tailnum, origin, dest
#> dbl (14): year, month, day, dep_time, sched_dep_time, dep_delay, arr_t...
#> dttm (1): time_hour
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 336,776 x 20
#> path year month day dep_time sched_dep_time dep_delay arr_time
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 /tmp/fligh… 2013 1 1 810 810 0 1048
#> 2 /tmp/fligh… 2013 1 1 1451 1500 -9 1634
#> 3 /tmp/fligh… 2013 1 1 1452 1455 -3 1637
#> 4 /tmp/fligh… 2013 1 1 1454 1500 -6 1635
#> 5 /tmp/fligh… 2013 1 1 1507 1515 -8 1651
#> 6 /tmp/fligh… 2013 1 1 1530 1530 0 1650
#> 7 /tmp/fligh… 2013 1 1 1546 1540 6 1753
#> 8 /tmp/fligh… 2013 1 1 1550 1550 0 1844
#> 9 /tmp/fligh… 2013 1 1 1552 1600 -8 1749
#> 10 /tmp/fligh… 2013 1 1 1554 1600 -6 1701
#> # … with 336,766 more rows, and 12 more variables: sched_arr_time <dbl>,
#> # arr_delay <dbl>, carrier <chr>, flight <dbl>, tailnum <chr>,
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> # minute <dbl>, time_hour <dttm>
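As a sketch of that post-processing, the carrier code embedded in each filename can be pulled out of the `path` column (this assumes the files written above; `carrier_from_path` is just an illustrative name):

```r
library(dplyr)
library(readr)

files <- fs::dir_ls(path = "/tmp", glob = "*flights*tsv")
flights_all <- read_tsv(files, id = "path")

# Keep only the carrier code from each path,
# e.g. "/tmp/flights_AA.tsv" -> "AA"
flights_all <- flights_all %>%
  mutate(carrier_from_path = sub(".*flights_([A-Z0-9]+)\\.tsv$", "\\1", path))
```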
Lazy reading
Like vroom, the 2nd edition uses lazy reading by default. This means that when you first call a `read_*()` function, the delimiters and newlines throughout the entire file are found, but the data is not actually read until it is used in your program. This can provide substantial speed improvements for reading character data, and it is particularly useful during interactive exploration of only a subset of a full dataset.

However, this also means that problematic values are not necessarily seen immediately, only when they are actually read. Because of this, a warning will be issued the first time a problem is encountered, which may happen after the initial reading.

Run `problems()` on your dataset to read the entire dataset and return all of the problems found. Run `problems(lazy = TRUE)` if you only want to retrieve the problems found so far.
Deleting files after reading is also impacted by laziness. On Windows, open files cannot be deleted as long as a process has the file open. Because readr keeps a file open when reading lazily, you cannot read a file and then immediately delete it. readr will in most cases close the file once it has been completely read. However, if you know you want to delete the file after reading it, it is best to pass `lazy = FALSE` when reading the file.
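A quick sketch of both behaviours, using one of the files written above:

```r
library(readr)

# Lazy (the default): the file is indexed now, but values are parsed
# only when used; problems() forces a full read and reports any issues
lazy_df <- read_tsv("/tmp/flights_AA.tsv")
problems(lazy_df)

# Eager: read everything up front, so the file handle is released
# and the file could be deleted immediately afterwards
eager_df <- read_tsv("/tmp/flights_AA.tsv", lazy = FALSE)
```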
Delimiter guessing
The 2nd edition supports automatic guessing of delimiters. This feature is inspired by the automatic guessing in `data.table::fread()`, though the precise method used to perform the guessing differs. Because of this, you can now use `read_delim()` without specifying a `delim` argument in many cases.
x <- read_delim(readr_example("mtcars.csv"))
#> Rows: 32 Columns: 11
#> ── Column specification ───────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (11): mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
New column specification output
On February 11, 2021 we conducted a survey on Twitter asking for the community’s opinion on the column specification output in readr. We received over 750 😲 responses to the survey, and it revealed a lot of useful information:
- 3/4 of respondents found printing the column specifications helpful. 👍
- 2/3 of respondents preferred the 2nd edition output vs 1st edition output. 💅
- Only 1/5 of respondents correctly knew how to suppress printing of the column specifications. 🤯
Based on these results we have added two new ways to more easily suppress the column specification printing.

- Use `read_csv(show_col_types = FALSE)` to disable printing for a single function call.
- Use `options(readr.show_col_types = FALSE)` to disable printing for the entire session.
We will also continue to print the column specifications and use the new style output.
Note that you can still obtain the old output style by printing the column specification object directly.
spec(x)
#> cols(
#> mpg = col_double(),
#> cyl = col_double(),
#> disp = col_double(),
#> hp = col_double(),
#> drat = col_double(),
#> wt = col_double(),
#> qsec = col_double(),
#> vs = col_double(),
#> am = col_double(),
#> gear = col_double(),
#> carb = col_double()
#> )
Or show the new style by calling `summary()` on the specification object.
summary(spec(x))
#> ── Column specification ───────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (11): mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
Column selection
The 2nd edition introduces a new argument, `col_select`, which makes selecting columns to keep (or omit) more straightforward than before. `col_select` uses the same interface as `dplyr::select()`, so you can perform very flexible selection operations.
- Select with the column names directly.

data <- read_tsv("/tmp/flights_AA.tsv", col_select = c(year, flight, tailnum))
#> Rows: 32729 Columns: 3
#> ── Column specification ───────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (1): tailnum
#> dbl (2): year, flight
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

- Or by numeric column.

data <- read_tsv("/tmp/flights_AA.tsv", col_select = c(1, 2))
#> Rows: 32729 Columns: 2
#> ── Column specification ───────────────────────────────────────────────────
#> Delimiter: "\t"
#> dbl (2): year, month
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

- Drop columns by name by prefixing them with -.

data <- read_tsv("/tmp/flights_AA.tsv", col_select = c(-dep_time, -(air_time:time_hour)))
#> Rows: 32729 Columns: 13
#> ── Column specification ───────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (4): carrier, tailnum, origin, dest
#> dbl (9): year, month, day, sched_dep_time, dep_delay, arr_time, sched_a...
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

- Use the selection helpers such as ends_with().

data <- read_tsv("/tmp/flights_AA.tsv", col_select = ends_with("time"))
#> Rows: 32729 Columns: 5
#> ── Column specification ───────────────────────────────────────────────────
#> Delimiter: "\t"
#> dbl (5): dep_time, sched_dep_time, arr_time, sched_arr_time, air_time
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

- Or even rename columns by using a named list.

data <- read_tsv("/tmp/flights_AA.tsv", col_select = list(plane = tailnum, everything()))
#> Rows: 32729 Columns: 19
#> ── Column specification ───────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (4): carrier, tailnum, origin, dest
#> dbl (14): year, month, day, dep_time, sched_dep_time, dep_delay, arr_t...
#> dttm (1): time_hour
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data
#> # A tibble: 32,729 x 19
#>    plane   year month   day dep_time sched_dep_time dep_delay arr_time
#>    <chr>  <dbl> <dbl> <dbl>    <dbl>          <dbl>     <dbl>    <dbl>
#>  1 N619AA  2013     1     1      542            540         2      923
#>  2 N3ALAA  2013     1     1      558            600        -2      753
#>  3 N3DUAA  2013     1     1      559            600        -1      941
#>  4 N633AA  2013     1     1      606            610        -4      858
#>  5 N3EMAA  2013     1     1      623            610        13      920
#>  6 N3BAAA  2013     1     1      628            630        -2     1137
#>  7 N3CYAA  2013     1     1      629            630        -1      824
#>  8 N3GKAA  2013     1     1      635            635         0     1028
#>  9 N4WNAA  2013     1     1      656            700        -4      854
#> 10 N5FMAA  2013     1     1      656            659        -3      949
#> # … with 32,719 more rows, and 11 more variables: sched_arr_time <dbl>,
#> #   arr_delay <dbl>, carrier <chr>, flight <dbl>, origin <chr>,
#> #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dttm>
Name repair
Often the names of columns in the original dataset are not ideal to work with. The 2nd edition uses the same name_repair argument as in the tibble package, so you can use one of the default name repair strategies or provide a custom function. One useful approach is to use the janitor::make_clean_names() function.
read_tsv("/tmp/flights_AA.tsv", name_repair = janitor::make_clean_names)
#> # A tibble: 32,729 x 19
#> year month day dep_time sched_dep_time dep_delay arr_time
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2013 1 1 542 540 2 923
#> 2 2013 1 1 558 600 -2 753
#> 3 2013 1 1 559 600 -1 941
#> 4 2013 1 1 606 610 -4 858
#> 5 2013 1 1 623 610 13 920
#> 6 2013 1 1 628 630 -2 1137
#> 7 2013 1 1 629 630 -1 824
#> 8 2013 1 1 635 635 0 1028
#> 9 2013 1 1 656 700 -4 854
#> 10 2013 1 1 656 659 -3 949
#> # … with 32,719 more rows, and 12 more variables: sched_arr_time <dbl>,
#> # arr_delay <dbl>, carrier <chr>, flight <dbl>, tailnum <chr>,
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> # minute <dbl>, time_hour <dttm>
read_tsv("/tmp/flights_AA.tsv", name_repair = ~ janitor::make_clean_names(., case = "lower_camel"))
#> # A tibble: 32,729 x 19
#> year month day depTime schedDepTime depDelay arrTime schedArrTime
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2013 1 1 542 540 2 923 850
#> 2 2013 1 1 558 600 -2 753 745
#> 3 2013 1 1 559 600 -1 941 910
#> 4 2013 1 1 606 610 -4 858 910
#> 5 2013 1 1 623 610 13 920 915
#> 6 2013 1 1 628 630 -2 1137 1140
#> 7 2013 1 1 629 630 -1 824 810
#> 8 2013 1 1 635 635 0 1028 940
#> 9 2013 1 1 656 700 -4 854 850
#> 10 2013 1 1 656 659 -3 949 959
#> # … with 32,719 more rows, and 11 more variables: arrDelay <dbl>,
#> # carrier <chr>, flight <dbl>, tailnum <chr>, origin <chr>, dest <chr>,
#> # airTime <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> # timeHour <dttm>
UTF-16 and UTF-32 support
The 2nd edition now has much better support for the UTF-16 and UTF-32 multi-byte Unicode encodings. When files with these encodings are read, they are automatically converted to UTF-8 internally, in an efficient streaming fashion.
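For example, a UTF-16 file can be read by naming its encoding in the locale (a small sketch that writes a throwaway UTF-16LE file first; the path is just for illustration):

```r
library(readr)

# Write a tiny UTF-16LE encoded csv purely for illustration
con <- file("/tmp/utf16-demo.csv", open = "w", encoding = "UTF-16LE")
writeLines(c("a,b", "1,2"), con)
close(con)

# readr converts the file to UTF-8 internally while reading
read_csv("/tmp/utf16-demo.csv", locale = locale(encoding = "UTF-16LE"))
```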
Control over quoting and escaping when writing
You can now explicitly control how fields are quoted and escaped when writing with the `quote` and `escape` arguments to the `write_*()` functions.

`quote` has three options:

- "needed", which will quote fields only when needed.
- "all", which will always quote all fields.
- "none", which will never quote any fields.

`escape` also has three options, to control how quote characters are escaped:

- "double", which will use double quotes to escape quotes.
- "backslash", which will use a backslash to escape quotes.
- "none", which will not do anything to escape quotes.
We hope these options will give people the flexibility they need when writing files using readr.
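For instance, a minimal sketch (the output paths are just for illustration):

```r
library(readr)

df <- data.frame(x = c("plain", "has \"quotes\""), y = 1:2)

# Quote every field, escaping embedded quotes with a backslash
write_csv(df, "/tmp/quote-demo-all.csv", quote = "all", escape = "backslash")

# Quote only when needed, doubling embedded quotes (the csv default)
write_csv(df, "/tmp/quote-demo-needed.csv", quote = "needed", escape = "double")
```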
Literal data
In the 1st edition the reading functions treated any input with a newline in it, or vectors of length > 1, as literal data. In the 2nd edition, vectors of length > 1 are now assumed to correspond to multiple files. Because of this, we now have a more explicit way to represent literal data: put `I()` around the input.
readr::read_csv(I("a,b\n1,2"))
#> Rows: 1 Columns: 2
#> ── Column specification ───────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (2): a, b
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1 x 2
#> a b
#> <dbl> <dbl>
#> 1 1 2
Lighter installation requirements
readr should now be much easier to install. Previous versions of readr used the Boost C++ library to do some of the numeric parsing. While these are well written, robust libraries, the BH package which contains them has a large number of files (1500+), which can take a long time to install. In addition, the code within these headers is complicated and can require a large amount of memory (2+ GB) to compile, which made it challenging to compile readr from source in some cases.
readr no longer depends on Boost or the BH package, so should compile more quickly in most cases.
Deprecated and superseded functions and features
- `melt_csv()`, `melt_delim()`, `melt_tsv()` and `melt_fwf()` have been superseded by functions of the same name in the meltr package, and the versions in readr have been deprecated. These functions rely on the 1st edition parsing code and would be challenging to update to the new parser; they will be removed when the 1st edition parsing code is eventually removed from readr.
- `read_table2()` has been renamed to `read_table()` and `read_table2()` has been deprecated. Most users seem to expect `read_table()` to work like `utils::read.table()`, so the different names caused confusion. If you want the previous strict behavior of `read_table()` you can use `read_fwf()` with `fwf_empty()` directly (#717).
- Normalizing newlines in files that use just carriage returns (`\r`) is no longer supported. The last major OS to use only CR as the newline was ‘classic’ Mac OS, which had its final release in 2001.
License changes
We are systematically re-licensing tidyverse and r-lib packages to use the MIT license, to make our package licenses as clear and permissive as possible.
To this end the readr and vroom packages are now released under the MIT license.
Acknowledgements
A big thanks to everyone who helped make this release possible by testing the development versions, asking questions, providing reprexes, writing code and more! @Aariq, @adamroyjones, @antoine-sachet, @basille, @batpigandme, @benjaminhlina, @bigey, @billdenney, @binkleym, @BrianOB, @cboettig, @CTMCBP, @Dana996, @DarwinAwardWinner, @deeenes, @dernst, @dicorynia, @estroger34, @FixTestRepeat, @GegznaV, @giocomai, @GiuliaPais, @hadley, @HedvigS, @HenrikBengtsson, @hidekoji, @hongooi73, @hsbadr, @idshklein, @jasyael, @JeremyPasco, @jimhester, @jonasfoe, @jzadra, @KasperThystrup, @keesdeschepper, @kingcrimsontianyu, @KnutEBakke, @krlmlr, @larnsce, @ldecicco-USGS, @M3IT, @maelle, @martinmodrak, @meowcat, @messersc, @mewu3, @mgperry, @michaelquinn32, @MikeJohnPage, @mine-cetinkaya-rundel, @msberends, @nbenn, @niheaven, @peranti, @petrbouchal, @pfh, @pgramme, @Raesu, @rmcd1024, @rmvpaeme, @sebneus, @seth127, @Shians, @sonicdoe, @svraka, @timothy-barry, @tmalsburg, @vankesteren, @xuqingyu, and @yutannihilation.