We’re pleased to announce the release of readr 2.1.0. The readr package makes it easy to get rectangular data out of comma separated (csv), tab separated (tsv) or fixed width files (fwf) and into R. It is designed to flexibly parse many types of data found in the wild, while still cleanly failing when data unexpectedly changes.
The easiest way to install the latest version from CRAN is to install the whole tidyverse.
install.packages("tidyverse")
Alternatively, install just readr from CRAN:
install.packages("readr")
This blog post will discuss the recent change in readr 2.0 to lazy reading by default, and the recent change back to eager reading in readr 2.1.
You can see a full list of changes in the readr release notes.
The advantages of Lazy reading
readr 2.0 introduced ‘lazy’ reading by default. The idea of lazy reading is that instead of reading all the data in a CSV file up front you instead read it only on-demand.
For example the following code reads the column headers, filters based on column hp
then computes the mean of the filtered column mpg
.
library(tidyverse)
df <- read_csv(readr_example("mtcars.csv"), lazy = TRUE)
df |>
filter(hp > 200) |>
summarise(mean(mpg))
#> # A tibble: 1 × 1
#> `mean(mpg)`
#> <dbl>
#> 1 13.4
When you run this example readr would then only read the data in the orange colored parts of the file.
The long orange strip at the top is the column headers, the vertical bar is the data in the hp
column, and the dotted parts are the filtered mpg
values used to calculate the mean. As you can see depending on what you are doing with the file using lazy reading can drastically reduce the amount of the total file you end up needing to access.
This idea, first explored in the vroom package, can result in considerable speed improvements depending on the size of the file and what parts you are interested in. readr 2.0 used vroom under the hood to provide this type of lazy reading by default.
The problems with lazy reading
vroom was first released to CRAN in May 2019, and not added to readr until 2 years later in July 2021. Unfortunately usage of vroom was dwarfed by that of readr and the overall pool of users using vroom remained small. In particular the proportion of Windows users using vroom was much lower than those using readr. Crucially the behavior of lazy reading on Windows suffers due to how Windows works with file handles.
One major downside to lazy reading is that the program needs to keep a file handle open to the file. File handles are the low level way computer programs read to or write to a file. How they work varies by the operating system. POSIX (Portable Operating System Interface) systems like macOS and linux allow multiple processes to hold read only file handles to the same file. In contrast on Windows the situation is different, once a process opens a file handle that file is locked from other processes opening it for as long as the handle is open. If you have ever encountered a message like
File/Folder in Use. The action can’t be completed because the file is open in another program. Close the folder or file and try again.
Then this file handle locking behavior is the likely cause. You are trying to open a file in program A that program B also has open.
We were aware of this issue, however we underestimated the prevalence of situations where users would run into this problem and amount of user confusion this would entail.
Decision to change the default
Upon release of readr 2.0 most of the reaction was positive. However a number of people opened issues related to locked files and the use of lazy reading. Many of these cases occurred when users tried to open files in other programs like Excel or view it in the RStudio IDE, but there was another case we hadn’t considered in detail.
A number of users had workflows where they read in a file, cleaned the data in R, and then wrote back to that same file name in the same R session. vroom and readr’s writing functions had code in them to ensure if users did this with a lazily read data frame the data would be first fully read eagerly (and the file handle closed) before writing. However other functions we don’t control (like
utils::write.csv
,
data.table::fread()
, etc.) would have no notion of this problem and would therefore fail to work. In addition this failure is hard to reason about unless you have a good mental model of how lazy reading works and happened to know that readr 2.0 now used lazy reading by default.
Because of the prevalence of these issues we started to consider changing the default to eager rather than lazy reading. But to get a better sense of the community’s opinion we conducted a survey about this issue. We received over 250 responses to the survey (thanks to everyone who responded!) and the results were very conclusive.
- ~1/2 of the respondents used Windows
- ~3/4 of users overall would prefer
lazy = FALSE
as the default. - ~9/10 of Windows users would prefer
lazy = FALSE
as the default.
This reinforced our intuition that changing the default to lazy = FALSE
was the right choice for the community going forward.
Controlling the default yourself
Aside from changing the default to lazy = FALSE
readr 2.1 also gives users a way to control the lazy default themselves.
options(readr.read_lazy = TRUE)
This will change the default value back to lazy by default for the current R session. Note that this can have unintended consequences, code in downstream packages you are using may be using readr without your knowledge, and changing the default will also change their usage. The surest way to ensure consistency in your own code is to explicitly set either lazy = FALSE
or lazy = TRUE
when you call a
read_csv()
function.
Lessons learned
Reaching a more representative cross section of users and having them experiment with a new package is a challenge. As mentioned above, vroom was on CRAN for more than two years, and had significant performance advantages to readr, but even so only a small fraction of the community ended up using it. Crucially this usage did not reveal a complete enough picture of the challenges associated with lazy reading.
Most R users seem to prefer something which ‘just works’ for all use cases, even at the cost of reduced default performance.
Community surveys continue to be the best way to gauge the overall opinion of the community. In hindsight, we should have conducted the survey prior to the release of readr 2.0, though the full scope of the issue was not well known then.
Thank you to everyone who has used readr, opened an issue about this topic, or responded to the survey. Open source software is written to serve the community and your input is crucial to make sure we are making the best decisions. We apologize if this issue affected your work negatively and hope this article helps explain our rational for the initial behavior and change back. We hope readr will continue to make you more productive in the future.
Also a special thank you to all the 81 contributors who opened issues or contributed code since the readr 2.0 release, without your input readr would be a less useful package.
@a-hurst, @alon-sarid, @anonsmoose, @AshesITR, @bersbersbers, @boshek, @chrbknudsen, @christopherkenny, @cwby, @damianooldoni, @Darxor, @DizzyLimit, @djnavarro, @dongzhuoer, @dzhang32, @eutwt, @fernandovmacedo, @garrettgman, @garthtarr, @ggrothendieck, @gorkang, @hadley, @HakuShuu, @HedvigS, @HenrikBengtsson, @hidekoji, @hongyuanjia, @huixinz2, @ibombonato, @jcarbaut, @jeffeaton, @jennybc, @jimhester, @jimmyday12, @jkeuskamp, @jmbarbone, @jmobrien, @JoshuaSturm, @jpquast, @kiernann, @knokknok, @krlmlr, @l-gorman, @lindsayplatt, @lionel-, @lukeholman, @MilesMcBain, @mkvasnicka, @nigeljmckernan, @nik-humphries, @nstjhp, @oharac, @palderman, @peterdesmet, @pfreese, @PierreStevenin, @pieterjanvc, @pschloss, @rbufba, @rdinnager, @richelbilderbeek, @rvalieris, @s-andrews, @saulo1305, @sbachstein, @sdevine188, @ShinyFabio, @slodge, @snaut, @stephenturner, @svraka, @tarheel, @TCLamnidis, @thackl, @timothy-barry, @timothyslau, @tmelliott, @Tonyynot14, @UrsineWelles, @xinyu-zheng, and @yogat3ch.