haven 2.2.0

We’re delighted to announce that haven 2.2.0 is now on CRAN. haven enables R to read and write various data formats used by other statistical packages by wrapping the ReadStat C library written by Evan Miller.

You can install haven from CRAN with:

install.packages("haven")

This release features big improvements thanks to the hard work of Mikko Marttila: all read_*() functions gain three new arguments that allow you to read in only part of a large file. I’ll quickly show of these features by saving out the diamonds dataset to a Stata file:

library(haven)
write_dta(ggplot2::diamonds, "diamonds.dta")

n_max and skip allow you to read in just a portion of the rows:

read_dta("diamonds.dta", n_max = 5)
#> # A tibble: 5 x 10
#>   carat         cut     color   clarity depth table price     x     y     z
#>   <dbl>   <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.23  5 [Ideal]       2 [E]   2 [SI2]  61.5    55   326  3.95  3.98  2.43
#> 2 0.21  4 [Premium]     2 [E]   3 [SI1]  59.8    61   326  3.89  3.84  2.31
#> 3 0.23  2 [Good]        2 [E]   5 [VS1]  56.9    65   327  4.05  4.07  2.31
#> 4 0.290 4 [Premium]     6 [I]   4 [VS2]  62.4    58   334  4.2   4.23  2.63
#> 5 0.31  2 [Good]        7 [J]   2 [SI2]  63.3    58   335  4.34  4.35  2.75
read_dta("diamonds.dta", skip = 4, n_max = 5)
#> # A tibble: 5 x 10
#>   carat          cut     color  clarity depth table price     x     y     z
#>   <dbl>    <dbl+lbl> <dbl+lbl> <dbl+lb> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  0.31 2 [Good]         7 [J] 2 [SI2]   63.3    58   335  4.34  4.35  2.75
#> 2  0.24 3 [Very Goo…     7 [J] 6 [VVS2]  62.8    57   336  3.94  3.96  2.48
#> 3  0.24 3 [Very Goo…     6 [I] 7 [VVS1]  62.3    57   336  3.95  3.98  2.47
#> 4  0.26 3 [Very Goo…     5 [H] 3 [SI1]   61.9    55   337  4.07  4.11  2.53
#> 5  0.22 1 [Fair]         2 [E] 4 [VS2]   65.1    61   337  3.87  3.78  2.49

And col_select() allows you to read in just some of the columns, using the same syntax that you use with dplyr::select():

read_dta("diamonds.dta", col_select = c(x:z))
#> # A tibble: 53,940 x 3
#>        x     y     z
#>    <dbl> <dbl> <dbl>
#>  1  3.95  3.98  2.43
#>  2  3.89  3.84  2.31
#>  3  4.05  4.07  2.31
#>  4  4.2   4.23  2.63
#>  5  4.34  4.35  2.75
#>  6  3.94  3.96  2.48
#>  7  3.95  3.98  2.47
#>  8  4.07  4.11  2.53
#>  9  3.87  3.78  2.49
#> 10  4     4.05  2.39
#> # … with 53,930 more rows

read_dta("diamonds.dta", col_select = starts_with("c"))
#> # A tibble: 53,940 x 4
#>    carat           cut     color   clarity
#>    <dbl>     <dbl+lbl> <dbl+lbl> <dbl+lbl>
#>  1 0.23  5 [Ideal]         2 [E]  2 [SI2] 
#>  2 0.21  4 [Premium]       2 [E]  3 [SI1] 
#>  3 0.23  2 [Good]          2 [E]  5 [VS1] 
#>  4 0.290 4 [Premium]       6 [I]  4 [VS2] 
#>  5 0.31  2 [Good]          7 [J]  2 [SI2] 
#>  6 0.24  3 [Very Good]     7 [J]  6 [VVS2]
#>  7 0.24  3 [Very Good]     6 [I]  7 [VVS1]
#>  8 0.26  3 [Very Good]     5 [H]  3 [SI1] 
#>  9 0.22  1 [Fair]          2 [E]  4 [VS2] 
#> 10 0.23  3 [Very Good]     5 [H]  5 [VS1] 
#> # … with 53,930 more rows

These features allow you to read in datasets that would otherwise not fit in memory, and should substantially improve performance when you only need a few rows or columns from a large file.

This release includes a number of other bug fixes and small improvements, see the changelog for a complete list.

Acknowledgements

A big thanks to everyone who helped out with this release: @aaronrudkin, @batpigandme, @ccccfys, @courtiol, @hadley, @Hadsga, @jeremy17-Endo, @KyleHaynes, @lguangyu, @mihagazvoda, @mikmart, @MokeEire, @npaszty, @pvanheus, @RaymondBalise, @sclewis23, @shubham1637, @sigbertklinke, @steffen-stell, and @ttrodrigz.