Version 2.0.0 of the tibble package is almost ready for release. Tibbles are a modern reimagining of the data frame, keeping what time has shown to be effective, and throwing out what is not, with nicer default output too! Grab the development version with:
devtools::install_github("tidyverse/tibble")
We’re making a pre-release announcement, because some changes require the attention of maintainers of packages that import or otherwise depend on tibble. This post describes how to adapt to the next version of tibble and is also an invitation for maintainers to provide feedback before v2.0.0 is finalized and submitted to CRAN. The easiest way to get in touch is to file an issue at https://github.com/tidyverse/tibble/issues (or to comment on an existing one). This blog post is aimed at package developers and those who maintain “production” scripts or apps. A high-level overview of new user-facing features will come in a separate blog post.
Reverse dependency checks
We ran R CMD check
for over 3000 CRAN and Bioconductor packages that depend directly or indirectly on the tibble package and compared results obtained with the CRAN versus development version of tibble.
We will notify the maintainers of all affected packages (regardless of the check results of their package) and aim for a CRAN release before Christmas, so the dust has settled in time for
rstudio::conf.
We made pull requests to implement the necessary changes in several of the most heavily downloaded packages. Based on this experience, this post highlights the problems downstream maintainers are most likely to see and how to solve them. Most fixes should be quite simple.
For the full list of changes, features, and bug fixes, please see the release notes.
Tibble construction and validation
End users should use the
tibble()
function to construct tibbles.
It checks the input for consistency and makes sure that the returned tibble is valid.
Package developers, however, can also consider the low-level
new_tibble()
constructor. Use
new_tibble()
to quickly construct a tibble from a list if you are very sure that the input is well-formed (i.e., a list of vectors of equal length).
This function also supports the construction of subclasses of tibble through the class
argument.
In the development version of tibble, the
new_tibble()
constructor is a more faithful implementation of the
design advice for S3 classes given in Advanced R.
Specifically:
-
new_tibble()
is very fast and does very little checking itself. - The new
validate_tibble()
function is responsible for validating the structure of a tibble.
This means that the nrow
argument to
new_tibble()
is now mandatory.
We are aware that this might be the single most disruptive change, but we think that any guesswork here would be detrimental to stability (especially in corner cases) and that this particular problem is very easy to fix.
The nrow
argument already existed in tibble v1.4.2, so code that uses it requires no change and should continue to work.
If you need to add nrow
arguments to
new_tibble()
calls, you can do so independently of the tibble v2.0.0 release.
Please be aware that nrow
must be passed as a named argument, because it comes after the ellipsis ...
in the signature. Here are common patterns for setting the nrow
argument:
library(tibble)
x <- data.frame(a = 1)
# Code that lacks `nrow` fails
new_tibble(x)
#> Error: Must pass a scalar integer as `nrow` argument to `new_tibble()`.
# Fix by specifying `nrow`
new_tibble(x, nrow = nrow(x)) # if x is a data frame
#> # A tibble: 1 x 1
#> a
#> <dbl>
#> 1 1
nrow_x <- NROW(x[[1]]) # if x has at least one column
# nrow_x <- ... # if the number of rows is given elsewhere
new_tibble(x, nrow = nrow_x)
#> # A tibble: 1 x 1
#> a
#> <dbl>
#> 1 1
Coercion and name repair
The tibble mentality has always been that the user is responsible for managing column names, i.e. names are not automatically munged. This remains true, but the development version of tibble is stricter about names and offers more support for name repair.
In the development version of tibble, by default, column names must exist and be unique. Some packages use
as_tibble()
internally to coerce a dysfunctionally-named object to a tibble and then apply proper column names. Here’s a typical error and solution:
library(tibble)
(m <- cov(unname(iris[-5])))
#> [,1] [,2] [,3] [,4]
#> [1,] 0.6856935 -0.0424340 1.2743154 0.5162707
#> [2,] -0.0424340 0.1899794 -0.3296564 -0.1216394
#> [3,] 1.2743154 -0.3296564 3.1162779 1.2956094
#> [4,] 0.5162707 -0.1216394 1.2956094 0.5810063
# problematic approach:
# 1. make tibble
# 2. apply nice names
x <- as_tibble(m)
#> Error: Columns 1, 2, 3, 4 must be named.
#> Use .name_repair to specify repair.
colnames(x) <- letters[1:4]
#> Error in names(x) <- value: 'names' attribute [4] must be the same length as the vector [1]
# better approach that works with tibble v1.4.2 AND dev tibble:
# 1. apply nice names
# 2. make tibble
colnames(m) <- letters[1:4]
as_tibble(m)
#> # A tibble: 4 x 4
#> a b c d
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.686 -0.0424 1.27 0.516
#> 2 -0.0424 0.190 -0.330 -0.122
#> 3 1.27 -0.330 3.12 1.30
#> 4 0.516 -0.122 1.30 0.581
If possible, we recommend applying your “good” column names prior to calling
as_tibble()
. This creates code that works with tibble v1.4.2 and the development version, which is very appealing. For good examples, see these pull requests to
drake,
prophet, and
broom.
It is also possible to use the new .name_repair
argument in
tibble()
and
as_tibble()
(more below) to explicitly declare your intention around column names. This code would require packageVersion("tibble") >= "2.0.0"
:
# Alternative: use new `.name_repair` argument to permit dysfunctional names
m <- cov(unname(iris[-5]))
as_tibble(m, .name_repair = "minimal")
#> # A tibble: 4 x 4
#> `` `` `` ``
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.686 -0.0424 1.27 0.516
#> 2 -0.0424 0.190 -0.330 -0.122
#> 3 1.27 -0.330 3.12 1.30
#> 4 0.516 -0.122 1.30 0.581
# Alternative: use new `.name_repair` argument to fix dysfunctional names
m <- cov(unname(iris[-5]))
as_tibble(m, .name_repair = "unique")
#> New names:
#> * `` -> `..1`
#> * `` -> `..2`
#> * `` -> `..3`
#> * `` -> `..4`
#> # A tibble: 4 x 4
#> ..1 ..2 ..3 ..4
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.686 -0.0424 1.27 0.516
#> 2 -0.0424 0.190 -0.330 -0.122
#> 3 1.27 -0.330 3.12 1.30
#> 4 0.516 -0.122 1.30 0.581
What is the motivation for this increased attention to column names? The tibble package is offering stronger encouragement for names where each column can be identified by name and, preferably, without having to resort to backticks. Column names that don’t meet these requirements are still allowed, but the user needs to permit them explicitly.
After all, there are scenarios where problematic names should be tolerated. For example, after importing data, the user might need to inspect the data in order to determine which columns to keep. Or perhaps the column names contain data that is about to be converted to a proper variable with
gather()
.
The
tibble()
constructor and the
as_tibble()
generic now support a new .name_repair
argument that covers most use cases:
"minimal"
: No name repair or checks, beyond basic existence."unique"
: Make sure names are unique and not empty."check_unique"
: (default value), no name repair, but check they areunique
."universal"
: Make the namesunique
and syntactic.- a function: apply custom name repair (e.g.,
.name_repair = make.names
or.name_repair = ~make.names(., unique = TRUE)
for names in the style of base R).
See
?`name-repair`
for more details.
Packages that are in the business of making tibbles may even want to expose the .name_repair
argument and pass it through to
tibble()
or
as_tibble()
.
For example, this is the approach planned for
readxl, which reads rectangular data out of Excel workbooks.
validate
in
as_tibble()
In tibble v1.4.2,
as_tibble()
has a validate
argument, but its default behaviour value was inconsistent across different methods and there was no equivalent argument for
tibble()
. The validate
argument is now soft-deprecated and its use will trigger a message, once per session. The validate
argument will eventually be removed, but for now it can be used jointly with the new .name_repair
argument (without even triggering a message). This is possible, because fortunately the .name_repair
argument to
as_tibble()
is ignored in tibble v1.4.2.
Here’s what validate
does in the development version of tibble for tibbles, data frames, and matrices, along with suggested alternatives.
Tibbles and data frames
The default was validate = FALSE
for tibbles and validate = TRUE
for data frames.
Code that worked before for tibbles can now throw unexpected errors if the resulting tibble has problematic names.
To avoid warnings with tibble v2.0.0, use the default instead of validate = TRUE
, and .name_repair = "minimal"
in addition to validate = FALSE
.
If your code targets tibble >= v2.0.0 exclusively, you can remove the validate
argument.
df <- new_tibble(list(a = 5, a = 6), nrow = 1)
# errors, as it should, because names are duplicated ... but also messages
as_tibble(df, validate = TRUE)
#> The `validate` argument to `as_tibble()` is deprecated. Please use `.name_repair` to control column names.
#> Error: Column name `a` must not be duplicated.
#> Use .name_repair to specify repair.
# errors due to default .name_repair = "check_unique"
# (but no error in tibble v1.4.2)
as_tibble(df)
#> Error: Column name `a` must not be duplicated.
#> Use .name_repair to specify repair.
# ensures that the validate = TRUE default is used for tibble < 2.0.0
as_tibble(as.data.frame(df))
#> Error: Column name `a` must not be duplicated.
#> Use .name_repair to specify repair.
# no error ... but still messages
as_tibble(df, validate = FALSE)
#> # A tibble: 1 x 2
#> a a
#> <dbl> <dbl>
#> 1 5 6
# no error, quietly
as_tibble(df, .name_repair = "minimal")
#> # A tibble: 1 x 2
#> a a
#> <dbl> <dbl>
#> 1 5 6
# no error, quietly, compatible with tibble < 2.0.0
as_tibble(df, validate = FALSE, .name_repair = "minimal")
#> # A tibble: 1 x 2
#> a a
#> <dbl> <dbl>
#> 1 5 6
Matrices and other objects
The validate
argument now triggers a message, it was silently ignored in v1.4.2.
For compatibility with v2.0.0, remove the validate
argument, or add a consistent .name_repair
argument.
If you need anything other than "minimal"
or "check_unique"
and need to keep the validate
argument, rename the columns beforehand.
m <- cov(iris[-5])
# Assign colnames() if necessary
as_tibble(m)
#> # A tibble: 4 x 4
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.686 -0.0424 1.27 0.516
#> 2 -0.0424 0.190 -0.330 -0.122
#> 3 1.27 -0.330 3.12 1.30
#> 4 0.516 -0.122 1.30 0.581
as_tibble(m, validate = TRUE, .name_repair = "check_unique")
#> # A tibble: 4 x 4
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.686 -0.0424 1.27 0.516
#> 2 -0.0424 0.190 -0.330 -0.122
#> 3 1.27 -0.330 3.12 1.30
#> 4 0.516 -0.122 1.30 0.581
Deprecation of tidy_names()
and set_tidy_names()
The existing tidy_names()
and set_tidy_names()
functions are soft-deprecated, but remain available, unchanged. In the future, they could go away or take on a new meaning, i.e. implement a different algorithm for name repair. New code should use .name_repair
instead.
df <- new_tibble(list(a = 5, a = 6), nrow = 1)
# these functions are soft-deprecated
tidy_names(names(df))
#> New names:
#> a -> a..1
#> a -> a..2
#> [1] "a..1" "a..2"
set_tidy_names(df)
#> New names:
#> a -> a..1
#> a -> a..2
#> # A tibble: 1 x 2
#> a..1 a..2
#> <dbl> <dbl>
#> 1 5 6
# achieve same via `.name_repair`
as_tibble(df, .name_repair = "universal")
#> New names:
#> * a -> a..1
#> * a -> a..2
#> # A tibble: 1 x 2
#> a..1 a..2
#> <dbl> <dbl>
#> 1 5 6
tibble’s name repair strategies are currently only exposed in
tibble()
and
as_tibble()
, not (yet?) as utility functions that operate on a vector of names.
Other changes
Intentionally assigning invalid names to a tibble via names<-()
is generally a bad idea and this now warns (once per session).
df <- tibble(a = 1)
names(df) <- NA
#> Warning: Column 1 must be named.
#> Warning: Must use a character vector as names.
Coercing a vector to a tibble is no longer supported and emits a warning once per session. It’s not clear if the result should be a tibble with one row or one column. We plan to revisit this in a future version, with an unambiguous interface.
x <- 1:3
# Old:
as_tibble(x)
#> Warning: Calling `as_tibble()` on a vector is discouraged, because the
#> behavior is likely to change in the future. Use `enframe(name = NULL)`
#> instead.
#> # A tibble: 3 x 1
#> value
#> <int>
#> 1 1
#> 2 2
#> 3 3
# New (>= 2.0.0):
enframe(x, name = NULL)
#> # A tibble: 3 x 1
#> value
#> <int>
#> 1 1
#> 2 2
#> 3 3
# New (legacy):
tibble(value = x)
#> # A tibble: 3 x 1
#> value
#> <int>
#> 1 1
#> 2 2
#> 3 3