We’re thrilled to announce the release of rsample 1.3.0. rsample makes it easy to create resamples for assessing model performance. It is part of the tidymodels framework, a collection of R packages for modeling and machine learning using tidyverse principles.
You can install it from CRAN with:
install.packages("rsample")
This blog post will walk you through the more flexible grouping for calculating bootstrap confidence intervals and highlight the contributions made by participants of the tidyverse developer day.
You can see a full list of changes in the release notes.
Flexible grouping for bootstrap intervals
Resampling allows you to get an understanding of the variability of an estimate, e.g., a summary statistic of your data. If you want to lean on statistical theory and get confidence intervals for your estimate, you can reach for the bootstrap resampling scheme: calculating your summary statistic on the bootstrap samples enables you to calculate confidence intervals around your point estimate.
rsample contains a family of int_*() functions to calculate bootstrap confidence intervals of different flavors: percentile intervals, “BCa” intervals, and bootstrap-t intervals. If you want to dive into the technical details, Chapter 11 of CASI is a good place to start.
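To make the idea concrete, here is a minimal base R sketch of a percentile interval computed by hand. It is not how rsample implements int_pctl(); the data (mtcars), the statistic (the mean of mpg), and the number of resamples are all choices made purely for illustration.

```r
# Hand-rolled percentile bootstrap interval for a mean (illustration only).
set.seed(123)

x <- mtcars$mpg
n_boot <- 2000   # number of bootstrap resamples (arbitrary choice)
alpha <- 0.1     # 1 - alpha is the confidence level, here 90%

# Resample with replacement and recompute the statistic each time.
boot_means <- replicate(n_boot, mean(sample(x, replace = TRUE)))

# The percentile interval is given by the empirical quantiles of the
# bootstrap distribution of the statistic.
ci <- quantile(boot_means, probs = c(alpha / 2, 1 - alpha / 2))

point_estimate <- mean(x)
c(lower = unname(ci[1]), estimate = point_estimate, upper = unname(ci[2]))
```

The int_*() functions take care of this bookkeeping for you and additionally handle grouping, which is what the rest of this section is about.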
You can calculate the confidence intervals based on a grouping in your data. However, so far, rsample would only let you provide a single grouping variable. With this release, we are extending this functionality to allow a more flexible grouping.
The motivating application for us was to be able to calculate confidence intervals around multiple model performance metrics, including dynamic metrics for time-to-event models which depend on an evaluation time point. So in this case, the metric is one grouping variable and the evaluation time another. But let’s pull back complexity for an example of how the new rsample functionality works!
We have a dataset with delivery times for orders containing one or more items. We’ll do some data wrangling with it, so we are also loading dplyr.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
data(deliveries, package = "modeldata")
deliveries
#> # A tibble: 10,012 × 31
#> time_to_delivery hour day distance item_01 item_02 item_03 item_04 item_05
#> <dbl> <dbl> <fct> <dbl> <int> <int> <int> <int> <int>
#> 1 16.1 11.9 Thu 3.15 0 0 2 0 0
#> 2 22.9 19.2 Tue 3.69 0 0 0 0 0
#> 3 30.3 18.4 Fri 2.06 0 0 0 0 1
#> 4 33.4 15.8 Thu 5.97 0 0 0 0 0
#> 5 27.2 19.6 Fri 2.52 0 0 0 1 0
#> 6 19.6 13.0 Sat 3.35 1 0 0 1 0
#> 7 22.1 15.5 Sun 2.46 0 0 1 1 0
#> 8 26.6 17.0 Thu 2.21 0 0 1 0 0
#> 9 30.8 16.7 Fri 2.62 0 0 0 0 0
#> 10 17.4 11.9 Sun 2.75 0 2 1 0 0
#> # ℹ 10,002 more rows
#> # ℹ 22 more variables: item_06 <int>, item_07 <int>, item_08 <int>,
#> # item_09 <int>, item_10 <int>, item_11 <int>, item_12 <int>, item_13 <int>,
#> # item_14 <int>, item_15 <int>, item_16 <int>, item_17 <int>, item_18 <int>,
#> # item_19 <int>, item_20 <int>, item_21 <int>, item_22 <int>, item_23 <int>,
#> # item_24 <int>, item_25 <int>, item_26 <int>, item_27 <int>
Instead of fitting a whole model here, we are calculating a straightforward summary statistic for how much delivery time increases if an item is included in the order. So the item is one grouping factor. As a second one, we are using whether the order was delivered on a weekday or a weekend. Let’s start by making that weekend indicator and reshaping the data to make it easier to calculate our summary statistic.
Note that the name for the weekend indicator column, .weekend, starts with a dot. That is important as it is the convention to signal to rsample that this is an additional grouping variable.
item_data <- deliveries %>%
  mutate(.weekend = ifelse(day %in% c("Sat", "Sun"), "weekend", "weekday")) %>%
  select(time_to_delivery, .weekend, starts_with("item")) %>%
  tidyr::pivot_longer(starts_with("item"), names_to = "item", values_to = "value")
Next, we are making a small function that calculates the ratio of average delivery times with and without the item included in the order, as an estimate of how much a specific item in an order increases the delivery time.
relative_increase <- function(data) {
  data %>%
    mutate(includes_item = value > 0) %>%
    summarize(
      has = mean(time_to_delivery[includes_item]),
      has_not = mean(time_to_delivery[!includes_item]),
      .by = c(item, .weekend)
    ) %>%
    mutate(estimate = has / has_not) %>%
    select(term = item, .weekend, estimate)
}
We can calculate that on our entire dataset.
relative_increase(item_data)
#> # A tibble: 54 × 3
#> term .weekend estimate
#> <chr> <chr> <dbl>
#> 1 item_01 weekday 1.07
#> 2 item_02 weekday 1.02
#> 3 item_03 weekday 1.02
#> 4 item_04 weekday 1.00
#> 5 item_05 weekday 1.00
#> 6 item_06 weekday 1.01
#> 7 item_07 weekday 1.03
#> 8 item_08 weekday 1.01
#> 9 item_09 weekday 1.01
#> 10 item_10 weekday 1.06
#> # ℹ 44 more rows
This is fine, but what we really want here is to get confidence intervals around these estimates!
So let’s make bootstrap samples and calculate our statistic on those.
# bootstraps(), analysis(), and int_pctl() come from rsample
library(rsample)

set.seed(1)
item_bootstrap <- bootstraps(item_data, times = 1000)
item_stats <-
  item_bootstrap %>%
  mutate(stats = purrr::map(splits, ~ analysis(.x) %>% relative_increase()))
Now we have everything we need to calculate the confidence intervals, stashed into the tibbles in the stats column: an estimate, a term (the primary grouping variable), and our additional grouping variable .weekend, starting with a dot. What’s left to do is call one of the int_*() functions and specify which column contains the statistics. Here, we’ll calculate percentile intervals with int_pctl().
item_ci <- int_pctl(item_stats, statistics = stats, alpha = 0.1)
item_ci
#> # A tibble: 54 × 7
#> term .weekend .lower .estimate .upper .alpha .method
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 item_01 weekday 1.05 1.07 1.09 0.1 percentile
#> 2 item_01 weekend 1.04 1.07 1.10 0.1 percentile
#> 3 item_02 weekday 1.00 1.02 1.03 0.1 percentile
#> 4 item_02 weekend 0.996 1.01 1.03 0.1 percentile
#> 5 item_03 weekday 1.01 1.02 1.04 0.1 percentile
#> 6 item_03 weekend 0.970 0.990 1.01 0.1 percentile
#> 7 item_04 weekday 0.989 1.00 1.02 0.1 percentile
#> 8 item_04 weekend 0.998 1.02 1.03 0.1 percentile
#> 9 item_05 weekday 0.987 1.00 1.02 0.1 percentile
#> 10 item_05 weekend 0.982 1.00 1.03 0.1 percentile
#> # ℹ 44 more rows
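If you want to try the same workflow without the deliveries data, the following is a compact, self-contained sketch on mtcars. The statistic, the column name .am, and the helper mpg_stats() are made up for this illustration; the sketch assumes rsample 1.3.0 or later so that the dot-prefixed column is picked up as an additional grouping variable.

```r
library(rsample)
library(dplyr)

# Statistic: mean and median of mpg, separately for each transmission type.
# `term` is the primary grouping variable; `.am` starts with a dot so that
# int_pctl() treats it as an additional grouping variable.
mpg_stats <- function(data) {
  data %>%
    summarize(mean = mean(mpg), median = median(mpg), .by = am) %>%
    tidyr::pivot_longer(c(mean, median), names_to = "term", values_to = "estimate") %>%
    rename(.am = am)
}

set.seed(2)
car_stats <- bootstraps(mtcars, times = 1000) %>%
  mutate(stats = purrr::map(splits, function(split) mpg_stats(analysis(split))))

# One interval per combination of `term` and `.am`.
car_ci <- int_pctl(car_stats, statistics = stats, alpha = 0.1)
car_ci
```

With two statistics and two transmission types, the result should contain one row for each of the four combinations.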
Tidyverse developer day
At the tidyverse developer day after posit::conf, rsample got a lot of love in the form of contributions by various community members. People improved documentation and examples, moved deprecations along, tightened checks to support good practice, and upgraded errors and warnings, both in style and content. None of these changes are flashy new features, but all of them are essential to rsample working well!
For example, leave-one-out (LOO) cross-validation is not a great choice of resampling scheme in most situations. From Tidy Modeling with R:
For anything but pathologically small samples, LOO is computationally excessive, and it may not have good statistical properties.
It was possible, however, to create implicit LOO samples by using vfold_cv() with the number of folds set to the number of rows in the data. With a dev day contribution, this now errors:
vfold_cv(mtcars, v = nrow(mtcars))
#> Error in `vfold_cv()`:
#> ! Leave-one-out cross-validation is not supported by this function.
#> ✖ You set `v` to `nrow(data)`, which would result in a leave-one-out
#> cross-validation.
#> ℹ Use `loo_cv()` in this case.
This is to make users pause and consider if this is a good choice for their dataset. If you require LOO, you can still use loo_cv().
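As a small sketch of the two routes, the snippet below creates explicit LOO resamples with loo_cv() and regular folds with vfold_cv(). The choice of mtcars and of 10 folds is arbitrary; the last call assumes rsample 1.3.0 or later and wraps the implicit LOO request in try() so that the script keeps running.

```r
library(rsample)

# Explicit leave-one-out resampling: one resample per row of the data.
loo <- loo_cv(mtcars)
nrow(loo)

# V-fold cross-validation with fewer folds than rows is unaffected.
set.seed(3)
folds <- vfold_cv(mtcars, v = 10)
nrow(folds)

# With rsample >= 1.3.0, requesting implicit LOO should signal an error;
# try() captures it instead of stopping the script.
implicit_loo <- try(vfold_cv(mtcars, v = nrow(mtcars)), silent = TRUE)
inherits(implicit_loo, "try-error")
```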
Error messages in general have been a focus of ours across various tidymodels packages, and rsample is no exception. We opened a bunch of issues to tackle all of rsample, and all of them got closed! Some of these changes are purely internal, upgrading manual formatting to let the cli package do the work. While the error message in most cases doesn’t look different, the formatting is now a great deal more consistent.
For some error messages, the additional functionality in cli makes it easy to improve readability. This error message used to be one block of text; now it comes as three bullet points.
permutations(mtcars, everything())
#> Error in `permutations()`:
#> ! You have selected all columns to permute.
#> ℹ This effectively reorders the rows in the original data without changing the
#> data structure.
#> → Please select fewer columns to permute.
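To give a flavor of how such a bulleted message can be written with cli, here is a minimal sketch. The helper check_not_all_columns() is hypothetical and not rsample’s actual implementation; it only illustrates how names on the message vector ("i" and ">") map to the bullet styles shown above.

```r
library(cli)

# Hypothetical helper, not part of rsample: errors when every column of
# `data` has been selected.
check_not_all_columns <- function(selected, data) {
  if (setequal(selected, names(data))) {
    cli_abort(c(
      # The unnamed first element becomes the main "!" line.
      "You have selected all columns to permute.",
      # "i" adds an informational bullet, ">" an arrow bullet.
      "i" = "This effectively reorders the rows in the original data without changing the data structure.",
      ">" = "Please select fewer columns to permute."
    ))
  }
  invisible(selected)
}

# Capture the error instead of stopping the script, then print its message.
err <- tryCatch(
  check_not_all_columns(names(mtcars), mtcars),
  error = function(e) e
)
cat(conditionMessage(err), "\n")
```

Because cli handles wrapping and bullet symbols, the calling code only has to supply the individual pieces of the message.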
Changes like these are super helpful to users and developers alike. A big thank you to all the contributors!
Acknowledgements
Many thanks to all the people who contributed to rsample since the last release!
@agmurray, @brshallo, @ccani007, @dicook, @Dpananos, @EmilHvitfeldt, @gaborcsardi, @gregor-fausto, @hfrick, @JamesHWade, @jttoivon, @krz, @laurabrianna, @malcolmbarrett, @MatthieuStigler, @msberends, @nmercadeb, @PriKalra, @seb09, @simonpcouch, @topepo, @ZWael, and @zz77zz.