We’re indubitably amped to announce the release of tune 1.2.0, a package for hyperparameter tuning in the tidymodels framework.
You can install it from CRAN, along with the rest of the core packages in tidymodels, using the tidymodels meta-package:
install.packages("tidymodels")
The 1.2.0 release of tune introduces support for two major features that we've already written about on the tidyverse blog.
While those features got their own blog posts, there are several more features in this release that we thought were worth calling out. This post will highlight improvements to our support for parallel processing, the introduction of support for percentile confidence intervals for performance metrics, and a few other bits and bobs. You can see a full list of changes in the release notes.
Throughout this post, I’ll refer to the example of tuning an XGBoost model to predict the fuel efficiency of various car models. I hear this is already a well-explored modeling problem, but alas:
library(tidymodels)

set.seed(2024)
xgb_res <-
  tune_grid(
    boost_tree(mode = "regression", mtry = tune(), learn_rate = tune()),
    mpg ~ .,
    bootstraps(mtcars),
    control = control_grid(save_pred = TRUE)
  )
Note that we've used the control option save_pred = TRUE to indicate that we want to save the predictions from our resampled models in the tuning results. Both int_pctl() and compute_metrics() below will need those predictions. The metrics for our resampled model look like so:
collect_metrics(xgb_res)
#> # A tibble: 20 × 8
#> mtry learn_rate .metric .estimator mean n std_err .config
#> <int> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
#> 1 2 0.00204 rmse standard 19.7 25 0.262 Preprocessor1_Model01
#> 2 2 0.00204 rsq standard 0.659 25 0.0314 Preprocessor1_Model01
#> 3 6 0.00859 rmse standard 18.0 25 0.260 Preprocessor1_Model02
#> 4 6 0.00859 rsq standard 0.607 25 0.0270 Preprocessor1_Model02
#> 5 3 0.0276 rmse standard 14.0 25 0.267 Preprocessor1_Model03
#> 6 3 0.0276 rsq standard 0.710 25 0.0237 Preprocessor1_Model03
#> # ℹ 14 more rows
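As an aside, since the predictions were saved, you can peek at them directly with collect_predictions(); this is the raw material that int_pctl() and compute_metrics() will work from:
collect_predictions(xgb_res)
# returns one row per held-out prediction, resample, and candidate model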
Modernized support for parallel processing
The tidymodels framework has long supported evaluating models in parallel using the foreach package. This release of tune has introduced support for parallelism using the futureverse framework, and we will begin deprecating our support for foreach in a coming release.
To tune a model in parallel with foreach, a user would load a parallel backend package (usually with a name like library(doBackend)) and then register it with foreach (with a function call like registerDoBackend()). The tune package would then detect the registered backend and take it from there. For example, the code to distribute the above tuning process across 10 cores with foreach would look like:
library(doMC)
registerDoMC(cores = 10)

set.seed(2024)
xgb_res <-
  tune_grid(
    boost_tree(mode = "regression", mtry = tune(), learn_rate = tune()),
    mpg ~ .,
    bootstraps(mtcars),
    control = control_grid(save_pred = TRUE)
  )
The code to do so with future is similarly simple. Users first load the future package and then specify a plan(), which dictates how computations will be distributed. For example, the code to distribute the above tuning process across 10 cores with future looks like:
library(future)
plan(multisession, workers = 10)

set.seed(2024)
xgb_res <-
  tune_grid(
    boost_tree(mode = "regression", mtry = tune(), learn_rate = tune()),
    mpg ~ .,
    bootstraps(mtcars),
    control = control_grid(save_pred = TRUE)
  )
For users, the transition to parallelism with future has several benefits:
- The futureverse presently supports a greater number of parallelism technologies and has been more likely to receive implementations for new ones.
- Once foreach is fully deprecated, users will be able to use the interactive logger when tuning in parallel.
From our perspective, transitioning our parallelism support to future makes our packages much more maintainable, reducing complexity in random number generation, error handling, and progress reporting.
In an upcoming release of the package, you'll see a deprecation warning when a foreach parallel backend is registered but no future plan has been specified, so start transitioning your code sooner rather than later! A few common plan() options are sketched below.
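If you're unsure which plan to choose, the future package offers several; the following is a minimal sketch, where the worker counts and host names are placeholders for your own setup:
library(future)

# run workers in separate R sessions on the local machine
plan(multisession, workers = 10)

# or, on Unix-alikes, fork the current R session
plan(multicore, workers = 10)

# or distribute the work across other machines (host names are placeholders)
plan(cluster, workers = c("node1", "node2"))

# return to sequential processing when you're done
plan(sequential)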
Percentile confidence intervals
Following up on changes in the most recent rsample release, tune has introduced a method for int_pctl() that calculates percentile confidence intervals for performance metrics. To calculate a 90% confidence interval for the values of each performance metric returned in collect_metrics(), we'd write:
set.seed(2024)
int_pctl(xgb_res, alpha = .1)
#> # A tibble: 20 × 8
#> .metric .estimator .lower .estimate .upper .config mtry learn_rate
#> <chr> <chr> <dbl> <dbl> <dbl> <chr> <int> <dbl>
#> 1 rmse bootstrap 18.1 19.9 22.0 Preprocessor1_Mod… 2 0.00204
#> 2 rsq bootstrap 0.570 0.679 0.778 Preprocessor1_Mod… 2 0.00204
#> 3 rmse bootstrap 16.6 18.3 19.9 Preprocessor1_Mod… 6 0.00859
#> 4 rsq bootstrap 0.548 0.665 0.765 Preprocessor1_Mod… 6 0.00859
#> 5 rmse bootstrap 12.5 14.1 15.9 Preprocessor1_Mod… 3 0.0276
#> 6 rsq bootstrap 0.622 0.720 0.818 Preprocessor1_Mod… 3 0.0276
#> # ℹ 14 more rows
Note that the output has the same number of rows as the collect_metrics() output: one for each unique pair of metric and model configuration.
This is especially helpful for validation sets. Other resampling methods generate replicated performance statistics, so we can compute simple interval estimates from their mean and standard error. A validation set produces only one estimate, though, so these bootstrap intervals are probably the best option for obtaining interval estimates in that case.
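As a rough sketch of that validation-set workflow, assuming rsample's initial_validation_split() and validation_set() on the same example data:
set.seed(2024)
car_split <- initial_validation_split(mtcars)
car_val <- validation_set(car_split)

val_res <-
  tune_grid(
    boost_tree(mode = "regression", mtry = tune(), learn_rate = tune()),
    mpg ~ .,
    car_val,
    control = control_grid(save_pred = TRUE)
  )

# bootstrap the saved validation-set predictions to get interval estimates
set.seed(2024)
int_pctl(val_res, alpha = .1)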
Breaking change: relocation of ellipses
We've made a breaking change to the argument order of several functions in the package (and in downstream packages like finetune and workflowsets). Ellipses (...) are now used consistently across the package to require that optional arguments be named. For functions that previously had unused ellipses at the end of their signatures, the ellipses have been moved to follow the last argument without a default value, and several functions that previously had no ellipses in their signatures have gained them. This applies to methods for augment(), collect_predictions(), collect_metrics(), select_best(), show_best(), and conf_mat_resampled().
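In practice, this means that code passing those optional arguments by position now needs to name them. A minimal sketch with show_best() (check each function's documentation for its exact signature):
# previously, this may have worked by position:
# show_best(xgb_res, "rmse", 3)

# now, optional arguments after the ellipses must be named:
show_best(xgb_res, metric = "rmse", n = 3)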
Compute new metrics without re-fitting
We've also added a new function, compute_metrics(), that allows for calculating metrics that were not used when evaluating against resamples. For example, consider our xgb_res object. Since we didn't supply any metrics to evaluate, and this model is a regression model, tidymodels selected RMSE and R² as defaults:
collect_metrics(xgb_res)
#> # A tibble: 20 × 8
#> mtry learn_rate .metric .estimator mean n std_err .config
#> <int> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
#> 1 2 0.00204 rmse standard 19.7 25 0.262 Preprocessor1_Model01
#> 2 2 0.00204 rsq standard 0.659 25 0.0314 Preprocessor1_Model01
#> 3 6 0.00859 rmse standard 18.0 25 0.260 Preprocessor1_Model02
#> 4 6 0.00859 rsq standard 0.607 25 0.0270 Preprocessor1_Model02
#> 5 3 0.0276 rmse standard 14.0 25 0.267 Preprocessor1_Model03
#> 6 3 0.0276 rsq standard 0.710 25 0.0237 Preprocessor1_Model03
#> # ℹ 14 more rows
In the past, if you wanted to evaluate that workflow against a performance metric that you hadn't included in your tune_grid() run, you'd need to re-run tune_grid(), fitting models and predicting new values all over again. Now, using the compute_metrics() function, you can take the tune_grid() output you've already generated and compute any number of new metrics without fitting any more models, as long as you used the control option save_pred = TRUE when tuning.
So, say I want to additionally calculate the Huber loss and the mean absolute percent error. I just pass those metrics along with the tuning result to compute_metrics(), and the result looks just like the collect_metrics() output for the originally calculated metrics:
compute_metrics(xgb_res, metric_set(huber_loss, mape))
#> # A tibble: 20 × 8
#> mtry learn_rate .metric .estimator mean n std_err .config
#> <int> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
#> 1 2 0.00204 huber_loss standard 18.3 25 0.232 Preprocessor1_Mode…
#> 2 2 0.00204 mape standard 94.4 25 0.0685 Preprocessor1_Mode…
#> 3 6 0.00859 huber_loss standard 16.7 25 0.229 Preprocessor1_Mode…
#> 4 6 0.00859 mape standard 85.7 25 0.178 Preprocessor1_Mode…
#> 5 3 0.0276 huber_loss standard 12.6 25 0.230 Preprocessor1_Mode…
#> 6 3 0.0276 mape standard 64.4 25 0.435 Preprocessor1_Mode…
#> # ℹ 14 more rows
Easily pivot resampled metrics
Finally, the collect_metrics() method for tune results recently gained a new argument, type, indicating the shape of the returned metrics. The default, type = "long", is the same shape as before. Setting type = "wide" gives each metric its own column, making it easier to compare metrics across different models.
collect_metrics(xgb_res, type = "wide")
#> # A tibble: 10 × 5
#> mtry learn_rate .config rmse rsq
#> <int> <dbl> <chr> <dbl> <dbl>
#> 1 2 0.00204 Preprocessor1_Model01 19.7 0.659
#> 2 6 0.00859 Preprocessor1_Model02 18.0 0.607
#> 3 3 0.0276 Preprocessor1_Model03 14.0 0.710
#> 4 2 0.0371 Preprocessor1_Model04 12.3 0.728
#> 5 5 0.00539 Preprocessor1_Model05 18.8 0.595
#> 6 9 0.0110 Preprocessor1_Model06 17.4 0.577
#> # ℹ 4 more rows
Under the hood, this is indeed just a pivot_wider() call. We've found that it's time-consuming and error-prone to programmatically determine the identifying columns when pivoting resampled metrics, so we've localized and thoroughly tested the code we use to do so with this feature.
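For intuition, the wide format above can be roughly reproduced from the long metrics with a pivot; a sketch (the actual method determines the identifying columns for you):
collect_metrics(xgb_res) |>
  dplyr::select(mtry, learn_rate, .config, .metric, mean) |>
  tidyr::pivot_wider(names_from = .metric, values_from = mean)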
More love for the Brier score
Tuning and resampling functions use default metrics when the user does not specify a custom metric set. For regression models, these are RMSE and R². For classification models, the defaults were accuracy and the area under the ROC curve; we've now added the Brier score to the default classification metric list.
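In other words, leaving the metrics argument unset for a classification model is now roughly equivalent to supplying a metric set like this one (brier_class() is yardstick's Brier score for class probabilities):
# approximately the new default metrics for classification models
metric_set(accuracy, roc_auc, brier_class)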
Acknowledgements
As always, we’re appreciative of the community contributors who helped make this release happen: @AlbertoImg, @dramanica, @epiheather, @joranE, @jrosell, @jxu, @kbodwin, @kenraywilliams, @KJT-Habitat, @lionel-, @marcozanotti, @MasterLuke84, @mikemahoney218, @PathosEthosLogos, and @Peter4801.