Q1 2025 tidymodels digest

  tidymodels

  Max Kuhn

The tidymodels framework is a collection of R packages for modeling and machine learning using tidyverse principles.

Since the beginning of 2021, we have been publishing quarterly updates here on the tidyverse blog summarizing what’s new in the tidymodels ecosystem. The purpose of these regular posts is to share useful new features and any updates you may have missed. You can check out the tidymodels tag to find all of our tidymodels blog posts, including our roundup posts as well as those that are more focused.

We’ve sent a steady stream of tidymodels packages to CRAN recently. We usually release them in batches since many of our packages are tightly coupled with one another. Internally, this process is referred to as the “cascade” of CRAN submissions.

This post will update you on which packages have changed and the major improvements you should know about.

Here’s a list of the packages and their News sections:

Let’s look at a few specific updates.

Improvements in errors and warnings

A group effort was made to improve our error and warning messages across many packages. This started with an internal “upkeep week” (which ended up being 3-4 weeks) and concluded at the Tidy Dev Day in Seattle after posit::conf(2024).

The goal was to use new tools in the cli and rlang packages to make messages more informative than they used to be. For example, using:

tidy(pca_extract_trained, number = 3, type = "variances")

used to result in the error message:

Error in `match.arg()`:
! 'arg' should be one of "coef", "variance"

The new system references the function that you called and not the underlying base R function that actually errored. It also suggests a solution:

Error in `tidy()`:
! `type` must be one of "coef" or "variance", not "variances".
i Did you mean "variance"?
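For the curious, the pattern behind this message is rlang::arg_match(), which validates an argument against its candidate values and suggests near-misses. A minimal sketch (not parsnip’s actual source):

tidy_variances <- function(type = c("coef", "variance")) {
  # arg_match() errors from the caller's frame and suggests close matches
  type <- rlang::arg_match(type)
  type
}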

The rlang package provides a set of standalone files that contain high-quality type checkers and related functions. These also improve the information that users get from an error. For example, using an inappropriate formula value in fit(linear_reg(), "boop", mtcars), the old message was:

Error in `fit()`:
! The `formula` argument must be a formula, but it is a <character>.

and now you see:

Error in `fit()`:
! `formula` must be a formula, not the string "boop".
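These checkers ship as rlang standalone files that packages copy into their own source. A hedged sketch of how a package might use one of them:

# In a package, import the checkers once via:
# usethis::use_standalone("r-lib/rlang", "types-check")

safe_fit <- function(formula) {
  # check_formula() emits the friendly type description shown above
  check_formula(formula)
  formula
}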

This was a lot of work, and we still aren’t finished. Two events helped us get as far as we did.

First, Simon Couch made the chores package (its previous name was “pal”), which enabled us to use AI tools to solve small-scope problems, such as converting old rlang error code to use the new cli syntax. I can’t overstate how much of a speed-up this was for us.

Second, at developer day, many external folks pitched in to make pull requests from a list of issues:

Organizing Tidy Dev Day issues.

I love these sessions for many reasons, but mostly because we meet users and contributors to our packages in person and work with them on specific tasks.

There is a lot more to do here; we have a lot of secondary packages that would benefit from these improvements too.

Quantile regression in parsnip

One big update in parsnip was a new model mode: "quantile regression". Daniel McDonald and Ryan Tibshirani largely provided the impetus for this work, based on their disease modeling framework.

You can generate quantile predictions by first creating a model specification, which includes the quantiles that you want to predict:

library(tidymodels)
tidymodels_prefer()

ames <- 
  modeldata::ames |> 
  mutate(Sale_Price = log10(Sale_Price)) |> 
  select(Sale_Price, Latitude)

quant_spec <- 
  linear_reg() |> 
  set_engine("quantreg") |> 
  set_mode("quantile regression", quantile_levels = c(0.1, 0.5, 0.9))
quant_spec
## Linear Regression Model Specification (quantile regression)
## 
## Computational engine: quantreg
## Quantile levels: 0.1, 0.5, and 0.9.

We’ll add some spline terms via a recipe and fit the model:

spline_rec <- 
  recipe(Sale_Price ~ ., data = ames) |> 
  step_spline_natural(Latitude, deg_free = 10)

quant_fit <- 
  workflow(spline_rec, quant_spec) |> 
  fit(data = ames)

quant_fit
## ══ Workflow [trained] ═════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ───────────────────────────────────────────────────────
## 1 Recipe Step
## 
## • step_spline_natural()
## 
## ── Model ──────────────────────────────────────────────────────────────
## Call:
## quantreg::rq(formula = ..y ~ ., tau = quantile_levels, data = data)
## 
## Coefficients:
##               tau= 0.1    tau= 0.5    tau= 0.9
## (Intercept) 4.71981123  5.07728741  5.25221335
## Latitude_01 1.22409173  0.70928577  0.79000849
## Latitude_02 0.19561816  0.04937750  0.02832633
## Latitude_03 0.16616065  0.02045910  0.14730573
## Latitude_04 0.30583648  0.08489487  0.15595080
## Latitude_05 0.21663212  0.02016258 -0.01110625
## Latitude_06 0.33541228  0.12005254  0.03006777
## Latitude_07 0.47732205  0.09146728  0.17394021
## Latitude_08 0.24028784  0.30450058  0.26144584
## Latitude_09 0.05840312 -0.14733781 -0.11911843
## Latitude_10 1.52800673  0.95994216  1.21750501
## 
## Degrees of freedom: 2930 total; 2919 residual

For prediction, tidymodels always returns a data frame with as many rows as the input data set (here: ames). The result for quantile predictions is a special vctrs class:

quant_pred <- predict(quant_fit, ames) 
quant_pred |> slice(1:4)
## # A tibble: 4 × 1
##   .pred_quantile
##        <qtls(3)>
## 1         [5.33]
## 2         [5.33]
## 3         [5.33]
## 4         [5.31]
class(quant_pred$.pred_quantile)
## [1] "quantile_pred" "vctrs_vctr"    "list"

In this printed format, the bracketed value (e.g., [5.31]) shows the middle quantile, here the 50% level.

We can expand the set of quantile predictions so that there are three rows for each source row in ames. There’s also an integer column called .row that lets us merge the predictions back with the source data:

quant_pred$.pred_quantile[1]
## <quantiles[1]>
## [1] [5.33]
## # Quantile levels: 0.1 0.5 0.9
as_tibble(quant_pred$.pred_quantile[1])
## # A tibble: 3 × 3
##   .pred_quantile .quantile_levels  .row
##            <dbl>            <dbl> <int>
## 1           5.08              0.1     1
## 2           5.33              0.5     1
## 3           5.52              0.9     1

Here are the predicted quantile values:

quant_pred$.pred_quantile |> 
  as_tibble() |> 
  full_join(ames |> add_rowindex(), by = ".row") |> 
  arrange(Latitude) |> 
  ggplot(aes(x = Latitude)) + 
  geom_point(data = ames, aes(y = Sale_Price), alpha = 1 / 5) +
  geom_line(aes(y = .pred_quantile, col = format(.quantile_levels)), 
            show.legend = FALSE, linewidth = 1.5) 
10%, 50%, and 90% quantile predictions.

For now, the new mode does not have many engines. We need to implement some performance statistics in the yardstick package before integrating these models into the whole tidymodels ecosystem.

In other news, we’ve added additional neural network models based on improvements in the brulee package. Namely, you can now fit and tune two-hidden-layer feed-forward networks on tabular data (using torch).
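As a hedged sketch, this surfaces as a new mlp() engine (we’re assuming the engine name brulee_two_layer here; the size of the second layer is passed as an engine argument):

two_layer_spec <- 
  mlp(hidden_units = tune(), penalty = tune(), epochs = 100) |> 
  set_engine("brulee_two_layer", hidden_units_2 = tune()) |> 
  set_mode("regression")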

One other improvement has been simmering for a long time: the ability to exploit sparse data structures better. We’ve improved our fit() interfaces for the few model engines that can use sparsely encoded data. There is much more to come on this in a few months, especially around recipes, so stay tuned.
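As an example of what already works, engines such as xgboost can consume a sparse matrix through fit_xy(). A minimal sketch with simulated data (assuming the engine’s sparse support):

library(Matrix)

# simulated sparse predictors and a numeric outcome
x <- rsparsematrix(nrow = 200, ncol = 10, density = 0.1)
colnames(x) <- paste0("x", 1:10)
y <- rnorm(200)

sparse_fit <- 
  boost_tree() |> 
  set_engine("xgboost") |> 
  set_mode("regression") |> 
  fit_xy(x = x, y = y)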

Finally, we’ve created a set of checklists that can be used when creating new models or engines. These are very helpful, even for us, since there is a lot of minutiae to remember.

Parallelism in tune

This was a small maintenance release mostly related to parallel processing. Up to now, tune facilitated parallelism using the foreach package. That package is mature but not actively developed, so we have been slowly moving toward using the future package(s).

The first step in this journey was to keep using foreach internally (while leaning toward future) and to encourage users to stop invoking foreach directly and, instead, load and use the future package.

We’re now moving folks into the second stage. tune will now raise a warning when:

  • A parallel backend has been registered with foreach, and
  • No plan() has been specified with future.

This will allow users to transition their existing code to only future and allow us to update existing documentation and training materials.
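In practice, the user-facing change is small (a sketch; the worker count is arbitrary):

# old: register a foreach backend
# doParallel::registerDoParallel(cores = 4)

# new: declare a future plan instead
library(future)
plan(multisession, workers = 4)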

We anticipate that the third stage, removing foreach entirely, will occur sometime before posit::conf(2025) in September.

Things to look forward to

We are working hard on a few major initiatives that we plan on showing off at posit::conf(2025).

First is integrated support for sparse data. The emphasis is on “data” because users can use a data frame of sparse vectors or the usual sparse matrix format. This is a big deal because it does not force you to convert non-numeric data into a numeric matrix format. Again, we’ll discuss this more in the future, but you should be able to use sparse data frames in parsnip, recipes, tune, etc.
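Under the hood, this builds on sparse vector classes from the sparsevctrs package; a hedged sketch of a sparse column living in an ordinary tibble:

library(sparsevctrs)

# a length-10 double vector with nonzero values only at positions 2 and 5
x <- sparse_double(values = c(1, 3), positions = c(2, 5), length = 10)
tibble(x = x)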

The second initiative is the longstanding goal of adding postprocessing to tidymodels. Just as you can add a preprocessor to a model workflow, you will be able to add a set of postprocessing adjustments to the predictions your model generates. See our previous post for a sneak peek.
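A speculative sketch using the in-development tailor package (names and API may change before release):

library(tailor)

# adjust hard class predictions by moving the probability threshold
post <- 
  tailor() |> 
  adjust_probability_threshold(threshold = 0.4)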

Finally, this year’s summer internship focuses on supervised feature selection methods. We’ll also have releases (and probably another package) for these tools.

These should come to fruition (and CRAN) before or around August 2025.

Acknowledgements

We want to sincerely thank everyone who contributed to these packages since their previous versions:

@AlbertoImg, @asb2111, @balraadjsings, @bcjaeger, @beansrowning, @BrennanAntone, @cheryldietrich, @chillerb, @conarr5, @corybrunson, @dajmcdon, @davidrsch, @Edgar-Zamora, @EmilHvitfeldt, @gaborcsardi, @gimholte, @grantmcdermott, @grouptheory, @hfrick, @ilaria-kode, @JamesHWade, @jesusherranz, @jkylearmstrong, @joranE, @joscani, @Joscelinrocha, @josho88, @joshuagi, @JosiahParry, @jrosell, @jrwinget, @KarlKoe, @kscott-1, @lilykoff, @lionel-, @LouisMPenrod, @luisDVA, @marcelglueck, @marcozanotti, @martaalcalde, @mattwarkentin, @mihem, @mitchellmanware, @naokiohno, @nhward, @npelikan, @obgeneralao, @owenjonesuob, @pbhogale, @Peter4801, @pgg1309, @reisner, @rfsaldanha, @rkb965, @RobLBaker, @RodDalBen, @SantiagoD999, @shum461, @simonpcouch, @szimmer, @talegari, @therealjpetereit, @topepo, @walkerjameschris, and @ZWael