The tidymodels framework is a collection of R packages for modeling and machine learning using tidyverse principles.
Since the beginning of 2021, we have been publishing
quarterly updates here on the tidyverse blog summarizing what’s new in the tidymodels ecosystem. The purpose of these regular posts is to share useful new features and any updates you may have missed. You can check out the tidymodels tag to find all tidymodels blog posts, including our roundup posts as well as those that are more focused, like these posts from the past couple of months:
- Tuning hyperparameters with tidymodels is a delight
- censored 0.2.0
- The tidymodels is getting a whole lot faster
Since our last roundup post, there have been CRAN releases of 24 tidymodels packages. Here are links to their NEWS files:
- agua (0.1.2)
- baguette (1.0.1)
- broom (1.0.4)
- butcher (0.3.2)
- censored (0.2.0)
- dials (1.2.0)
- discrim (1.0.1)
- embed (1.1.0)
- finetune (1.1.0)
- hardhat (1.3.0)
- modeldata (1.1.0)
- parsnip (1.1.0)
- recipes (1.0.6)
- rules (1.0.2)
- spatialsample (0.3.0)
- stacks (1.0.2)
- textrecipes (1.0.3)
- themis (1.0.1)
- tidyclust (0.1.2)
- tidypredict (0.5)
- tune (1.1.1)
- workflows (1.1.3)
- workflowsets (1.0.1)
- yardstick (1.2.0)
We’ll highlight a few especially notable changes below: more informative errors and faster code. First, loading the collection of packages:
library(tidymodels)
library(embed)
data("ames", package = "modeldata")
More informative errors
In the last few months, we have been focused on refining error messages so that it is easier for users to pinpoint what went wrong and where. Since a modeling pipeline can be quite complicated, uninformative errors are a no-go.
Across the tidymodels packages, error messages now indicate the user-facing function that caused the error rather than the internal function it came from.
In dials, for example, an error that used to look like this:
degree(range = c(1L, 5L))
#> Error in `new_quant_param()`:
#> ! Since `type = 'double'`, please use that data type for the range.
now says that the error came from degree() rather than new_quant_param():
degree(range = c(1L, 5L))
#> Error in `degree()`:
#> ! Since `type = 'double'`, please use that data type for the range.
The same improvement can be seen with the yardstick metrics. This code:
mtcars |>
accuracy(vs, am)
#> Error in `dplyr::summarise()`:
#> ℹ In argument: `.estimate = metric_fn(truth = vs, estimate = am, na_rm =
#> na_rm)`.
#> Caused by error in `validate_class()`:
#> ! `truth` should be a factor but a numeric was supplied.
now errors much more informatively:
mtcars |>
accuracy(vs, am)
#> Error in `accuracy()`:
#> ! `truth` should be a factor, not a `numeric`.
Lastly, one of the biggest improvements came in recipes, which now reports which step caused an error instead of just saying that it happened in prep() or bake(). This is a huge improvement, since preprocessing pipelines often string together many steps.
Before:
recipe(~., data = ames) |>
step_novel(Neighborhood, new_level = "Gilbert") |>
prep()
#> Error in `prep()`:
#> ! Columns already contain the new level: Neighborhood
Now:
recipe(~., data = ames) |>
step_novel(Neighborhood, new_level = "Gilbert") |>
prep()
#> Error in `step_novel()`:
#> Caused by error in `prep()` at recipes/R/recipe.R:437:8:
#> ! Columns already contain the new level: Neighborhood
Especially when calls to recipes functions are deeply nested inside the call stack, as in fit_resamples() or tune_grid(), these changes make a big difference.
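To make that concrete, here is a minimal sketch of our own (the model, workflow, and resampling setup are illustrative, not from the examples above) that buries the failing recipe from earlier inside a resampling pipeline:
# Illustrative sketch: the failing step_novel() recipe from above,
# now nested inside a resampling pipeline
rec <- recipe(Sale_Price ~ ., data = ames) |>
  step_novel(Neighborhood, new_level = "Gilbert")

wf <- workflow() |>
  add_recipe(rec) |>
  add_model(linear_reg())

folds <- vfold_cv(ames, v = 5)

# tune records the failure for each resample; its notes now point at
# step_novel() rather than at an internal prep() call
fit_resamples(wf, resamples = folds)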
Things are getting faster
As we have written about in The tidymodels is getting a whole lot faster and Writing performant code with tidy tools, we have been working on tightening up the performance of the tidymodels code. These changes are mostly related to infrastructure code, meaning that the speedup comes from reducing overhead, bringing total fit times closer to those of the underlying implementations.
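If you are curious what this overhead looks like on your own machine, here is a hedged sketch of timing a single parsnip fit (the formula and iteration count are arbitrary, and absolute numbers will vary by machine and package versions):
# Illustrative benchmark, not from the post: time a simple parsnip fit.
# Most of the time per iteration is spent in the underlying lm() call
# rather than in tidymodels infrastructure.
bench::mark(
  fit(linear_reg(), Sale_Price ~ Longitude + Latitude, data = ames),
  iterations = 10,
  check = FALSE
)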
A different kind of speedup comes with the new step_pca_truncated() step in the embed package.
Principal Component Analysis is a really powerful and fast method for dimensionality reduction of large data sets. However, for data with many columns, it can be computationally expensive to calculate all the principal components.
step_pca_truncated() works in much the same way as step_pca(), but it only calculates the number of components it needs:
# standard PCA: computes all principal components, then keeps 3
pca_normal <- recipe(Sale_Price ~ ., data = ames) |>
  step_dummy(all_nominal_predictors()) |>
  step_pca(all_numeric_predictors(), num_comp = 3)

# truncated PCA: computes only the 3 requested components
pca_truncated <- recipe(Sale_Price ~ ., data = ames) |>
  step_dummy(all_nominal_predictors()) |>
  step_pca_truncated(all_numeric_predictors(), num_comp = 3)
tictoc::tic()
prep(pca_normal) |> bake(ames)
#> # A tibble: 2,930 × 4
#> Sale_Price PC1 PC2 PC3
#> <int> <dbl> <dbl> <dbl>
#> 1 215000 -31793. 4151. -197.
#> 2 105000 -12198. -611. -524.
#> 3 172000 -14911. -265. 7568.
#> 4 244000 -12072. -1813. 918.
#> 5 189900 -14418. -345. -302.
#> 6 195500 -10704. -1367. -204.
#> 7 213500 -5858. -2805. 114.
#> 8 191500 -5932. -2762. 131.
#> 9 236500 -6368. -2862. 325.
#> 10 189000 -8368. -2219. 126.
#> # ℹ 2,920 more rows
tictoc::toc()
#> 0.782 sec elapsed
tictoc::tic()
prep(pca_truncated) |> bake(ames)
#> # A tibble: 2,930 × 4
#> Sale_Price PC1 PC2 PC3
#> <int> <dbl> <dbl> <dbl>
#> 1 215000 -31793. 4151. -197.
#> 2 105000 -12198. -611. -524.
#> 3 172000 -14911. -265. 7568.
#> 4 244000 -12072. -1813. 918.
#> 5 189900 -14418. -345. -302.
#> 6 195500 -10704. -1367. -204.
#> 7 213500 -5858. -2805. 114.
#> 8 191500 -5932. -2762. 131.
#> 9 236500 -6368. -2862. 325.
#> 10 189000 -8368. -2219. 126.
#> # ℹ 2,920 more rows
tictoc::toc()
#> 0.162 sec elapsed
The speedup will be orders of magnitude larger for very wide data.
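To get a feel for that claim, here is a rough sketch of our own (the 100 × 5,000 dimensions are arbitrary) using a short, very wide matrix of random noise:
# Illustrative only: 100 rows by 5,000 columns of random noise
set.seed(1234)
wide <- as.data.frame(matrix(rnorm(100 * 5000), nrow = 100))

# step_pca() would compute all min(nrow, ncol) = 100 components here;
# step_pca_truncated() computes only the 3 requested
rec_wide <- recipe(~., data = wide) |>
  step_pca_truncated(all_numeric_predictors(), num_comp = 3)

tictoc::tic()
prep(rec_wide) |> bake(wide)
tictoc::toc()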
Acknowledgements
We’d like to thank those in the community who contributed to tidymodels in the last quarter:
- agua: @hfrick, @simonpcouch, and @topepo.
- baguette: @simonpcouch, and @topepo.
- broom: @benwhalley, @dgrtwo, @egosv, @hfrick, @JorisChau, @mccarthy-m-g, @MichaelChirico, @paige-cho, @PoGibas, @rsbivand, @simonpcouch, @ste-tuf, and @victor-vscn.
- butcher: @ashbythorpe, @DavisVaughan, @hfrick, @juliasilge, @rdavis120, @rkb965, and @simonpcouch.
- censored: @brunocarlin, and @hfrick.
- dials: @amin0511ss, @EmilHvitfeldt, @hfrick, and @simonpcouch.
- discrim: @EmilHvitfeldt, and @tomwagstaff-opml.
- embed: @EmilHvitfeldt, @jackobenco016, and @skasowitz.
- finetune: @Freestyleyang, @simonpcouch, and @topepo.
- hardhat: @cregouby, @DavisVaughan, @EmilHvitfeldt, @frank113, and @mikemahoney218.
- modeldata: @EmilHvitfeldt, and @topepo.
- parsnip: @EmilHvitfeldt, @emmafeuer, @exsell-jc, @hfrick, @mariamaseng, @SHo-JANG, @simonpcouch, @topepo, and @Tripartio.
- recipes: @AshesITR, @EmilHvitfeldt, @hfrick, @jjcurtin, @lang-benjamin, @lbui30, @PeterKoffeldt, @rdavis120, @simonpcouch, @StevenWallaert, @tellyshia, @topepo, @ttrodrigz, and @zecojls.
- rules: @EmilHvitfeldt, @hfrick, @jonthegeek, and @topepo.
- spatialsample: @hfrick, @mikemahoney218, and @RaymondBalise.
- stacks: @amin0511ss, @gundalav, @jrosell, @juliasilge, @pbulsink, @rdavis120, and @simonpcouch.
- textrecipes: @apsteinmetz, @EmilHvitfeldt, @gary-mu, @hfrick, and @nipnipj.
- themis: @carlganz, @EmilHvitfeldt, @hfrick, @nipnipj, @rmurphy49, and @rowanjh.
- tidyclust: @EmilHvitfeldt, @hfrick, @hsbadr, @jonthegeek, and @simonpcouch.
- tidypredict: @edgararuiz, and @sdcharle.
- tune: @BenoitLondon, @cphaarmeyer, @hfrick, @jthomasmock, @mrjujas, @MxNl, @nabsiddiqui, @rdavis120, @SHo-JANG, @simonpcouch, @topepo, @walrossker, and @yusuftengriverdi.
- workflows: @simonpcouch.
- workflowsets: @EmilHvitfeldt, @gsimchoni, and @simonpcouch.
- yardstick: @77makr, @burch-cm, @EmilHvitfeldt, @idavydov, @kadyb, @mawardivaz, @mikemahoney218, @moloscripts, @nyambea, @SHo-JANG, @simdadim, and @simonpcouch.
We’re grateful for all of the tidymodels community, from observers to users to contributors. Happy modeling!