tidymodels updates

  tidymodels, recipes, parsnip, yardstick, rsample, corrr, embed, tidypredict

  Max Kuhn, Edgar Ruiz, and Davis Vaughan

We’ve sent a few packages to CRAN recently. Here’s a recap of the changes (and some notes at the bottom):

recipes 0.1.6

Breaking Changes

  • Since 2018, a warning has been issued when the deprecated newdata argument was used in bake(recipe, newdata). The deprecation period is over; new_data is now officially required.
  • Previously, if step_other() did not collapse any levels, it would still add an “other” level to the factor. This would lump new factor levels into “other” when data were baked (as step_novel() does). This no longer occurs, since it was inconsistent with ?step_other, which states: “If no pooling is done the data are unmodified”.

New Operations

  • step_normalize() centers and scales the data (if you are, like Max, too lazy to use two separate steps).
  • step_unknown() will convert missing data in categorical columns to “unknown” and update the factor levels (a sketch of both steps follows this list).
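
As a quick, hedged sketch of the two new steps (the data and column names here are made up for illustration):

```r
library(recipes)

df <- data.frame(
  y  = rnorm(6),
  x1 = c(1, 5, 2, 8, 4, 3),
  x2 = factor(c("a", "b", NA, "a", NA, "b"))
)

rec <- recipe(y ~ ., data = df) %>%
  step_normalize(all_numeric(), -all_outcomes()) %>% # center and scale in one step
  step_unknown(x2)                                   # NA factor values become "unknown"

bake(prep(rec, training = df), new_data = df)
```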

Other Changes

  • If the threshold argument of step_other() is greater than one, it specifies the minimum sample size before the levels of the factor are collapsed into the “other” category (#289). This is sketched after this list.
  • step_knnimpute() can now pass two options to the underlying knn code, including the number of threads (#323).
  • Due to changes in CRAN checks, step_nnmf() only works on R >= 3.6.0 because of dependency issues.
  • step_dummy() and step_other() are now tolerant of cases where their selectors do not capture any columns. In that case, no modifications to the data are made (#290, #348).
  • step_dummy() can now retain the original columns used to make the dummy variables by setting preserve = TRUE (#328).
  • step_other()'s print method only reports the variables with collapsed levels (as opposed to every column that was tested to see if it needed collapsing) (#338).
  • step_pca(), step_kpca(), step_ica(), step_nnmf(), step_pls(), and step_isomap() now accept zero components. In this case, the original data are returned. Please use this with great care.
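
A minimal sketch of the count-based threshold, using made-up data:

```r
library(recipes)

df <- data.frame(
  y = rnorm(100),
  x = factor(sample(letters[1:5], 100, replace = TRUE,
                    prob = c(.4, .3, .2, .07, .03)))
)

rec <- recipe(y ~ x, data = df) %>%
  # a threshold greater than one is read as a minimum count, not a proportion
  step_other(x, threshold = 10)

bake(prep(rec, training = df), new_data = df)
```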

embed 0.0.3

Two new steps were added:

  • step_umap() was added for both supervised and unsupervised encodings (see the sketch below).
  • step_woe() creates weight of evidence encodings. Thanks to Athos Petri Damiani for this.
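
A minimal sketch of step_umap(), here in unsupervised mode (the recipe and num_comp value are chosen purely for illustration):

```r
library(recipes)
library(embed)

rec <- recipe(Species ~ ., data = iris) %>%
  # project the four numeric predictors into a two-dimensional embedding
  step_umap(all_predictors(), num_comp = 2)

bake(prep(rec, training = iris), new_data = iris)
```

For a supervised embedding, the outcome column can be supplied via the step's outcome argument.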

rsample 0.0.5

parsnip 0.0.3.1

Unplanned release based on CRAN requirements for Solaris.

Breaking Changes

  • The method that parsnip uses to store the model information has changed. Any custom models from previous versions will need to use the new method for registering models. The methods are detailed in ?get_model_env and the package vignette for adding models.

  • For models that can be used in more than one mode, the mode now needs to be declared prior to fitting and/or translation (see the sketch after this list).

  • For surv_reg(), the engine that uses the survival package is now called survival instead of survreg.

  • For glmnet models, the full regularization path is always fit regardless of the value given to penalty. Previously, the model was fit by passing penalty to glmnet's lambda argument, and the model could only make predictions at those specific values. (#195)
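
For example, a sketch of declaring the mode up front (the model and data are chosen for illustration):

```r
library(parsnip)

# decision trees can be used for classification or regression, so the
# mode must be declared before fitting or translating the specification
tree_spec <- decision_tree(mode = "classification") %>%
  set_engine("rpart")

tree_fit <- fit(tree_spec, Species ~ ., data = iris)
```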

New Features

  • add_rowindex() can add a column called .row to a data frame.

  • If a computational engine is not explicitly set, a default will be used. Each default is documented on the corresponding model page. A warning is issued at fit time unless verbosity is zero.

  • nearest_neighbor() gained a multi_predict() method (sketched after this list). The multi_predict() documentation is a little better organized.

  • A suite of internal functions was added to help with upcoming model tuning features.

  • A parsnip model object now always saves the name(s) of the outcome variable(s) so that predicted values are named properly.
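
A hedged sketch of the new multi_predict() method for nearest_neighbor() (the engine and parameter values are illustrative):

```r
library(parsnip)

knn_fit <- nearest_neighbor(mode = "regression", neighbors = 5) %>%
  set_engine("kknn") %>%
  fit(mpg ~ ., data = mtcars)

# predictions for several values of `neighbors` from a single fitted model
multi_predict(knn_fit, new_data = mtcars[1:3, ], neighbors = 1:5)
```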

corrr 0.4

New features

  • A new dice() function wraps focus(x, ..., mirror = TRUE).
  • A new retract() function does the opposite of stretch().
  • stretch() gained a remove.dups argument, which removes duplicates without removing all NAs (see the sketch below).
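
A quick sketch of the new verbs (mtcars is used purely for illustration):

```r
library(corrr)

cors <- correlate(mtcars)

dice(cors, mpg, wt, hp)            # shorthand for focus(cors, mpg, wt, hp, mirror = TRUE)
retract(stretch(cors))             # stretch() to long form, retract() back to a cor_df
stretch(cors, remove.dups = TRUE)  # drop duplicated pairs without dropping all NAs
```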

Improvements

  • correlate()'s interface for databases was improved: it now computes only the unique pairs of variables and simplifies the formula that ultimately runs in-database. We also re-added the vignette to the package; it is available on the site as an article.
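
A minimal sketch of running correlate() against a database table, assuming a local SQLite connection:

```r
library(corrr)
library(dplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
db_mtcars <- copy_to(con, mtcars)

# correlations for the unique variable pairs are computed in-database
correlate(db_mtcars)
```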

tidypredict 0.4.3

New models

The new version is now able to parse the following models:

  • cubist(), from the Cubist package
  • ctree(), from the partykit package
  • XGBoost trained models, via the xgboost package

New features

  • Integration with broom's tidy() function; it works with regression models only.
  • Adds support for parsnip fitted models: lm, randomForest, ranger, and earth.
  • Adds an as_parsed_model() function, which adds the proper class components to a list. This allows any model exported in the correct spec to be read in by tidypredict. See the Save Models and Non-R models articles for more information. (A brief sketch follows this list.)
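
A hedged sketch of these features with a simple linear model (the connection here is simulated via dbplyr):

```r
library(tidypredict)

model <- lm(mpg ~ wt + cyl, data = mtcars)

tidypredict_fit(model)                          # R expression that reproduces the predictions
tidypredict_sql(model, dbplyr::simulate_dbi())  # the same formula rendered as SQL

pm <- parse_model(model)   # parsed-model spec that can be saved and reloaded
broom::tidy(pm)            # tidy() integration (regression models only)
```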

Improvements

  • Now supports classification models from ranger

Website

The package’s official website has been expanded greatly. Here are some highlights:

  • An article for each supported model, found under Model List
  • A how-to guide for saving and reloading models
  • A guide to integrating non-R models with tidypredict

yardstick 0.0.4

New Metrics

Two new metrics have been added to yardstick:

  • iic() is a numeric metric for computing the index of ideality of correlation. It is a potential alternative to the traditional correlation coefficient and has been used in QSAR models (#115).
  • average_precision() is a probability metric that can be used as an alternative to pr_auc(). It has the benefit of avoiding any ambiguity in the edge case where recall == 0 and the current number of false positives is 0. Both metrics are sketched after this list.
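
Both can be tried with yardstick's built-in example data sets:

```r
library(yardstick)

# numeric metric: index of ideality of correlation
iic(solubility_test, truth = solubility, estimate = prediction)

# probability metric: an alternative to pr_auc()
average_precision(two_class_example, truth, Class1)
```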

Improvements

  • pr_curve() (and, by extension, pr_auc()) has been greatly improved to better handle edge cases when duplicate class probability values are present. Additionally, the first precision value in the curve is now 1, rather than NA, which results in a more practical curve and generates a more correct AUC value (#93).
  • Each metric function now has a direction attribute, which specifies the direction required for optimization: either minimization or maximization (see the sketch after this list).
  • Documentation for class probability metrics has been improved with more informative examples (#100).
  • mn_log_loss() now uses the min/max rule before computing the log of the estimated probabilities to avoid problematic undefined log values (#103).
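
For example, the direction can be inspected from the metric function itself (a small sketch, assuming the attribute is exposed via attr()):

```r
library(yardstick)

attr(rmse, "direction")     # "minimize"
attr(roc_auc, "direction")  # "maximize"
```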

Upcoming Changes and Directions

We are currently working on two general-use packages: workflows and tune. The former bundles together a recipe, a model object, and other items so that there can be single fit() and predict() methods. tune will have tools for… um… tuning models. We are hoping to make these public in the next month or so.

There will be some changes to accommodate model tuning. The dials package has been refactored substantially (see the current GH master branch), and there were some small interface changes to recipes too (mostly backwards compatible and also on GH). We are pretty close to the end of “Phase I” of our tidymodels work.