tidypredict 1.0.0

  orbital, tidymodels, tidypredict

  Emil Hvitfeldt

We’re tickled pink to announce the release of version 1.0.0 of tidypredict. The main goal of tidypredict is to enable running predictions inside databases. It reads the model, extracts the components needed to calculate the prediction, and then creates an R formula that can be translated into SQL.

You can install it from CRAN with:

install.packages("tidypredict")
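
As a minimal sketch of the workflow described above, the following fits a small linear model, extracts the R expression with tidypredict_fit(), and translates it with tidypredict_sql(). The choice of model and the simulated connection from dbplyr::simulate_dbi() are assumptions made for illustration; in practice you would pass a real database connection.

```r
library(tidypredict)

# Illustrative model; any model type supported by tidypredict would do
model <- lm(mpg ~ wt + cyl, data = mtcars)

# R expression that reproduces the model's prediction
fit_expr <- tidypredict_fit(model)
fit_expr

# Translate the same model to SQL; simulate_dbi() stands in for a
# real database connection (an assumption for this sketch)
sql_expr <- tidypredict_sql(model, dbplyr::simulate_dbi())
sql_expr
```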

This blog post highlights the most important changes in this release, including faster computations for tree-based models, more efficient tree representations, glmnet model support, and a change in how random forests are handled. You can see a full list of changes in the release notes.

Improved output for random forest models

In previous versions of tidypredict, tidypredict_fit() would return a list of expressions, one for each tree, when applied to random forest models. This didn’t align with what is returned for other types of models. In version 1.0.0, this has been changed to produce a single, combined expression that reflects how predictions should be made.

This is technically a breaking change, but one we believe is worthwhile, as it provides a more consistent output for tidypredict_fit() and hides the technical details about how to combine trees from different packages.
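
To illustrate why a single expression is convenient, here is a small sketch that splices the result straight into mutate(). The ranger model, its settings (num.trees, max.depth), and the seed are arbitrary choices made for this example, not recommendations.

```r
library(dplyr)
library(tidypredict)

# Small illustrative forest; the settings are arbitrary and kept tiny
# so the resulting expression stays readable
set.seed(1234)
model <- ranger::ranger(mpg ~ ., data = mtcars, num.trees = 3, max.depth = 2)

# In version 1.0.0 this is one combined expression, not a list of trees
rf_expr <- tidypredict_fit(model)

# A single expression can be spliced directly into mutate()
preds <- mtcars %>%
  mutate(pred = !!rf_expr) %>%
  pull(pred)

head(preds)
```

With the earlier list output, the per-tree expressions had to be combined by hand before they could be used like this.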

Faster parsing of trees

The parsing of xgboost, partykit, and ranger models should now be substantially faster than before. Examples have been shown to be 10 to 200 times faster. Please note that larger models, more trees, and deeper trees still take some time to parse.

More efficient tree expressions

All trees, whether they are a single tree or part of a collection of trees, such as in boosted trees or random forests, are encoded as case_when() statements by tidypredict. Take the following tree as an example.

model <- partykit::ctree(mpg ~ am + cyl, data = mtcars)
model
#> 
#> Model formula:
#> mpg ~ am + cyl
#> 
#> Fitted party:
#> [1] root
#> |   [2] cyl <= 4: 26.664 (n = 11, err = 203.4)
#> |   [3] cyl > 4
#> |   |   [4] cyl <= 6: 19.743 (n = 7, err = 12.7)
#> |   |   [5] cyl > 6: 15.100 (n = 14, err = 85.2)
#> 
#> Number of inner nodes:    2
#> Number of terminal nodes: 3

Previously, this tree would be turned into the following case_when() statement.

case_when(
  cyl <= 4 ~ 26.6636363636364,
  cyl <= 6 & cyl > 4 ~ 19.7428571428571,
  cyl > 6 & cyl > 4 ~ 15.1
)

With this new update, we have taken advantage of the .default argument whenever possible, which should lead to faster predictions, as we no longer need to calculate redundant conditionals.

tidypredict_fit(model)
#> case_when(cyl <= 4 ~ 26.6636363636364, cyl <= 6 & cyl > 4 ~ 19.7428571428571, 
#>     .default = 15.1)
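
As a quick check of the idea, the sketch below evaluates both forms of the statement on mtcars using plain dplyr. The variable names old_style and new_style are made up for this example, and it assumes a dplyr version that supports the .default argument (1.1.0 or later).

```r
library(dplyr)

# Fully spelled-out conditions for every leaf
old_style <- mtcars %>%
  mutate(
    pred = case_when(
      cyl <= 4 ~ 26.6636363636364,
      cyl <= 6 & cyl > 4 ~ 19.7428571428571,
      cyl > 6 & cyl > 4 ~ 15.1
    )
  )

# The last leaf is handled by .default, so its condition is never evaluated
new_style <- mtcars %>%
  mutate(
    pred = case_when(
      cyl <= 4 ~ 26.6636363636364,
      cyl <= 6 & cyl > 4 ~ 19.7428571428571,
      .default = 15.1
    )
  )

# Both forms give the same predictions on this data
identical(old_style$pred, new_style$pred)
```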

Glmnet support

We now support the glmnet package. This package provides generalized linear models with lasso or elasticnet regularization.

The primary restriction when using a glmnet model with tidypredict is that the model must have been fitted with the lambda argument set to a single value.

model <- glmnet::glmnet(mtcars[, -1], mtcars$mpg, lambda = 0.01)

tidypredict_fit(model)
#> 13.0081464696679 + (cyl * -0.0773532164346008) + (disp * 0.00969507138358544) + 
#>     (hp * -0.0192462098902709) + (drat * 0.816753237688302) + 
#>     (wt * -3.41564341709663) + (qsec * 0.758580151032383) + (vs * 
#>     0.277874296242861) + (am * 2.47356523820533) + (gear * 0.645144527527598) + 
#>     (carb * -0.300886812079305)

glmnet() computes a collection of models using many sets of penalty values. This can be very efficient, but for tidypredict, we need to predict with a single penalty. Note how, as we increase the penalty, the extracted expression correctly removes terms with coefficients of 0 instead of leaving them as (disp * 0).

model <- glmnet::glmnet(mtcars[, -1], mtcars$mpg, lambda = 1)

tidypredict_fit(model)
#> 35.3137765116027 + (cyl * -0.871451193824228) + (hp * -0.0101173960249783) + 
#>     (wt * -2.59443677687505)
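
If you don't already have a penalty in mind, one possible approach is to pick it with cross-validation and then refit with that single value. This is only a sketch of one workflow, not something tidypredict prescribes; the use of cv.glmnet(), lambda.min, nfolds = 5, and the seed are all assumptions made for this example.

```r
library(tidypredict)

x <- as.matrix(mtcars[, -1])
y <- mtcars$mpg

# Choose a penalty by cross-validation (illustrative settings)
set.seed(1234)
cv_fit <- glmnet::cv.glmnet(x, y, nfolds = 5)

# Refit with the selected penalty as the single lambda value
model <- glmnet::glmnet(x, y, lambda = cv_fit$lambda.min)

glmnet_expr <- tidypredict_fit(model)
glmnet_expr
```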

tidypredict is used as the primary parser for models employed by the orbital package. This means that all the changes seen in this post also take effect when using orbital with tidymodels workflows, for example when using parsnip::linear_reg() with engine = "glmnet".
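
A rough sketch of what that could look like is shown below. The recipe, the normalization step, and the penalty value of 0.01 are assumptions chosen for illustration, and the exact printed output will depend on your package versions.

```r
library(tidymodels)
library(orbital)

# Illustrative preprocessing and model specification
rec_spec <- recipe(mpg ~ ., data = mtcars) %>%
  step_normalize(all_numeric_predictors())

glmnet_spec <- linear_reg(penalty = 0.01, engine = "glmnet")

# Fit the workflow, then convert it to an orbital object
wf_fit <- workflow(rec_spec, glmnet_spec) %>%
  fit(data = mtcars)

orbital_obj <- orbital(wf_fit)
orbital_obj
```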

Acknowledgements

A big thank you to all the folks who helped make this release happen: @EmilHvitfeldt and @jeroenjanssens.