The parsnip package is now on CRAN. It is designed to solve a specific problem related to model fitting in R: the interface. Many functions have different interfaces and argument names, and parsnip standardizes both the interface for fitting models and the return values. When using parsnip, you don't have to remember each interface and its unique set of argument names to move easily between R packages.
This is the first of several blog posts that discuss the package. More information can be found at the parsnip pkgdown site.
The Problem
The interface problem is something that I've talked about for some time, and I'll use logistic regression to demonstrate the issue here. Many of us are familiar with the standard glm syntax for fitting models^[This syntax predates R and was formally described in the 1992 book Statistical Models in S. It's older than Debian.]. It uses the formula method and, to fit a logistic model, the family = binomial argument is required. Suppose that we want to apply some regularization to the model. A popular choice is the glmnet package, but its interface is very different from glm:
- It does not use the formula method and expects the predictors in a matrix (so dummy variables must be pre-computed).
- Nonstandard family objects are used. The argument is family = "binomial".
While none of these differences is a significant issue on its own, these types of inconsistencies are common across R packages, and the only way to avoid them is to restrict yourself to a single package.
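To make the contrast concrete, here is a minimal sketch of the two interfaces, assuming a data frame df with a two-level factor outcome y (illustrative objects, not from the original post):
# The formula interface used by glm():
glm_fit <- glm(y ~ ., data = df, family = binomial)

# glmnet() needs a numeric predictor matrix and a string family:
library(glmnet)
x <- model.matrix(y ~ ., data = df)[, -1]  # pre-compute dummy variables, drop the intercept
glmnet_fit <- glmnet(x = x, y = df$y, family = "binomial")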
There is a larger issue when you want to fit the same model via tensorflow's keras interface. keras has a beautiful approach to sequentially assembling deep learning models, but it has very little resemblance to the traditional approaches. Creating even a simple logistic model requires the user to learn and use drastically different syntax.
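To illustrate, here is a hedged sketch of the same logistic model in keras (again using the illustrative x and df objects from the sketch above):
library(keras)
# A single dense unit with a sigmoid activation is equivalent to logistic regression:
logit_model <- keras_model_sequential()
logit_model %>%
  layer_dense(units = 1, input_shape = ncol(x), activation = "sigmoid")
logit_model %>%
  compile(loss = "binary_crossentropy", optimizer = "adam")
logit_model %>%
  fit(x = x, y = as.numeric(df$y) - 1, epochs = 100)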
There is also inconsistency in how different packages return predictions. Most R packages use the predict() function to make predictions on new data. If we want to get class probabilities for our logistic regression model, using predict(obj, newdata, type = "response") will return a vector of probabilities for the second level of our factor. However, this convention can be wildly inconsistent across R packages. Some examples:
Function | Package | Code
---|---|---
glm | stats | predict(obj, type = "response")
lda | MASS | predict(obj)
gbm | gbm | predict(obj, type = "response", n.trees)
mda | mda | predict(obj, type = "posterior")
rpart | rpart | predict(obj, type = "prob")
Weka | RWeka | predict(obj, type = "probability")
logitboost | LogitBoost | predict(obj, type = "raw", nIter)
pamr.train | pamr | pamr.predict(obj, type = "posterior")
An added complication is that some models can create predictions across multiple submodels at once. For example, a boosted tree fit with \(i\) iterations can produce predictions using fewer than \(i\) iterations (effectively creating a different prediction model). This can lead to further inconsistencies.
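For instance, gbm can generate predictions from several submodels in a single call (a hedged sketch, assuming a fitted gbm object gbm_fit and a data frame newdata, neither of which appears in this post):
# One call produces a matrix with a column of predictions
# for each requested number of trees:
preds <- predict(gbm_fit, newdata, n.trees = c(100, 500, 1000), type = "response")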
These issues, in aggregate, can be grating. Sometimes it might feel like:
"Is R working for me or am I working for R?"
parsnip aims to decrease the frustration for people who want to evaluate different types of models on a data set. This is very much related to our guidelines for developing modeling packages (on which we are still looking for feedback).
parsnip syntax
To demonstrate, we'll use mtcars once again.
library(parsnip)
library(tidymodels)
#> ── Attaching packages ─────────────────────────────────────────────────────────────────── tidymodels 0.0.1 ──
#> ✔ ggplot2 3.1.0 ✔ recipes 0.1.4
#> ✔ tibble 1.4.2 ✔ broom 0.5.0
#> ✔ purrr 0.2.5 ✔ yardstick 0.0.2
#> ✔ dplyr 0.7.8 ✔ infer 0.3.1
#> ✔ rsample 0.0.3
#> ── Conflicts ────────────────────────────────────────────────────────────────────── tidymodels_conflicts() ──
#> ✖ purrr::accumulate() masks foreach::accumulate()
#> ✖ rsample::fill() masks tidyr::fill()
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ yardstick::get_weights() masks keras::get_weights()
#> ✖ dplyr::lag() masks stats::lag()
#> ✖ rsample::populate() masks Rcpp::populate()
#> ✖ recipes::step() masks stats::step()
#> ✖ yardstick::tidy() masks broom::tidy(), recipes::tidy(), rsample::tidy()
#> ✖ purrr::when() masks foreach::when()
set.seed(4831)
split <- initial_split(mtcars, prop = 9/10)
car_train <- training(split)
car_test <- testing(split)
Let’s preprocess these data to center and scale the predictors. We’ll use a basic recipe to do this:
car_rec <-
recipe(mpg ~ ., data = car_train) %>%
step_center(all_predictors()) %>%
step_scale(all_predictors()) %>%
prep(training = car_train, retain = TRUE)
# The processed versions are:
train_data <- juice(car_rec)
test_data <- bake(car_rec, car_test)
To use parsnip, you start with a model specification. This is a simple object that defines the intent of the model. Since we will be using linear regression of various flavors, our first step is a simple statement:
car_model <- linear_reg()
car_model
#> Linear Regression Model Specification (regression)
That's pretty underwhelming because we haven't given it any details yet. parsnip offers a variety of methods to fit this general model. We will use ordinary least squares, but could also use penalized least squares (via the lasso, ridge regression, Bayesian estimation, dropout, etc.). We differentiate these cases by the computational engine, which is a combination of the estimation type, such as least squares, and the implementation. The latter could be an R package or some other computing platform like Spark or TensorFlow.
To start simple, let's use lm:
lm_car_model <-
car_model %>%
set_engine("lm")
lm_car_model
#> Linear Regression Model Specification (regression)
#>
#> Computational engine: lm
There are no additional arguments that we should specify here, so let's jump to fitting the actual model. Our two choices at this point are whether to use fit() or fit_xy(). fit() takes a formula, while fit_xy() takes objects for the predictors and outcome(s). Recall that glm and lm only allow formulas, while glmnet only takes a matrix of predictors and an outcome. parsnip allows either, so that you can avoid having to think about what the underlying model function requires. To demonstrate, let's make a simple model:
lm_fit <-
lm_car_model %>%
fit(mpg ~ ., data = car_train)
# or
lm_car_model %>%
fit_xy(x = select(car_train, -mpg), y = select(car_train, mpg))
#> parsnip model object
#>
#>
#> Call:
#> stats::lm(formula = formula, data = data)
#>
#> Coefficients:
#> (Intercept) cyl disp hp drat
#> 23.1945 -1.6396 0.0439 -0.0301 0.8517
#> wt qsec vs am gear
#> -6.0165 0.8668 0.8757 2.4274 -0.4658
#> carb
#> 0.7889
If we had predictors that were factors, fit() would be a better choice. If the underlying model takes a formula, the formula and data are passed directly to the function without modification. Otherwise, fit() applies the standard model.matrix() machinery to do the preprocessing and converts the data to the required format (e.g., a matrix for glmnet). Note that, for Spark tables, fit() must be used.
It should be noted that lm_car_model is a model specification object, while lm_fit is a model fit object.
More Engines
The value of parsnip starts to show when we want to try different engines. Let's take our same model and use Bayesian estimation to fit the parameters via Stan. We can change the engine to do so:
stan_car_model <-
car_model %>%
set_engine("stan")
stan_car_model
#> Linear Regression Model Specification (regression)
#>
#> Computational engine: stan
To fit this model, parsnip calls stan_glm() from the rstanarm package. If you want to pass arguments to this function, just add them to set_engine():
stan_car_model <-
car_model %>%
set_engine("stan", iter = 5000, prior_intercept = rstanarm::cauchy(0, 10), seed = 2347)
stan_car_model
#> Linear Regression Model Specification (regression)
#>
#> Engine-Specific Arguments:
#> iter = 5000
#> prior_intercept = rstanarm::cauchy(0, 10)
#> seed = 2347
#>
#> Computational engine: stan
The namespace was used to call cauchy() since parsnip does not fully attach the package when the model is fit.
The model can be fit in the same way. We'll add one feature here: rstanarm prints a lot of output while fitting. This can be helpful for diagnosing issues, but we'll suppress it using a control function:
# don't print anything:
ctrl <- fit_control(verbosity = 0)
stan_fit <-
stan_car_model %>%
fit(mpg ~ ., data = car_train, control = ctrl)
stan_fit
#> parsnip model object
#>
#> stan_glm
#> family: gaussian [identity]
#> formula: mpg ~ .
#> observations: 24
#> predictors: 11
#> ------
#> Median MAD_SD
#> (Intercept) 23.6 24.1
#> cyl -1.5 1.6
#> disp 0.0 0.0
#> hp 0.0 0.0
#> drat 0.8 2.4
#> wt -5.4 3.1
#> qsec 0.8 0.9
#> vs 0.7 3.2
#> am 2.3 2.7
#> gear -0.4 2.1
#> carb 0.6 1.4
#>
#> Auxiliary parameter(s):
#> Median MAD_SD
#> sigma 3.1 0.6
#>
#> Sample avg. posterior predictive distribution of y:
#> Median MAD_SD
#> mean_PPD 20.6 0.9
#>
#> ------
#> * For help interpreting the printed output see ?print.stanreg
#> * For info on the priors used see ?prior_summary.stanreg
That was easy.
But wait, there's more! Getting predictions for these models is simple and tidy. We've been working on a standard for model predictions where the predictions always return a tibble with the same number of rows as the data being predicted. This solves the frustrating issue of having new data with missing predictor values and a predict() method that returns predictions only for the complete data, in which case you have to match up the rows of the original data to the predicted values.
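As a small illustration of this convention (hypothetical code, not shown in the original post), a row with a missing predictor still yields a row in the output:
test_missing <- car_test
test_missing$wt[1] <- NA
predict(lm_fit, test_missing)
# still a tibble with 8 rows; the first .pred is NA rather than being dropped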
For regression, basic predictions come back in a column called .pred:
predict(lm_fit, car_test)
#> # A tibble: 8 x 1
#> .pred
#> <dbl>
#> 1 17.5
#> 2 17.7
#> 3 11.0
#> 4 13.2
#> 5 13.2
#> 6 10.9
#> 7 31.9
#> 8 24.9
predict(stan_fit, car_test)
#> # A tibble: 8 x 1
#> .pred
#> <dbl>
#> 1 17.4
#> 2 17.9
#> 3 11.6
#> 4 13.6
#> 5 13.6
#> 6 10.9
#> 7 31.5
#> 8 24.9
These predictions can easily be joined to the original data; the . prefix in the column name is there to prevent conflicts with existing column names.
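For example, one natural pattern (illustrative, not from the original post) is to bind the predictions to the test set:
car_test %>%
  select(mpg) %>%
  bind_cols(predict(lm_fit, car_test))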
parsnip also enables different types of predictions with a standard interface. To get interval estimates:
predict(lm_fit, car_test, type = "conf_int")
#> # A tibble: 8 x 2
#> .pred_lower .pred_upper
#> <dbl> <dbl>
#> 1 14.1 21.0
#> 2 11.6 23.8
#> 3 3.57 18.3
#> 4 7.41 18.9
#> 5 7.15 19.3
#> 6 6.39 15.5
#> 7 23.2 40.7
#> 8 19.0 30.9
# Not really a confidence interval but gives quantiles of
# the posterior distribution of the fitted values.
predict(stan_fit, car_test, type = "conf_int")
#> # A tibble: 8 x 2
#> .pred_lower .pred_upper
#> <dbl> <dbl>
#> 1 14.0 20.7
#> 2 12.0 23.9
#> 3 5.03 18.4
#> 4 8.37 18.9
#> 5 8.17 19.3
#> 6 6.38 15.4
#> 7 23.1 39.6
#> 8 19.1 30.8
As one might expect, the code to obtain these values using the original packages is very different from one package to another. parsnip works to make the interface easy. A mapping between the available models and their prediction types is here.
Standardized Arguments
Now let's look at estimating this model using an L2 penalty (a.k.a. weight decay, a.k.a. ridge regression). There are a few ways of doing this, and glmnet is an obvious choice. While we don't have to declare the size of the penalty at the time of model fitting, we'll do so below for illustration:
library(glmnet)

x_mat <-
  car_train %>%
  select(-mpg) %>%
  as.matrix()

glmnet(x = x_mat, y = car_train$mpg, alpha = 0, lambda = 0.1)
alpha = 0 tells glmnet to use only an L2 penalty (as opposed to a mixture of L1 and L2 penalties).
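For reference, glmnet's elastic net penalty has the form \(\lambda \left[ (1 - \alpha) \lVert \beta \rVert_2^2 / 2 + \alpha \lVert \beta \rVert_1 \right]\), so setting \(\alpha = 0\) leaves only the squared (ridge) term.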
For keras, possible syntax could be:
lr_model <- keras_model_sequential()
lr_model %>%
layer_dense(units = 1, input_shape = dim(x_mat)[2], activation = 'linear',
kernel_regularizer = regularizer_l2(0.1))
early_stopping <- callback_early_stopping(monitor = 'loss', min_delta = 0.000001)
lr_model %>% compile(
loss = 'mean_squared_error',
optimizer = optimizer_adam(lr = 0.001)
)
lr_model %>%
fit(
x = x_mat,
y = car_train$mpg,
epochs = 1000,
batch_size = 1,
callbacks = early_stopping
)
This is very powerful but maybe it’s not something that you want to have to type more than once.
parsnip model functions, like linear_reg(), can also have main arguments that are standardized and avoid jargon like lambda or kernel_regularizer. Here, a model specification would be:
penalized <- linear_reg(mixture = 0, penalty = 0.1)
penalized
#> Linear Regression Model Specification (regression)
#>
#> Main Arguments:
#> penalty = 0.1
#> mixture = 0
penalty is the amount of regularization that we want to use. mixture is only used for models, like glmnet, that can fit different types of penalties; it is the proportion of the total penalty that is L1 (i.e., alpha from above), so mixture = 0 corresponds to a pure L2 (weight decay) penalty.
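To spell out the mapping (a summary based on the fits in this post; mixture = 1 would be a pure lasso model):
# linear_reg(penalty, mixture)  maps to  glmnet(lambda = penalty, alpha = mixture)
linear_reg(penalty = 0.1, mixture = 0)  # pure L2 (ridge), as above
linear_reg(penalty = 0.1, mixture = 1)  # pure L1 (lasso)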
From here, the glmnet model would be:
glmn_fit <-
penalized %>%
set_engine("glmnet") %>%
fit(mpg ~ ., data = car_train)
glmn_fit
#> parsnip model object
#>
#>
#> Call: glmnet::glmnet(x = as.matrix(x), y = y, family = "gaussian", alpha = ~0, lambda = ~0.1)
#>
#> Df %Dev Lambda
#> [1,] 10 0.854 0.1
For keras, we can add the other options (unrelated to the penalty) via set_engine():
early_stopping <- callback_early_stopping(monitor = 'loss', min_delta = 0.000001)
keras_fit <-
penalized %>%
set_engine("keras", epochs = 1000, batch_size = 1, callbacks = !!early_stopping) %>%
fit(mpg ~ ., data = car_train, control = ctrl)
keras_fit
#> parsnip model object
#>
#> Model
#> ___________________________________________________________________________
#> Layer (type) Output Shape Param #
#> ===========================================================================
#> dense_1 (Dense) (None, 1) 11
#> ___________________________________________________________________________
#> dense_2 (Dense) (None, 1) 2
#> ===========================================================================
#> Total params: 13
#> Trainable params: 13
#> Non-trainable params: 0
#> ___________________________________________________________________________
The main arguments are standardized in parsnip, so that logistic_reg() and other functions use the same names, and they are being standardized in other packages like recipes and dials.
What parsnip is and what it isn’t
Other packages, such as caret and mlr, help to solve the R model API issue. These packages do a lot of other things too: preprocessing, model tuning, resampling, feature selection, ensembling, and so on. In the tidyverse, we strive to make our packages modular, and parsnip is designed only to solve the interface issue. It is not designed to be a drop-in replacement for caret.
The tidymodels package collection, which includes parsnip, has other packages for many of these tasks, and they are designed to work together. We are working towards higher-level APIs that can replicate and extend what the current model packages can do.
For example, fit() and fit_xy() do not involve recipes. It might seem natural to include a recipe interface like caret does (and, originally, parsnip did). The reason that recipes are excluded from fitting parsnip objects is that you probably want to process the recipe once and then use it across different models; including it would tie that specific recipe to each fitted model object.
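A minimal sketch of the intended pattern, reusing objects defined earlier in this post:
# Prep the recipe once (done above), then reuse its output across models:
glmn_from_rec <-
  penalized %>%
  set_engine("glmnet") %>%
  fit(mpg ~ ., data = train_data)

stan_from_rec <-
  stan_car_model %>%
  fit(mpg ~ ., data = train_data, control = ctrl)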
As an alternative, we are working on a different object type that is similar to existing pipelines where a set of modeling activities can be woven together to represent the entire modeling process. To get an idea of the activities that we have in store for tidy modeling, look here.
What’s next
Subsequent blog posts on parsnip will talk about the underlying architecture and the choices that we made along the way (and why). We'll also talk more about how parsnip integrates with other tidymodels packages, how quasiquotation can/should be used, and some other features that are particularly interesting to us.