We’re stoked to announce that tidymodels now fully supports sparse data from end to end. We have been working on this for over 5 years. It is an extension of our previous work with blueprints, which carried the data sparsely for part of the way.
You will need recipes 1.2.0, parsnip 1.3.0, and workflows 1.2.0 or later for this to work.
What are sparse data?
The term sparse data refers to a data set containing many zeroes. Sparse data appears in all kinds of fields and can be produced by a number of preprocessing methods. The reason we care about sparse data is because of how computers store numbers. A 32-bit integer value takes 4 bytes to store, an array of ten 32-bit integers takes 40 bytes, and so on. This happens because each value, zero or not, is written down.
A sparse representation instead stores the locations and values of the non-zero entries. Suppose we have the following vector with 20 entries:
c(0, 0, 1, 0, 3, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
It could be represented sparsely using the three components positions = c(3, 5, 8), values = c(1, 3, 7), and length = 20. Now, we have seven values to represent a vector of 20 elements. Since some modeling tasks contain even sparser data, this type of representation starts to show real benefits in terms of execution time and memory consumption.
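As a quick illustration, here is a sketch of the same vector built in a sparse representation using the Matrix package (the same package whose sparse matrices tidymodels now accepts):
library(Matrix)

# Only the positions and values of the non-zero entries are stored,
# together with the total length.
x_sparse <- sparseVector(x = c(1, 3, 7), i = c(3, 5, 8), length = 20)

length(x_sparse)  # still behaves like a vector of 20 elements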
The tidymodels set of packages has undergone several internal changes to allow it to represent data sparsely when it is beneficial. These changes allow you to fit models that contain sparse data faster and more memory efficiently than before. Moreover, they allow you to fit models that previously were not possible because the data did not fit in memory.
Sparse matrix support
The first benefit of these changes is that recipe(), prep(), bake(), fit(), and predict() now accept sparse matrices created using the Matrix package.
The permeability_qsar data set from the modeldata package contains quite a lot of zeroes in the predictors, so we will use it as a demonstration. We start by coercing it into a sparse matrix.
library(tidymodels)
library(Matrix)
permeability_sparse <- as(as.matrix(permeability_qsar), "sparseMatrix")
We can now use this sparse matrix in our code the same way as a dense matrix or data frame:
rec_spec <- recipe(permeability ~ ., data = permeability_sparse) |>
  step_zv(all_predictors())
mod_spec <- boost_tree("regression", "xgboost")
wf_spec <- workflow(rec_spec, mod_spec)
Model training has the usual syntax:
wf_fit <- fit(wf_spec, permeability_sparse)
as does prediction:
predict(wf_fit, permeability_sparse)
#> # A tibble: 165 × 1
#> .pred
#> <dbl>
#> 1 10.5
#> 2 1.50
#> 3 13.1
#> 4 1.10
#> 5 1.25
#> 6 0.738
#> 7 29.3
#> 8 2.44
#> 9 36.3
#> 10 4.31
#> # ℹ 155 more rows
Note that only some models/engines work well with sparse data. These are all listed at https://www.tidymodels.org/find/sparse/. If the model doesn’t support sparse data, the data will be coerced into the default dense representation and used as usual.
With a few exceptions, a sparse matrix should work like any other data set. However, this approach has two main limitations. The first is that we are limited to regression tasks, since the outcome has to be numeric to be part of the sparse matrix.
The second limitation is that it only works with the non-formula methods for parsnip and workflows. This means that, when using a workflow, you need to add a recipe with add_recipe() or select variables directly with add_variables(), and when using a parsnip object by itself, you need to use fit_xy() instead of fit().
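For the parsnip-only route, here is a minimal sketch that reuses the permeability_sparse matrix and mod_spec from above and splits off the outcome column by name:
# Split the sparse matrix into predictors and a numeric outcome vector
x <- permeability_sparse[, setdiff(colnames(permeability_sparse), "permeability")]
y <- permeability_sparse[, "permeability"]

# fit_xy() accepts the sparse predictor matrix directly
mod_fit <- fit_xy(mod_spec, x = x, y = y)

predict(mod_fit, new_data = x)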
If this is of interest, we also have a post on https://www.tidymodels.org/ about using sparse matrices in tidymodels.
Sparse data from recipes steps
Where this sparsity support really starts to shine is when the recipe we use generates sparse data. The relevant steps come in two flavors, sparsity-creating steps and sparsity-preserving steps, both listed here: https://www.tidymodels.org/find/sparse/.
Some steps, like step_dummy(), step_indicate_na(), and textrecipes::step_tf(), will almost always produce a lot of zeroes. We take advantage of that by generating their output sparsely when it is beneficial. If these steps end up producing sparse vectors, we want to make sure the sparsity is preserved. A couple of handfuls of steps, such as step_impute_mean() and step_scale(), have been updated to work efficiently with sparse vectors. Both types of steps are detailed in the above-linked list of compatible methods.
What this means in practice is that if you use a model/engine that supports sparse data and a recipe that produces enough sparse columns, the steps will store their output in a new sparse data format (when appropriate) as the recipe is being processed. Then, if the model can accept sparse objects, we convert the data from this new sparse format to a standard sparse matrix object. This increases performance when possible while preserving it otherwise.
Below is a simple recipe using the ames data set. step_dummy() is applied to all the categorical predictors, leading to a significant amount of zeroes.
rec_spec <- recipe(Sale_Price ~ ., data = ames) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors())
mod_spec <- boost_tree("regression", "xgboost")
wf_spec <- workflow(rec_spec, mod_spec)
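Before fitting, you can get a feel for how sparse the processed data is. This is just a sketch: it preps the recipe, bakes it to a sparse matrix via bake()'s composition argument, and counts the zero entries.
rec_prep <- prep(rec_spec, training = ames)
baked <- bake(rec_prep, new_data = NULL, composition = "dgCMatrix")

# Proportion of entries in the processed training data that are zero
1 - Matrix::nnzero(baked) / prod(dim(baked))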
When we fit it now, it takes around 125ms and allocates 37.2MB; before these changes, it took around 335ms and allocated 67.5MB.
wf_fit <- fit(wf_spec, ames)
We see similar speedups when we predict: around 20ms and 25.2MB now, compared to around 60ms and 55.6MB before.
predict(wf_fit, ames)
#> # A tibble: 2,930 × 1
#> .pred
#> <dbl>
#> 1 208649.
#> 2 115339.
#> 3 148634.
#> 4 239770.
#> 5 190082.
#> 6 184604.
#> 7 208572.
#> 8 177403.
#> 9 261000.
#> 10 198604.
#> # ℹ 2,920 more rows
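Timings like the ones quoted above can be measured with the bench package. This is a rough sketch, and the exact numbers will vary with your machine and package versions:
library(bench)

# Time and memory allocation for fitting the workflow
bench::mark(fit(wf_spec, ames), iterations = 10)

# Time and memory allocation for prediction
bench::mark(predict(wf_fit, ames), iterations = 10)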
These improvements are tightly tied to memory allocation, which depends on the sparsity of the data set produced by the recipe. This is why it is hard to say how much benefit you will see. We have seen improvements of orders of magnitude, both in time and in memory allocation. We have also been able to fit models where the data was previously too big to fit in memory.
Please see the post on tidymodels.org, which goes into more detail about when you are likely to benefit from this and how to change your recipes and workflows to take full advantage of this new feature.
There is also a post on https://www.tidymodels.org/ going into a bit more detail about how to use recipes to produce sparse data.