We’re tickled pink to announce the releases of extension packages that followed the recent release of recipes 0.2.0. recipes is a package for preprocessing data before using it in models or visualizations. You can think of it as a mash-up of model.matrix() and dplyr.
You can install these updates from CRAN with:
install.packages("embed")
install.packages("themis")
install.packages("textrecipes")
The NEWS files for each package are linked here; we will go over some of the bigger changes within and between these packages in this post. Many of the smaller changes were made to bring these extension packages up to the same standard as recipes itself.
library(recipes)
library(themis)
library(textrecipes)
library(embed)
library(modeldata)
set.seed(1234)
themis
A new step, step_smotenc(), was added thanks to Robert Gregg. This step applies the SMOTENC algorithm to synthetically generate observations from minority classes. The SMOTENC method can handle a mix of categorical and numerical predictors, unlike the existing SMOTE method, which can only operate on numeric predictors.
The hpc_data set illustrates this use case neatly. The data set contains characteristics of HPC Unix jobs and how long they took to run (the outcome column is class). The outcome is not that balanced, with some classes having almost 10 times fewer observations than others. One way to deal with an imbalance like this is to over-sample the minority observations to mitigate the imbalance.
data(hpc_data)
hpc_data %>% count(class)
#> # A tibble: 4 × 2
#>   class     n
#>   <fct> <int>
#> 1 VF     2211
#> 2 F      1347
#> 3 M       514
#> 4 L       259
Using step_smotenc() with the over_ratio argument, we can make sure that all classes are over-sampled to have no fewer than half the observations of the largest class.
up_rec <- recipe(class ~ ., data = hpc_data) %>%
  step_smotenc(class, over_ratio = 0.5) %>%
  prep()

up_rec %>%
  bake(new_data = NULL) %>%
  count(class)
#> # A tibble: 4 × 2
#>   class     n
#>   <fct> <int>
#> 1 VF     2211
#> 2 F      1347
#> 3 M      1105
#> 4 L      1105
The methods implemented in themis now have standalone functions, so you can apply these algorithms without having to create a recipe.
smotenc(hpc_data, "class", over_ratio = 0.5)
#> # A tibble: 5,768 × 8
#>    protocol compounds input_fields iterations num_pending  hour day   class
#>    <fct>        <dbl>        <dbl>      <dbl>       <dbl> <dbl> <fct> <fct>
#>  1 E              997          137         20           0  14   Tue   F
#>  2 E               97          103         20           0  13.8 Tue   VF
#>  3 E              101           75         10           0  13.8 Thu   VF
#>  4 E               93           76         20           0  10.1 Fri   VF
#>  5 E              100           82         20           0  10.4 Fri   VF
#>  6 E              100           82         20           0  16.5 Wed   VF
#>  7 E              105           88         20           0  16.4 Fri   VF
#>  8 E               98           95         20           0  16.7 Fri   VF
#>  9 E              101           91         20           0  16.2 Fri   VF
#> 10 E               95           92         20           0  10.8 Wed   VF
#> # … with 5,758 more rows
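To verify that the standalone function gives the same class balance as the recipe version above, you can pipe its result straight into count() (a small sketch, not part of the original post):

smotenc(hpc_data, "class", over_ratio = 0.5) %>%
  count(class)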
textrecipes
We added the functions all_tokenized() and all_tokenized_predictors() to more easily select tokenized columns, similar to the existing all_numeric() and all_numeric_predictors() selectors in recipes.
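As a quick illustration (a minimal sketch, not from the release notes; it assumes the tate_text data and a downstream step_tf() term-frequency step), the new selector lets later steps pick up every tokenized column at once:

data(tate_text)

token_rec <- recipe(~., data = tate_text) %>%
  step_tokenize(medium) %>%
  # all_tokenized_predictors() selects the freshly tokenized medium column
  step_tf(all_tokenized_predictors())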
The most important step in textrecipes is step_tokenize(), as you need it to generate tokens that can be modified by other steps. We have found that this function has gotten overloaded with functionality as more and more support for different types of tokenization was added. To address this, we have created new specialized tokenization steps: step_tokenize() has gotten the cousin steps step_tokenize_bpe(), step_tokenize_sentencepiece(), and step_tokenize_wordpiece(), which wrap tokenizers.bpe, sentencepiece, and wordpiece respectively.
In addition to being easier to manage code-wise, these new functions also allow for more compact, more readable code with better tab completion.
data(tate_text)

# Old
tate_rec <- recipe(~., data = tate_text) %>%
  step_tokenize(
    medium,
    engine = "tokenizers.bpe",
    training_options = list(vocab_size = 1000)
  )

# New
tate_rec <- recipe(~., data = tate_text) %>%
  step_tokenize_bpe(medium, vocabulary_size = 1000)
embed
step_feature_hash() is now soft-deprecated in embed in favor of step_dummy_hash() in textrecipes. The embed version uses TensorFlow, which for some use cases is quite a heavy dependency. One thing to keep an eye out for when moving over is that the textrecipes version uses num_terms instead of num_hash to denote the number of columns to output.
data(Sacramento)

# Old recipe
embed_rec <- recipe(price ~ zip, data = Sacramento) %>%
  step_feature_hash(zip, num_hash = 64)
#> Loaded Tensorflow version 2.8.0

# New recipe
textrecipes_rec <- recipe(price ~ zip, data = Sacramento) %>%
  step_dummy_hash(zip, num_terms = 64)
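If you want to double-check the switch, you can prep and bake the new recipe and confirm the shape of the result (a quick sketch, not from the original post; it assumes the default behavior of replacing the hashed column):

textrecipes_rec %>%
  prep() %>%
  bake(new_data = NULL) %>%
  # zip is replaced by 64 hashed indicator columns, so we expect
  # 65 columns in total once the outcome price is included
  ncol()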
Acknowledgements
We’d like to extend our thanks to all of the contributors who helped make these releases possible!
- themis: @coforfe, @EmilHvitfeldt, @emilyriederer, @jennybc, @OGuggenbuehl, and @RobertGregg.
- textrecipes: @dgrtwo, @DiabbZegpi, @EmilHvitfeldt, @jcragy, @jennybc, @joeycouse, @lionel-, @NLDataScientist, @raj-hubber, and @topepo.
- embed: @EmilHvitfeldt, @juliasilge, @naveranoc, @talegari, and @topepo.