We’re tickled pink to announce the releases of extension packages that followed the recent release of recipes 0.2.0. recipes is a package for preprocessing data before using it in models or visualizations. You can think of it as a mash-up of model.matrix() and dplyr.
You can install these updates from CRAN with:
install.packages("embed")
install.packages("themis")
install.packages("textrecipes")
The NEWS files for each package are linked here; we will go over some of the bigger changes within and between these packages in this post. Many of the smaller changes were made to bring these extension packages up to the same standard as recipes itself.
library(recipes)
library(themis)
library(textrecipes)
library(embed)
library(modeldata)
set.seed(1234)
themis
A new step, step_smotenc(), was added thanks to Robert Gregg. This step applies the SMOTENC algorithm to synthetically generate observations from minority classes. The SMOTENC method can handle a mix of categorical and numerical predictors, unlike the existing SMOTE method, which can only operate on numeric predictors.
The hpc_data set illustrates this use case neatly. The data set contains characteristics of HPC Unix jobs and how long they took to run (the outcome column is class). The outcome is not that balanced, with some classes having almost 10 times fewer observations than others. One way to deal with an imbalance like this is to over-sample the minority observations to mitigate the imbalance.
data(hpc_data)
hpc_data %>% count(class)
#> # A tibble: 4 × 2
#>   class     n
#>   <fct> <int>
#> 1 VF     2211
#> 2 F      1347
#> 3 M       514
#> 4 L       259
Using step_smotenc() with the over_ratio argument, we can make sure that all classes are over-sampled to have no fewer than half the observations of the largest class.
up_rec <- recipe(class ~ ., data = hpc_data) %>%
  step_smotenc(class, over_ratio = 0.5) %>%
  prep()

up_rec %>%
  bake(new_data = NULL) %>%
  count(class)
#> # A tibble: 4 × 2
#>   class     n
#>   <fct> <int>
#> 1 VF     2211
#> 2 F      1347
#> 3 M      1105
#> 4 L      1105
The methods implemented in themis now have standalone functions, so you can apply these algorithms without having to create a recipe.
smotenc(hpc_data, "class", over_ratio = 0.5)
#> # A tibble: 5,768 × 8
#>    protocol compounds input_fields iterations num_pending  hour day   class
#>    <fct>        <dbl>        <dbl>      <dbl>       <dbl> <dbl> <fct> <fct>
#>  1 E              997          137         20           0  14   Tue   F
#>  2 E               97          103         20           0  13.8 Tue   VF
#>  3 E              101           75         10           0  13.8 Thu   VF
#>  4 E               93           76         20           0  10.1 Fri   VF
#>  5 E              100           82         20           0  10.4 Fri   VF
#>  6 E              100           82         20           0  16.5 Wed   VF
#>  7 E              105           88         20           0  16.4 Fri   VF
#>  8 E               98           95         20           0  16.7 Fri   VF
#>  9 E              101           91         20           0  16.2 Fri   VF
#> 10 E               95           92         20           0  10.8 Wed   VF
#> # … with 5,758 more rows
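To verify that the standalone function gives the same class balance as the recipe version above, you can pipe its result straight into count() (a small sketch, not part of the original post):

smotenc(hpc_data, "class", over_ratio = 0.5) %>%
  count(class)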
textrecipes
We added the functions all_tokenized() and all_tokenized_predictors() to more easily select tokenized columns, similar to the existing all_numeric() and all_numeric_predictors() selectors in recipes.
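As a quick illustration (a minimal sketch, not from the release notes; it assumes the tate_text data and a downstream step_tf() term-frequency step), the new selector lets later steps pick up every tokenized column at once:

data(tate_text)

token_rec <- recipe(~., data = tate_text) %>%
  step_tokenize(medium) %>%
  # all_tokenized_predictors() selects the freshly tokenized medium column
  step_tf(all_tokenized_predictors())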
The most important step in textrecipes is step_tokenize(), as you need it to generate tokens that can be modified by other steps. We have found that this function has gotten overloaded with functionality as more and more support for different types of tokenization was added. To address this, we have created new specialized tokenization steps: step_tokenize() has gotten the cousin steps step_tokenize_bpe(), step_tokenize_sentencepiece(), and step_tokenize_wordpiece(), which wrap tokenizers.bpe, sentencepiece, and wordpiece respectively.
In addition to being easier to manage code-wise, these new functions also allow for more compact, more readable code with better tab completion.
data(tate_text)

# Old
tate_rec <- recipe(~., data = tate_text) %>%
  step_tokenize(
    medium,
    engine = "tokenizers.bpe",
    training_options = list(vocab_size = 1000)
  )

# New
tate_rec <- recipe(~., data = tate_text) %>%
  step_tokenize_bpe(medium, vocabulary_size = 1000)
embed
step_feature_hash() is now soft-deprecated in embed in favor of step_dummy_hash() in textrecipes. The embed version uses TensorFlow, which for some use cases is quite a heavy dependency. One thing to keep an eye out for when moving over is that the textrecipes version uses num_terms instead of num_hash to denote the number of columns to output.
data(Sacramento)

# Old recipe
embed_rec <- recipe(price ~ zip, data = Sacramento) %>%
  step_feature_hash(zip, num_hash = 64)
#> Loaded Tensorflow version 2.8.0

# New recipe
textrecipes_rec <- recipe(price ~ zip, data = Sacramento) %>%
  step_dummy_hash(zip, num_terms = 64)
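If you want to double-check the switch, you can prep and bake the new recipe and confirm the shape of the result (a quick sketch, not from the original post; it assumes the default behavior of replacing the hashed column):

textrecipes_rec %>%
  prep() %>%
  bake(new_data = NULL) %>%
  # zip is replaced by 64 hashed indicator columns, so we expect
  # 65 columns in total once the outcome price is included
  ncol()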
Acknowledgements
We’d like to extend our thanks to all of the contributors who helped make these releases possible!
- themis: @coforfe, @EmilHvitfeldt, @emilyriederer, @jennybc, @OGuggenbuehl, and @RobertGregg.
- textrecipes: @dgrtwo, @DiabbZegpi, @EmilHvitfeldt, @jcragy, @jennybc, @joeycouse, @lionel-, @NLDataScientist, @raj-hubber, and @topepo.
- embed: @EmilHvitfeldt, @juliasilge, @naveranoc, @talegari, and @topepo.