We’re delighted to announce the release of textrecipes 0.0.1 on CRAN. textrecipes implements a collection of new steps for the recipes package to deal with text preprocessing. textrecipes is still in early development, so any and all feedback is highly appreciated.
You can install it by running:
install.packages("textrecipes")
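Once installed, the package can be loaded together with recipes (a minimal setup for the examples below):
library(recipes)      # provides recipe(), prep(), and bake()
library(textrecipes)  # provides the text preprocessing steps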
## New steps
The steps introduced here can be split into three types, those that:
- convert from characters to list-columns and vice versa,
- modify the elements in list-columns, and
- convert list-columns to numeric variables.
This split allows for greater flexibility in the preprocessing tasks that can be done while staying inside the recipes framework, and it avoids having a single step with many arguments.
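As a rough sketch of how the three types chain together (each step is introduced properly below; okc_text is the example data used throughout this post):
data("okc_text")

recipe(~ ., okc_text) %>%
  step_tokenize(essay0) %>%  # type 1: character column -> list-column of tokens
  step_stem(essay0) %>%      # type 2: modifies the tokens inside the list-column
  step_tf(essay0)            # type 3: list-column -> numeric term-frequency columns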
## Workflows
We start by creating a `recipe` object from the original data.
data("okc_text")
rec_obj <- recipe(~ ., okc_text)
rec_obj
#> Data Recipe
#>
#> Inputs:
#>
#> role #variables
#> predictor 10
The workflow in textrecipes so far starts with `step_tokenize()`, followed by a combination of type-1 and type-2 steps, and ends with a type-3 step. `step_tokenize()` wraps the tokenizers package for tokenization, but other tokenization functions can be used via the `custom_token` argument. More information about the arguments can be found in the documentation. The shortest possible recipe is `step_tokenize()` directly followed by a type-3 step.
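For example, a custom tokenizer can be supplied as a function that takes a character vector and returns a list of token vectors (a sketch of how `custom_token` might be used; the exact interface it expects is described in the step's documentation):
rec_obj %>%
  step_tokenize(essay0, custom_token = function(x) strsplit(x, " +")) %>%  # split on whitespace
  step_tf(essay0)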
### Feature hashing done on word tokens
rec_obj %>%
step_tokenize(essay0) %>% # token argument defaults to "words"
step_texthash(essay0)
#> Data Recipe
#>
#> Inputs:
#>
#> role #variables
#> predictor 10
#>
#> Operations:
#>
#> Tokenization for essay0
#> Feature hashing with essay0
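The dimensionality of the hashed feature space can also be adjusted; a sketch, assuming `step_texthash()` exposes a `num_terms` argument controlling the number of hashing columns:
rec_obj %>%
  step_tokenize(essay0) %>%
  step_texthash(essay0, num_terms = 512)  # assumed argument: request 512 hash columns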
### Counting character occurrences
rec_obj %>%
step_tokenize(essay0, token = "character") %>%
step_tf(essay0)
#> Data Recipe
#>
#> Inputs:
#>
#> role #variables
#> predictor 10
#>
#> Operations:
#>
#> Tokenization for essay0
#> Term frequency with essay0
If one wanted to calculate the word counts of the 100 most frequently used words after stemming, type-2 steps are needed. Here we use `step_stem()` to perform stemming with the SnowballC package and `step_tokenfilter()` to keep only the 100 most frequent tokens.
rec_obj %>%
step_tokenize(essay0) %>%
step_stem(essay0) %>%
step_tokenfilter(essay0, max_tokens = 100) %>%
step_tf(essay0)
#> Data Recipe
#>
#> Inputs:
#>
#> role #variables
#> predictor 10
#>
#> Operations:
#>
#> Tokenization for essay0
#> Stemming for essay0
#> Text filtering for essay0
#> Term frequency with essay0
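The recipes above only declare the preprocessing; as with any recipe, the features are computed once the recipe is prepped and baked. A minimal sketch of that last part of the workflow:
tf_rec <- rec_obj %>%
  step_tokenize(essay0) %>%
  step_stem(essay0) %>%
  step_tokenfilter(essay0, max_tokens = 100) %>%
  step_tf(essay0) %>%
  prep(training = okc_text)

bake(tf_rec, okc_text)  # okc_text with essay0 turned into (at most) 100 term-frequency columns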
For more combinations, please consult the documentation and the vignette, which includes recipe examples.
## Acknowledgements
A big thank you goes out to the 6 people who contributed to this release: @ClaytonJY, @DavisVaughan, @EmilHvitfeldt, @jwijffels, @kanishkamisra, and @topepo.