Two New tidymodels Packages

  feature-selection, tidymodels

  Frances Lin, Max Kuhn

We’re very chuffed to announce the release of two new modeling packages: filtro and important.

You can install them from CRAN with:

install.packages(c("filtro", "important"))

This blog post will introduce both.

filtro

Feature selection is an important step in building machine learning models that are robust and reliable. By keeping only the most relevant predictors, we can reduce overfitting, improve model performance, and speed up computation.

filtro is a low-level tidy tools designed for filter-based supervised feature selection. filtro makes it easy to score, rank, and select features using a wide range of statistical and model-based metrics. The scoring metrics include: p-values, correlation, random forest feature importance, information gain, and more.

With filtro, we can quickly rank the variables and select either the top proportion or the top number of features that best contribute to our model. It also supports multi-parameter optimization via desirability functions. filtro is a standalone tool, but it integrates with other packages, allowing it to be used within the tidymodels workflows.

Currently, filtro implements a total of six filters. Like other elements of the framework, also filtro is extensible if you want to use a score we haven’t implemented yet. You can read more on how to do this on tidymodels.org.

The available score class objects are:

##  [1] "score_aov_fstat"          "score_aov_pval"          
##  [3] "score_cor_pearson"        "score_cor_spearman"      
##  [5] "score_gain_ratio"         "score_imp_rf"            
##  [7] "score_imp_rf_conditional" "score_imp_rf_oblique"    
##  [9] "score_info_gain"          "score_roc_auc"           
## [11] "score_sym_uncert"         "score_xtab_pval_chisq"   
## [13] "score_xtab_pval_fisher"

Let’s look at an example. Kuhn and Johnson (2013) described a data set where 176 samples were collected from a chemical manufacturing process. The goal is to predict process yield. Predictors are continuous, count, and categorical; some are correlated, and some contain missing values.

Let’s create an initial split of the data (which are in the modeldata package):

library(tidymodels)
library(filtro)

set.seed(1)
yield_split <- initial_split(modeldata::chem_proc_yield)
yield_split
## <Training/Testing/Total>
## <132/44/176>
yield_train <- training(yield_split)
yield_test <- testing(yield_split)

We’d like to estimate the strength of the relationship between these 57 predictors and the process yield. We’ll quantify that in two ways. First is the old-fashioned Spearman rank correlation statistic. We can estimate these values and rank them by the absolute value of the correlations. We can also measure their value using a random forest variable importance. One quality of the predictors is that their values are correlated, so there may be some value in using an oblique random forest model. This creates a collection of tree-based models with splits that are linear combinations of the selected predictors.

To estimate the scores, we use the score objects contained in the package along with the fit() method:

yield_rank_res <-
  score_cor_spearman |>
  fit(yield ~ ., data = yield_train)

# The object contains the statistics:
yield_rank_res@results |> 
  arrange(desc(abs(score)))
## # A tibble: 57 × 4
##    name          score outcome predictor      
##    <chr>         <dbl> <chr>   <chr>          
##  1 cor_spearman  0.655 yield   man_proc_32    
##  2 cor_spearman -0.537 yield   man_proc_36    
##  3 cor_spearman  0.519 yield   bio_material_03
##  4 cor_spearman  0.502 yield   bio_material_06
##  5 cor_spearman  0.491 yield   man_proc_09    
##  6 cor_spearman  0.478 yield   bio_material_02
##  7 cor_spearman  0.446 yield   man_proc_33    
##  8 cor_spearman  0.421 yield   bio_material_12
##  9 cor_spearman -0.420 yield   man_proc_13    
## 10 cor_spearman  0.412 yield   bio_material_04
## # ℹ 47 more rows

To score via a random forest model, we only need to switch out the score object:

yield_rf_res <-
  score_imp_rf_oblique |>
  fit(yield ~ ., data = yield_train)

yield_rf_res@results |> 
  arrange(desc(abs(score)))
## # A tibble: 57 × 4
##    name            score outcome predictor      
##    <chr>           <dbl> <chr>   <chr>          
##  1 imp_rf_oblique 0.128  yield   man_proc_32    
##  2 imp_rf_oblique 0.0697 yield   man_proc_36    
##  3 imp_rf_oblique 0.0670 yield   man_proc_17    
##  4 imp_rf_oblique 0.0644 yield   man_proc_09    
##  5 imp_rf_oblique 0.0612 yield   man_proc_13    
##  6 imp_rf_oblique 0.0446 yield   bio_material_03
##  7 imp_rf_oblique 0.0315 yield   man_proc_33    
##  8 imp_rf_oblique 0.0263 yield   man_proc_11    
##  9 imp_rf_oblique 0.0263 yield   bio_material_04
## 10 imp_rf_oblique 0.0262 yield   bio_material_06
## # ℹ 47 more rows

We should probably combine the scores and do a joint ranking. To combine the two sets of statistics:

class_score_list <- list(yield_rank_res, yield_rf_res) |>
  bind_scores()

class_score_list
## # A tibble: 57 × 4
##    outcome predictor       cor_spearman imp_rf_oblique
##    <chr>   <chr>                  <dbl>          <dbl>
##  1 yield   bio_material_01        0.404        0.0178 
##  2 yield   bio_material_02        0.478        0.0190 
##  3 yield   bio_material_03        0.519        0.0446 
##  4 yield   bio_material_04        0.412        0.0263 
##  5 yield   bio_material_05        0.116        0.00639
##  6 yield   bio_material_06        0.502        0.0262 
##  7 yield   bio_material_07       -0.101        0.00151
##  8 yield   bio_material_08        0.369        0.00714
##  9 yield   bio_material_09        0.109        0.0122 
## 10 yield   bio_material_10        0.214        0.00998
## # ℹ 47 more rows

We can accomplish a joint ranking via desirability functions. Here, we set goals for each score (i.e., maximize, minimize, etc.). The algorithm rescales their values and uses a geometric mean for an overall ranking. The desirability2 package has some nice tools for this. Here’s how we do it:

library(desirability2)
class_score_list |>
  show_best_desirability_prop(
    maximize(cor_spearman, low = 0.25, high = 1),
    maximize(imp_rf_oblique, scale = 2)
  ) |> 
  arrange(desc(.d_overall)) |> 
  select(-starts_with(".d_max_"))
## # A tibble: 57 × 5
##    outcome predictor       cor_spearman imp_rf_oblique .d_overall
##    <chr>   <chr>                  <dbl>          <dbl>      <dbl>
##  1 yield   man_proc_32            0.655         0.128      0.735 
##  2 yield   man_proc_09            0.491         0.0644     0.291 
##  3 yield   bio_material_03        0.519         0.0446     0.217 
##  4 yield   man_proc_33            0.446         0.0315     0.134 
##  5 yield   bio_material_06        0.502         0.0262     0.129 
##  6 yield   bio_material_04        0.412         0.0263     0.104 
##  7 yield   bio_material_02        0.478         0.0190     0.0926
##  8 yield   bio_material_01        0.404         0.0178     0.0719
##  9 yield   bio_material_11        0.381         0.0194     0.0714
## 10 yield   man_proc_12            0.391         0.0183     0.0705
## # ℹ 47 more rows

Using the scale = 2 option puts more weight on the random forest results.

It is unlikely that users will work with filtro directly; it is much better to incorporate these feature selection tools inside a model workflow (as we will see below).

Now that we’ve looked at filtro, next up is the important package (yes, this is what we named it).

important

The important package does two things. First, it provides yet another tool for calculating random forest-like permutation importance scores. We highly value other packages that perform these same calculations (such as DALEX and vip). Our rationale for creating another package for this is that we’ve developed interfaces for censored regression, including dynamic metrics such as Brier scores or ROC curves that evaluate models at a specific time point. These dynamic methods aren’t available in other packages, and the peculiarities of these metrics make them difficult to incorporate into existing frameworks.

Other niceties about importance scores are that any metric from the yardstick package can be used, and we have optimized parallel processing for the underlying computations. For the latter feature, we support the future and mirai packages for parallel processing.

important also has three recipe steps for supervised feature selection (similar to what Steven Pawley did with his colino package). The steps are:

Let’s look at the last one, which mirrors our analysis above.

library(important)
goals <-
  desirability(
    maximize(cor_spearman, low = 0.25, high = 1),
    maximize(imp_rf_oblique, scale = 2)
  )

yield_rec <-
  recipe(yield ~ ., data = yield_train) |>
  step_impute_knn(all_predictors(), neighbors = 10) |>
  step_predictor_desirability(
    all_predictors(),
    score = goals,
    prop_terms = 1 / 10
  )
yield_rec
## 
## ── Recipe ───────────────────────────────────────────────────────
## 
## ── Inputs
## Number of variables by role
## outcome:    1
## predictor: 57
## 
## ── Operations
## • K-nearest neighbor imputation for: all_predictors()
## • Feature selection via desirability functions (`cor_spearman`
##   and `imp_rf_oblique`) on: all_predictors()

When combined with a specific model, we can tune the number of neighbors as well as the proportion of predictors retained (10% above).

prep() will do the appropriate estimation steps:

trained_rec <- prep(yield_rec)

Which 10% of the predictors were retained? The tidy() method can list the scores and their rankings:

scores <- tidy(trained_rec, number = 2)
scores |>
  arrange(desc(.d_overall)) |>
  select(-starts_with(".d_max_"), -id)
## # A tibble: 57 × 5
##    terms           removed cor_spearman imp_rf_oblique .d_overall
##    <chr>           <lgl>          <dbl>          <dbl>      <dbl>
##  1 man_proc_32     FALSE          0.655         0.128       0.735
##  2 man_proc_36     FALSE         -0.530         0.0668      0.325
##  3 man_proc_09     FALSE          0.491         0.0673      0.304
##  4 man_proc_13     FALSE         -0.420         0.0725      0.275
##  5 bio_material_03 FALSE          0.519         0.0517      0.249
##  6 bio_material_06 TRUE           0.502         0.0445      0.210
##  7 man_proc_17     TRUE          -0.303         0.0749      0.158
##  8 man_proc_33     TRUE           0.443         0.0374      0.156
##  9 bio_material_02 TRUE           0.478         0.0330      0.151
## 10 bio_material_04 TRUE           0.412         0.0347      0.133
## # ℹ 47 more rows
# What percentage was removed?
mean(scores$removed * 100)
## [1] 91.22807

Summary

Both filtro and important satisfy a feature for tidymodels that has been highly ranked in our user surveys: supervised feature selection. filtro contains the underlying framework and important provides recipe steps that can be used in a workflow.