We’re brimming with glee to announce the release of bonsai 0.3.0. bonsai is a parsnip extension package for tree-based models, and includes support for random forest and gradient-boosted tree frameworks like partykit and LightGBM. This most recent release of the package introduces support for the "aorsf" engine, which implements accelerated oblique random forests (Jaeger et al. 2022, 2024).
You can install it from CRAN with:
install.packages("bonsai")
This blog post will demonstrate a modeling workflow where the benefits of using oblique random forests shine through.
You can see a full list of changes in the release notes.
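Code in this post assumes the following packages are loaded (corrr supplies the correlation helpers used below):

library(tidymodels)
library(bonsai)
library(corrr)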
The meats data
The modeldata package, loaded automatically with the tidymodels meta-package, includes several example datasets to demonstrate modeling problems. We’ll make use of a dataset called meats
in this post. Each row is a measurement of a sample of finely chopped meat.
meats
#> # A tibble: 215 × 103
#> x_001 x_002 x_003 x_004 x_005 x_006 x_007 x_008 x_009 x_010 x_011 x_012 x_013
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2.62 2.62 2.62 2.62 2.62 2.62 2.62 2.62 2.63 2.63 2.63 2.63 2.64
#> 2 2.83 2.84 2.84 2.85 2.85 2.86 2.86 2.87 2.87 2.88 2.88 2.89 2.90
#> 3 2.58 2.58 2.59 2.59 2.59 2.59 2.59 2.60 2.60 2.60 2.60 2.61 2.61
#> 4 2.82 2.82 2.83 2.83 2.83 2.83 2.83 2.84 2.84 2.84 2.84 2.85 2.85
#> 5 2.79 2.79 2.79 2.79 2.80 2.80 2.80 2.80 2.81 2.81 2.81 2.82 2.82
#> 6 3.01 3.02 3.02 3.03 3.03 3.04 3.04 3.05 3.06 3.06 3.07 3.08 3.09
#> 7 2.99 2.99 3.00 3.01 3.01 3.02 3.02 3.03 3.04 3.04 3.05 3.06 3.07
#> 8 2.53 2.53 2.53 2.53 2.53 2.53 2.53 2.53 2.54 2.54 2.54 2.54 2.54
#> 9 3.27 3.28 3.29 3.29 3.30 3.31 3.31 3.32 3.33 3.33 3.34 3.35 3.36
#> 10 3.40 3.41 3.41 3.42 3.43 3.43 3.44 3.45 3.46 3.47 3.48 3.48 3.49
#> # ℹ 205 more rows
#> # ℹ 90 more variables: x_014 <dbl>, x_015 <dbl>, x_016 <dbl>, x_017 <dbl>,
#> # x_018 <dbl>, x_019 <dbl>, x_020 <dbl>, x_021 <dbl>, x_022 <dbl>,
#> # x_023 <dbl>, x_024 <dbl>, x_025 <dbl>, x_026 <dbl>, x_027 <dbl>,
#> # x_028 <dbl>, x_029 <dbl>, x_030 <dbl>, x_031 <dbl>, x_032 <dbl>,
#> # x_033 <dbl>, x_034 <dbl>, x_035 <dbl>, x_036 <dbl>, x_037 <dbl>,
#> # x_038 <dbl>, x_039 <dbl>, x_040 <dbl>, x_041 <dbl>, x_042 <dbl>, …
From that dataset’s documentation:
These data are recorded on a Tecator Infratec Food and Feed Analyzer… For each meat sample the data consists of a 100 channel spectrum of absorbances and the contents of moisture (water), fat and protein. The absorbance is -log10 of the transmittance measured by the spectrometer. The three contents, measured in percent, are determined by analytic chemistry.
We’ll try to predict the protein content, as a percentage, using the absorbance measurements.
Before we take a further look, let’s split up our data. I’ll first select off the two other possible outcome variables and, after splitting into training and testing sets, resample the data using 5-fold cross-validation with 2 repeats.
# drop the other two possible outcomes so only protein remains
meats <- meats %>% select(-water, -fat)

set.seed(1)

# split into training (default 75%) and testing sets
meats_split <- initial_split(meats)
meats_train <- training(meats_split)
meats_test <- testing(meats_split)

# resample the training data: 5-fold cross-validation, repeated twice
meats_folds <- vfold_cv(meats_train, v = 5, repeats = 2)
The tricky parts of this modeling problem are that:
- There are few observations to work with (215 total).
- The 100 absorbance measurements are highly correlated with one another.
Visualizing that correlation:
meats_train %>%
  correlate() %>%
  autoplot() +
  theme(axis.text.x = element_blank(), axis.text.y = element_blank())
#> Correlation computed with
#> • Method: 'pearson'
#> • Missing treated using: 'pairwise.complete.obs'
Almost all of the pairwise correlations between predictors are near 1; the only exceptions are the correlations between the last variable and everything else. That last variable with weaker correlation values? It’s the outcome.
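To put a number on that, here's one way to count the proportion of pairwise predictor correlations above 0.99 (a quick check on my part, not from the original workflow; stretch() is also from corrr):

meats_train %>%
  select(-protein) %>%
  correlate(quiet = TRUE) %>%
  # pivot the correlation matrix to one row per pair
  stretch(na.rm = TRUE) %>%
  summarize(prop_above_0.99 = mean(r > 0.99))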
Baseline models
There are several existing model implementations in tidymodels that are resilient to highly correlated predictors. The first one I’d probably reach for is an elastic net: an interpolation of the lasso and ridge regularized linear regression models. Evaluating that modeling approach against resamples:
# define a regularized linear model
spec_lr <-
  linear_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet")
# try out different penalization approaches
res_lr <- tune_grid(spec_lr, protein ~ ., meats_folds)
show_best(res_lr, metric = "rmse")
#> # A tibble: 5 × 8
#> penalty mixture .metric .estimator mean n std_err .config
#> <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
#> 1 0.0000324 0.668 rmse standard 1.24 10 0.0516 Preprocessor1_Mo…
#> 2 0.00000000524 0.440 rmse standard 1.25 10 0.0548 Preprocessor1_Mo…
#> 3 0.000000461 0.839 rmse standard 1.26 10 0.0538 Preprocessor1_Mo…
#> 4 0.00000550 0.965 rmse standard 1.26 10 0.0540 Preprocessor1_Mo…
#> 5 0.0000000489 0.281 rmse standard 1.26 10 0.0534 Preprocessor1_Mo…
show_best(res_lr, metric = "rsq")
#> # A tibble: 5 × 8
#> penalty mixture .metric .estimator mean n std_err .config
#> <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
#> 1 0.0000324 0.668 rsq standard 0.849 10 0.0126 Preprocessor1_Mo…
#> 2 0.00000000524 0.440 rsq standard 0.848 10 0.0128 Preprocessor1_Mo…
#> 3 0.000000461 0.839 rsq standard 0.846 10 0.0114 Preprocessor1_Mo…
#> 4 0.00000550 0.965 rsq standard 0.846 10 0.0111 Preprocessor1_Mo…
#> 5 0.0000000489 0.281 rsq standard 0.846 10 0.0126 Preprocessor1_Mo…
That best RMSE value of 1.24 gives us a baseline to work with, and the best R-squared of 0.85 seems like a good start.
Tree-based model implementations in tidymodels generally handle correlated predictors well. To keep the comparison with "aorsf" apples-to-apples, let’s use a different random forest engine to get a better sense of baseline performance:
spec_rf <-
  rand_forest(mtry = tune(), min_n = tune()) %>%
  # this is the default engine, but for consistency's sake:
  set_engine("ranger") %>%
  set_mode("regression")
res_rf <- tune_grid(spec_rf, protein ~ ., meats_folds)
#> i Creating pre-processing data to finalize unknown parameter: mtry
show_best(res_rf, metric = "rmse")
#> # A tibble: 5 × 8
#> mtry min_n .metric .estimator mean n std_err .config
#> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
#> 1 96 4 rmse standard 2.37 10 0.0905 Preprocessor1_Model08
#> 2 41 6 rmse standard 2.39 10 0.0883 Preprocessor1_Model01
#> 3 88 10 rmse standard 2.43 10 0.0816 Preprocessor1_Model06
#> 4 79 17 rmse standard 2.51 10 0.0740 Preprocessor1_Model07
#> 5 27 18 rmse standard 2.52 10 0.0778 Preprocessor1_Model04
show_best(res_rf, metric = "rsq")
#> # A tibble: 5 × 8
#> mtry min_n .metric .estimator mean n std_err .config
#> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
#> 1 96 4 rsq standard 0.424 10 0.0385 Preprocessor1_Model08
#> 2 41 6 rsq standard 0.409 10 0.0394 Preprocessor1_Model01
#> 3 88 10 rsq standard 0.387 10 0.0365 Preprocessor1_Model06
#> 4 79 17 rsq standard 0.353 10 0.0404 Preprocessor1_Model07
#> 5 27 18 rsq standard 0.346 10 0.0397 Preprocessor1_Model04
Not so hot. Just to show I’m not making a straw man here, I’ll evaluate a few more alternative modeling approaches behind the curtain and print out their best performance metrics (one of those specifications is sketched just after this list):
- Gradient boosted tree with LightGBM. Best RMSE: 2.34. Best R-squared: 0.43.
- Partial least squares regression. Best RMSE: 1.39. Best R-squared: 0.81.
- Support vector machine. Best RMSE: 2.28. Best R-squared: 0.46.
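For the curious, the LightGBM baseline would be specified along these lines (a sketch; which parameters were tuned is my assumption, but the "lightgbm" engine setup is what bonsai provides):

# a sketch of the LightGBM baseline; the tuned parameters here are
# illustrative, not necessarily the exact ones used above
spec_lgbm <-
  boost_tree(trees = tune(), learn_rate = tune(), min_n = tune()) %>%
  set_engine("lightgbm") %>%
  set_mode("regression")

res_lgbm <- tune_grid(spec_lgbm, protein ~ ., meats_folds)
show_best(res_lgbm, metric = "rmse")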
This is a tricky one.
Introducing accelerated oblique random forests
The 0.3.0 release of bonsai introduces support for accelerated oblique random forests via the "aorsf" engine for classification and regression in tidymodels. (Tidy survival modelers might note that we already support "aorsf" for censored regression via the censored parsnip extension package!)
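For those survival modelers, the censored-regression specification looks nearly identical (a sketch; it assumes the censored package and a survival outcome in place of protein):

library(censored)

# same engine, different mode; requires a Surv() outcome when fitting
spec_aorsf_surv <-
  rand_forest() %>%
  set_engine("aorsf") %>%
  set_mode("censored regression")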
Unlike trees in conventional random forests, which create splits using thresholds based on individual predictors (e.g. x_001 > 3), oblique random forests use linear combinations of predictors to create splits (e.g. 0.4 * x_001 + 0.6 * x_002 > 7.5) and have been shown to improve predictive performance relative to conventional random forests for a variety of applications (Menze et al. 2011). “Oblique” references the appearance of decision boundaries when a set of splits is plotted; I’ve grabbed a visual from the aorsf README that demonstrates:
In the above, we’d like to separate the purple dots from the orange squares. A tree in a traditional random forest, represented on the left, can only generate splits based on one of X1 or X2 at a time. A tree in an oblique random forest, represented on the right, can consider both X1 and X2 in creating decision boundaries, often resulting in stronger predictive performance.
Where does the “accelerated” come from? Generally, finding optimal oblique splits is computationally more intensive than finding single-predictor splits. The aorsf package uses Newton-Raphson scoring (the same algorithm under the hood in the survival package) to identify splits based on linear combinations of predictor variables. This approach speeds up that process greatly, resulting in fit times comparable to implementations of traditional random forests in R (and hundreds of times faster than existing oblique random forest implementations; Jaeger et al. 2024).
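You can get a feel for that speed by fitting an oblique random forest directly with the aorsf package (a minimal sketch using the training data from above):

library(aorsf)

# fit an oblique random forest regression to the training set
fit_aorsf <- orsf(meats_train, protein ~ .)

fit_aorsf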
The code to tune this model with the "aorsf" engine is the same as for "ranger", except we switch out the engine argument to set_engine():
spec_aorsf <-
  rand_forest(
    mtry = tune(),
    min_n = tune()
  ) %>%
  set_engine("aorsf") %>%
  set_mode("regression")
res_aorsf <- tune_grid(spec_aorsf, protein ~ ., meats_folds)
#> i Creating pre-processing data to finalize unknown parameter: mtry
show_best(res_aorsf, metric = "rmse")
#> # A tibble: 5 × 8
#> mtry min_n .metric .estimator mean n std_err .config
#> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
#> 1 87 11 rmse standard 0.786 10 0.0370 Preprocessor1_Model02
#> 2 98 8 rmse standard 0.789 10 0.0363 Preprocessor1_Model10
#> 3 48 5 rmse standard 0.793 10 0.0363 Preprocessor1_Model01
#> 4 16 17 rmse standard 0.803 10 0.0325 Preprocessor1_Model09
#> 5 31 18 rmse standard 0.813 10 0.0359 Preprocessor1_Model05
show_best(res_aorsf, metric = "rsq")
#> # A tibble: 5 × 8
#> mtry min_n .metric .estimator mean n std_err .config
#> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
#> 1 48 5 rsq standard 0.946 10 0.00446 Preprocessor1_Model01
#> 2 98 8 rsq standard 0.945 10 0.00482 Preprocessor1_Model10
#> 3 87 11 rsq standard 0.945 10 0.00484 Preprocessor1_Model02
#> 4 16 17 rsq standard 0.941 10 0.00370 Preprocessor1_Model09
#> 5 31 18 rsq standard 0.940 10 0.00547 Preprocessor1_Model05
Holy smokes. The best RMSE from aorsf is 0.79, a marked improvement on the elastic net’s previous best of 1.24, and the best R-squared is 0.95, much stronger than the previous best of 0.85 (also from the elastic net).
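To see how this holds up out of sample, one option is to finalize the best configuration and evaluate it once on the held-out test set (a sketch; your metrics on the test data will vary):

# finalize the best-performing aorsf configuration and
# evaluate it once on the held-out test set
best_aorsf <- select_best(res_aorsf, metric = "rmse")

final_res <-
  finalize_model(spec_aorsf, best_aorsf) %>%
  last_fit(protein ~ ., split = meats_split)

collect_metrics(final_res)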
Especially if your modeling problems involve few samples of many highly correlated predictors, give the "aorsf" modeling engine a whirl in your workflows and let us know what you think!
References
Byron C. Jaeger, Sawyer Welden, Kristin Lenoir, Jaime L. Speiser, Matthew W. Segar, Ambarish Pandey, and Nicholas M. Pajewski. 2024. “Accelerated and Interpretable Oblique Random Survival Forests.” Journal of Computational and Graphical Statistics 33(1): 192-207.
Byron C. Jaeger, Sawyer Welden, Kristin Lenoir, and Nicholas M. Pajewski. 2022. “aorsf: An R package for Supervised Learning Using the Oblique Random Survival Forest.” The Journal of Open Source Software.
Bjoern H. Menze, B. Michael Kelm, Daniel N. Splitthoff, Ullrich Koethe, and Fred A. Hamprecht. 2011. “On Oblique Random Forests.” In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 453-469. Springer.
Acknowledgements
Thank you to @bcjaeger, the aorsf author, for doing most of the work to implement aorsf support in bonsai. Thank you to @hfrick, @joranE, @jrosell, @nipnipj, @p-schaefer, @seb-mueller, and @tcovert for their contributions on the bonsai repository since version 0.2.1.