We’re very pleased to announce the release of tidyclust 0.1.0. tidyclust is the tidymodels extension for working with clustering models. This package wouldn’t have been possible without the great work of Kelly Bodwin.
You can install it from CRAN with:
install.packages("tidyclust")
This blog post will introduce tidyclust, show how to use it with the rest of tidymodels, and demonstrate how to interact with and evaluate fitted clustering models.
library(tidymodels)
#> ── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
#> ✔ broom 1.0.1 ✔ recipes 1.0.3
#> ✔ dials 1.1.0 ✔ rsample 1.1.0
#> ✔ dplyr 1.0.10 ✔ tibble 3.1.8
#> ✔ ggplot2 3.4.0 ✔ tidyr 1.2.1
#> ✔ infer 1.0.4 ✔ tune 1.0.1
#> ✔ modeldata 1.0.1 ✔ workflows 1.1.2
#> ✔ parsnip 1.0.3 ✔ workflowsets 1.0.0
#> ✔ purrr 0.3.5 ✔ yardstick 1.1.0
#> ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
#> ✖ purrr::discard() masks scales::discard()
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
#> ✖ recipes::step() masks stats::step()
#> • Use suppressPackageStartupMessages() to eliminate package startup messages
library(tidyclust)
Specifying clustering models
The first thing we need to do is decide on the type of clustering model we want to fit. The pkgdown site lists all clustering specifications provided by tidyclust. We are slowly adding more types of models; suggestions in issues are highly welcome!
We will use a K-Means model for these examples, creating a specification with k_means(). As with other packages in tidymodels, tidyclust tries to use informative names for functions and arguments; as such, the argument denoting the number of clusters is num_clusters rather than k.
kmeans_spec <- k_means(num_clusters = 4) %>%
  set_engine("ClusterR")
kmeans_spec
#> K Means Cluster Specification (partition)
#>
#> Main Arguments:
#> num_clusters = 4
#>
#> Computational engine: ClusterR
We can use the set_engine(), set_mode(), and set_args() functions we are familiar with from parsnip.
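For example, the specification above could equivalently be built up piece by piece. This is a minimal sketch: “partition” is the only mode for k_means(), so the set_mode() call is shown purely to mirror the parsnip pattern.

# Start from a bare specification, then set the mode, arguments,
# and engine explicitly
kmeans_spec <- k_means() %>%
  set_mode("partition") %>%
  set_args(num_clusters = 4) %>%
  set_engine("ClusterR")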
The specification itself isn’t worth much if we don’t apply it to some data. We will use the ames data set from the modeldata package.
data("ames", package = "modeldata")
This data set contains a number of categorical variables that can’t be used directly with a K-Means model. Some light preprocessing can be done using the recipes package.
rec_spec <- recipe(~ ., data = ames) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors(), threshold = 0.8)
This recipe normalizes all of the numeric variables before applying PCA to produce a smaller set of uncorrelated features. Notice how we didn’t specify an outcome: clustering models are unsupervised, so there are no outcomes to predict.
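If we want to peek at the data the clustering model will actually see, we can prep and bake the recipe ourselves. This step is purely optional, as the workflow below handles it for us; a quick sketch:

# Estimate the recipe on ames and return the processed training data
rec_spec %>%
  prep() %>%
  bake(new_data = NULL)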
These two specifications can be combined in a workflow().
kmeans_wf <- workflow(rec_spec, kmeans_spec)
This workflow can then be fit to the ames data set.
kmeans_fit <- fit(kmeans_wf, data = ames)
kmeans_fit
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: k_means()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 4 Recipe Steps
#>
#> • step_dummy()
#> • step_zv()
#> • step_normalize()
#> • step_pca()
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#> KMeans Cluster
#> Call: ClusterR::KMeans_rcpp(data = data, clusters = clusters)
#> Data cols: 121
#> Centroids: 4
#> BSS/SS: 0.1003306
#> SS: 646321.6 = 581475.8 (WSS) + 64845.81 (BSS)
We have arbitrarily set the number of clusters to 4 above. If we wanted to figure out what values would be “optimal,” we would have to fit multiple models. We can do this with tune_cluster(); to make use of this function, though, we first need to use tune() to specify that num_clusters is the argument we want to try with multiple values.
kmeans_spec <- kmeans_spec %>%
  set_args(num_clusters = tune())
kmeans_wf <- workflow(rec_spec, kmeans_spec)
kmeans_wf
#> ══ Workflow ════════════════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: k_means()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 4 Recipe Steps
#>
#> • step_dummy()
#> • step_zv()
#> • step_normalize()
#> • step_pca()
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#> K Means Cluster Specification (partition)
#>
#> Main Arguments:
#> num_clusters = tune()
#>
#> Computational engine: ClusterR
We can use tune_cluster() in the same way we use tune_grid(), using bootstraps to fit multiple models for each value of num_clusters.
set.seed(1234)
boots <- bootstraps(ames, times = 10)
tune_res <- tune_cluster(
  kmeans_wf,
  resamples = boots
)
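Since we didn’t supply a grid, a default set of candidate values for num_clusters is generated for us. If we wanted control over the candidates, tune_cluster() also accepts a grid argument; a sketch, where the particular values 1 through 9 are just an illustrative choice:

# Try every number of clusters from 1 to 9 on each bootstrap resample
tune_res <- tune_cluster(
  kmeans_wf,
  resamples = boots,
  grid = tibble(num_clusters = 1:9)
)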
The different collect functions, such as collect_metrics(), work as they would with tune output.
collect_metrics(tune_res)
#> # A tibble: 18 × 7
#> num_clusters .metric .estimator mean n std_err .config
#> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
#> 1 6 sse_total standard 624435. 10 1675. Preprocessor1…
#> 2 6 sse_within_total standard 557147. 10 2579. Preprocessor1…
#> 3 1 sse_total standard 624435. 10 1675. Preprocessor1…
#> 4 1 sse_within_total standard 624435. 10 1675. Preprocessor1…
#> 5 3 sse_total standard 624435. 10 1675. Preprocessor1…
#> 6 3 sse_within_total standard 588001. 10 5703. Preprocessor1…
#> 7 5 sse_total standard 624435. 10 1675. Preprocessor1…
#> 8 5 sse_within_total standard 568085. 10 3821. Preprocessor1…
#> 9 9 sse_total standard 624435. 10 1675. Preprocessor1…
#> 10 9 sse_within_total standard 535120. 10 2262. Preprocessor1…
#> 11 2 sse_total standard 624435. 10 1675. Preprocessor1…
#> 12 2 sse_within_total standard 599762. 10 4306. Preprocessor1…
#> 13 8 sse_total standard 624435. 10 1675. Preprocessor1…
#> 14 8 sse_within_total standard 541813. 10 2506. Preprocessor1…
#> 15 4 sse_total standard 624435. 10 1675. Preprocessor1…
#> 16 4 sse_within_total standard 583604. 10 5523. Preprocessor1…
#> 17 7 sse_total standard 624435. 10 1675. Preprocessor1…
#> 18 7 sse_within_total standard 548299. 10 2907. Preprocessor1…
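One common way to use these metrics is an elbow chart: plot the mean within-cluster SSE against the number of clusters and look for the bend where adding more clusters stops paying off. A minimal ggplot2 sketch, assuming the tuning results above:

# Pull out the within-cluster SSE metric and plot it per num_clusters
tune_res %>%
  collect_metrics() %>%
  filter(.metric == "sse_within_total") %>%
  ggplot(aes(x = num_clusters, y = mean)) +
  geom_point() +
  geom_line() +
  labs(x = "Number of clusters", y = "Mean within-cluster SSE")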
Extraction
Going back to the first model we fit, tidyclust provides three main tools for interfacing with a fitted cluster model:
- extract cluster assignments
- extract centroid locations
- prediction with new data
Each of these tasks has a function associated with it. First, we have extract_cluster_assignment(), which can be used on fitted tidyclust objects, alone or as part of a workflow; it returns the cluster assignments as a factor named .cluster in a tibble.
extract_cluster_assignment(kmeans_fit)
#> # A tibble: 2,930 × 1
#> .cluster
#> <fct>
#> 1 Cluster_1
#> 2 Cluster_1
#> 3 Cluster_1
#> 4 Cluster_1
#> 5 Cluster_2
#> 6 Cluster_2
#> 7 Cluster_2
#> 8 Cluster_2
#> 9 Cluster_2
#> 10 Cluster_2
#> # … with 2,920 more rows
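The assignments correspond row-for-row to the training data, so a natural next step is to bind them onto the original data, for example to summarize each cluster. A sketch using dplyr (loaded with tidymodels); summarizing Sale_Price is just an illustrative choice:

# Attach cluster assignments to ames and summarize each cluster
ames %>%
  bind_cols(extract_cluster_assignment(kmeans_fit)) %>%
  group_by(.cluster) %>%
  summarise(n = n(), mean_sale_price = mean(Sale_Price))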
The locations of the clusters can be found using extract_centroids(), which again returns a tibble, with .cluster being a factor with the same levels as the one we got from extract_cluster_assignment().
extract_centroids(kmeans_fit)
#> # A tibble: 4 × 122
#> .cluster PC001 PC002 PC003 PC004 PC005 PC006 PC007 PC008 PC009
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Cluster_1 -5.76 0.713 11.9 2.80 4.09 3.44 1.26 -0.280 -0.486
#> 2 Cluster_2 3.98 -1.18 0.126 0.718 0.150 0.0554 -0.0460 -0.346 0.0599
#> 3 Cluster_3 -0.970 2.45 -0.604 -0.523 0.302 -0.298 -0.174 0.507 -0.153
#> 4 Cluster_4 -4.40 -2.30 -0.658 -0.671 -1.29 -0.00751 0.222 -0.250 0.223
#> # … with 112 more variables: PC010 <dbl>, PC011 <dbl>, PC012 <dbl>,
#> # PC013 <dbl>, PC014 <dbl>, PC015 <dbl>, PC016 <dbl>, PC017 <dbl>,
#> # PC018 <dbl>, PC019 <dbl>, PC020 <dbl>, PC021 <dbl>, PC022 <dbl>,
#> # PC023 <dbl>, PC024 <dbl>, PC025 <dbl>, PC026 <dbl>, PC027 <dbl>,
#> # PC028 <dbl>, PC029 <dbl>, PC030 <dbl>, PC031 <dbl>, PC032 <dbl>,
#> # PC033 <dbl>, PC034 <dbl>, PC035 <dbl>, PC036 <dbl>, PC037 <dbl>,
#> # PC038 <dbl>, PC039 <dbl>, PC040 <dbl>, PC041 <dbl>, PC042 <dbl>, …
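Since these centroids live in the principal component space created by our recipe, we can visualize them directly. A quick sketch of the first two components:

# Plot the centroid locations on the first two principal components
extract_centroids(kmeans_fit) %>%
  ggplot(aes(x = PC001, y = PC002, color = .cluster)) +
  geom_point(size = 3)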
Lastly, if the model has a notion that translates to “prediction,” then predict() will give you those results as well. In the case of K-Means, this is interpreted as “which centroid is this observation closest to?”
predict(kmeans_fit, new_data = slice_sample(ames, n = 10))
#> # A tibble: 10 × 1
#> .pred_cluster
#> <fct>
#> 1 Cluster_4
#> 2 Cluster_2
#> 3 Cluster_4
#> 4 Cluster_3
#> 5 Cluster_1
#> 6 Cluster_4
#> 7 Cluster_2
#> 8 Cluster_2
#> 9 Cluster_1
#> 10 Cluster_4
Please check the pkgdown site for more in-depth articles. We couldn’t be happier to have this package on CRAN, and we encourage you to check it out.
Acknowledgements
A big thanks to all the contributors: @aephidayatuloh, @avishaitsur, @bryanosborne, @cgoo4, @coforfe, @EmilHvitfeldt, @JauntyJJS, @kbodwin, @malcolmbarrett, @mattwarkentin, @ninohardt, @nipnipj, and @tomazweiss.