We’re very pleased to announce the release of tidyclust 0.1.0. tidyclust is the tidymodels extension for working with clustering models. This package wouldn’t have been possible without the great work of Kelly Bodwin.
You can install it from CRAN with:
install.packages("tidyclust")
This blog post will introduce tidyclust, show how to use it with the rest of tidymodels, and demonstrate how to interact with and evaluate fitted clustering models.
library(tidymodels)
#> ── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
#> ✔ broom 1.0.1 ✔ recipes 1.0.3
#> ✔ dials 1.1.0 ✔ rsample 1.1.0
#> ✔ dplyr 1.0.10 ✔ tibble 3.1.8
#> ✔ ggplot2 3.4.0 ✔ tidyr 1.2.1
#> ✔ infer 1.0.4 ✔ tune 1.0.1
#> ✔ modeldata 1.0.1 ✔ workflows 1.1.2
#> ✔ parsnip 1.0.3 ✔ workflowsets 1.0.0
#> ✔ purrr 0.3.5 ✔ yardstick 1.1.0
#> ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
#> ✖ purrr::discard() masks scales::discard()
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
#> ✖ recipes::step() masks stats::step()
#> • Use suppressPackageStartupMessages() to eliminate package startup messages
library(tidyclust)
Specifying clustering models
The first thing we need to do is decide on the type of clustering model we want to fit. The pkgdown site lists all clustering specifications provided by tidyclust. We are slowly adding more types of models; suggestions in issues are highly welcome!
We will use a K-Means model for these examples, creating a specification with k_means(). As with other packages in tidymodels, tidyclust tries to use informative names for functions and arguments; as such, the argument denoting the number of clusters is num_clusters rather than k.
kmeans_spec <- k_means(num_clusters = 4) %>%
  set_engine("ClusterR")
kmeans_spec
#> K Means Cluster Specification (partition)
#>
#> Main Arguments:
#> num_clusters = 4
#>
#> Computational engine: ClusterR
We can use the set_engine(), set_mode(), and set_args() functions we are familiar with from parsnip.
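For example, the specification above could equivalently be built up piece by piece. This is a minimal sketch: “partition” is the only mode for k_means(), so the set_mode() call is shown purely to mirror the parsnip pattern.

# Start from a bare specification, then set the mode, arguments,
# and engine explicitly
kmeans_spec <- k_means() %>%
  set_mode("partition") %>%
  set_args(num_clusters = 4) %>%
  set_engine("ClusterR")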
The specification itself isn’t worth much if we don’t apply it to some data. We will use the ames data set from the modeldata package.
data("ames", package = "modeldata")
This data set contains a number of categorical variables that can’t be used directly with a K-Means model. Some light preprocessing can be done using the recipes package.
rec_spec <- recipe(~ ., data = ames) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors(), threshold = 0.8)
This recipe normalizes all of the numeric variables before applying PCA to produce a smaller set of uncorrelated features. Notice how we didn’t specify an outcome: clustering models are unsupervised, so there are no outcomes to predict.
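If we want to peek at the data the clustering model will actually see, we can prep and bake the recipe ourselves. This step is purely optional, as the workflow below handles it for us; a quick sketch:

# Estimate the recipe on ames and return the processed training data
rec_spec %>%
  prep() %>%
  bake(new_data = NULL)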
These two specifications can be combined in a workflow().
kmeans_wf <- workflow(rec_spec, kmeans_spec)
This workflow can then be fit to the ames data set.
kmeans_fit <- fit(kmeans_wf, data = ames)
kmeans_fit
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: k_means()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 4 Recipe Steps
#>
#> • step_dummy()
#> • step_zv()
#> • step_normalize()
#> • step_pca()
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#> KMeans Cluster
#> Call: ClusterR::KMeans_rcpp(data = data, clusters = clusters)
#> Data cols: 121
#> Centroids: 4
#> BSS/SS: 0.1003306
#> SS: 646321.6 = 581475.8 (WSS) + 64845.81 (BSS)
We have arbitrarily set the number of clusters to 4 above. If we wanted to figure out what values would be “optimal,” we would have to fit multiple models. We can do this with tune_cluster(); to make use of this function, though, we first need to use tune() to specify that num_clusters is the argument we want to try with multiple values.
kmeans_spec <- kmeans_spec %>%
  set_args(num_clusters = tune())
kmeans_wf <- workflow(rec_spec, kmeans_spec)
kmeans_wf
#> ══ Workflow ════════════════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: k_means()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 4 Recipe Steps
#>
#> • step_dummy()
#> • step_zv()
#> • step_normalize()
#> • step_pca()
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#> K Means Cluster Specification (partition)
#>
#> Main Arguments:
#> num_clusters = tune()
#>
#> Computational engine: ClusterR
We can use tune_cluster() in the same way we use tune_grid(), using bootstraps to fit multiple models for each value of num_clusters.
set.seed(1234)
boots <- bootstraps(ames, times = 10)
tune_res <- tune_cluster(
  kmeans_wf,
  resamples = boots
)
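Since we didn’t supply a grid, a default set of candidate values for num_clusters is generated for us. If we wanted control over the candidates, tune_cluster() also accepts a grid argument; a sketch, where the particular values 1 through 9 are just an illustrative choice:

# Try every number of clusters from 1 to 9 on each bootstrap resample
tune_res <- tune_cluster(
  kmeans_wf,
  resamples = boots,
  grid = tibble(num_clusters = 1:9)
)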
The different collect functions, such as collect_metrics(), work as they would with tune output.
collect_metrics(tune_res)
#> # A tibble: 18 × 7
#> num_clusters .metric .estimator mean n std_err .config
#> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
#> 1 6 sse_total standard 624435. 10 1675. Preprocessor1…
#> 2 6 sse_within_total standard 557147. 10 2579. Preprocessor1…
#> 3 1 sse_total standard 624435. 10 1675. Preprocessor1…
#> 4 1 sse_within_total standard 624435. 10 1675. Preprocessor1…
#> 5 3 sse_total standard 624435. 10 1675. Preprocessor1…
#> 6 3 sse_within_total standard 588001. 10 5703. Preprocessor1…
#> 7 5 sse_total standard 624435. 10 1675. Preprocessor1…
#> 8 5 sse_within_total standard 568085. 10 3821. Preprocessor1…
#> 9 9 sse_total standard 624435. 10 1675. Preprocessor1…
#> 10 9 sse_within_total standard 535120. 10 2262. Preprocessor1…
#> 11 2 sse_total standard 624435. 10 1675. Preprocessor1…
#> 12 2 sse_within_total standard 599762. 10 4306. Preprocessor1…
#> 13 8 sse_total standard 624435. 10 1675. Preprocessor1…
#> 14 8 sse_within_total standard 541813. 10 2506. Preprocessor1…
#> 15 4 sse_total standard 624435. 10 1675. Preprocessor1…
#> 16 4 sse_within_total standard 583604. 10 5523. Preprocessor1…
#> 17 7 sse_total standard 624435. 10 1675. Preprocessor1…
#> 18 7 sse_within_total standard 548299. 10 2907. Preprocessor1…
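One common way to use these metrics is an elbow chart: plot the mean within-cluster SSE against the number of clusters and look for the bend where adding more clusters stops paying off. A minimal ggplot2 sketch, assuming the tuning results above:

# Pull out the within-cluster SSE metric and plot it per num_clusters
tune_res %>%
  collect_metrics() %>%
  filter(.metric == "sse_within_total") %>%
  ggplot(aes(x = num_clusters, y = mean)) +
  geom_point() +
  geom_line() +
  labs(x = "Number of clusters", y = "Mean within-cluster SSE")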
Extraction
Going back to the first model we fit, tidyclust provides three main tools for interfacing with a fitted cluster model:
- extract cluster assignments
- extract centroid locations
- prediction with new data
Each of these tasks has a function associated with it. First, we have extract_cluster_assignment(), which can be used on fitted tidyclust objects, alone or as part of a workflow; it returns the cluster assignments as a factor named .cluster in a tibble.
extract_cluster_assignment(kmeans_fit)
#> # A tibble: 2,930 × 1
#> .cluster
#> <fct>
#> 1 Cluster_1
#> 2 Cluster_1
#> 3 Cluster_1
#> 4 Cluster_1
#> 5 Cluster_2
#> 6 Cluster_2
#> 7 Cluster_2
#> 8 Cluster_2
#> 9 Cluster_2
#> 10 Cluster_2
#> # … with 2,920 more rows
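The assignments correspond row-for-row to the training data, so a natural next step is to bind them onto the original data, for example to summarize each cluster. A sketch using dplyr (loaded with tidymodels); summarizing Sale_Price is just an illustrative choice:

# Attach cluster assignments to ames and summarize each cluster
ames %>%
  bind_cols(extract_cluster_assignment(kmeans_fit)) %>%
  group_by(.cluster) %>%
  summarise(n = n(), mean_sale_price = mean(Sale_Price))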
The locations of the clusters can be found using extract_centroids(), which again returns a tibble, with .cluster being a factor with the same levels as the one we got from extract_cluster_assignment().
extract_centroids(kmeans_fit)
#> # A tibble: 4 × 122
#> .cluster PC001 PC002 PC003 PC004 PC005 PC006 PC007 PC008 PC009
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Cluster_1 -5.76 0.713 11.9 2.80 4.09 3.44 1.26 -0.280 -0.486
#> 2 Cluster_2 3.98 -1.18 0.126 0.718 0.150 0.0554 -0.0460 -0.346 0.0599
#> 3 Cluster_3 -0.970 2.45 -0.604 -0.523 0.302 -0.298 -0.174 0.507 -0.153
#> 4 Cluster_4 -4.40 -2.30 -0.658 -0.671 -1.29 -0.00751 0.222 -0.250 0.223
#> # … with 112 more variables: PC010 <dbl>, PC011 <dbl>, PC012 <dbl>,
#> # PC013 <dbl>, PC014 <dbl>, PC015 <dbl>, PC016 <dbl>, PC017 <dbl>,
#> # PC018 <dbl>, PC019 <dbl>, PC020 <dbl>, PC021 <dbl>, PC022 <dbl>,
#> # PC023 <dbl>, PC024 <dbl>, PC025 <dbl>, PC026 <dbl>, PC027 <dbl>,
#> # PC028 <dbl>, PC029 <dbl>, PC030 <dbl>, PC031 <dbl>, PC032 <dbl>,
#> # PC033 <dbl>, PC034 <dbl>, PC035 <dbl>, PC036 <dbl>, PC037 <dbl>,
#> # PC038 <dbl>, PC039 <dbl>, PC040 <dbl>, PC041 <dbl>, PC042 <dbl>, …
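Since these centroids live in the principal component space created by our recipe, we can visualize them directly. A quick sketch of the first two components:

# Plot the centroid locations on the first two principal components
extract_centroids(kmeans_fit) %>%
  ggplot(aes(x = PC001, y = PC002, color = .cluster)) +
  geom_point(size = 3)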
Lastly, if the model has a notion that translates to “prediction,” then predict() will give you those results as well. In the case of K-Means, this is interpreted as “which centroid is this observation closest to?”
predict(kmeans_fit, new_data = slice_sample(ames, n = 10))
#> # A tibble: 10 × 1
#> .pred_cluster
#> <fct>
#> 1 Cluster_4
#> 2 Cluster_2
#> 3 Cluster_4
#> 4 Cluster_3
#> 5 Cluster_1
#> 6 Cluster_4
#> 7 Cluster_2
#> 8 Cluster_2
#> 9 Cluster_1
#> 10 Cluster_4
Please check the pkgdown site for more in-depth articles. We couldn’t be happier to have this package on CRAN, and we encourage you to check it out.
Acknowledgements
A big thanks to all the contributors: @aephidayatuloh, @avishaitsur, @bryanosborne, @cgoo4, @coforfe, @EmilHvitfeldt, @JauntyJJS, @kbodwin, @malcolmbarrett, @mattwarkentin, @ninohardt, @nipnipj, and @tomazweiss.