I am delighted to announce that broom 0.5.0 is now available on CRAN. broom 0.5.0 is a major new release featuring changes that affect both users and developers. See the News for a detailed list of changes.
This release was made possible by RStudio’s internship program, which has enabled me (Alex Hayes) to act as broom’s maintainer for the course of the summer. David Robinson continues to steer design decisions. Many thanks to both RStudio and Dave for this opportunity.
Tibble output
All tidiers should now return tibbles rather than data.frames. This allows broom to take advantage of the nice tibble print method and the more consistent behavior of tibbles:
library(broom)
fit <- lm(mpg ~ ., mtcars)
tidy(fit)
## # A tibble: 11 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 12.3 18.7 0.657 0.518
## 2 cyl -0.111 1.05 -0.107 0.916
## 3 disp 0.0133 0.0179 0.747 0.463
## 4 hp -0.0215 0.0218 -0.987 0.335
## 5 drat 0.787 1.64 0.481 0.635
## # ... with 6 more rows
These changes will most likely affect you when you:
- subset with [, which always returns a tibble.
- set rownames on a tibble, which is deprecated.
- use augment methods on models with matrix covariates specified in a formula, which will error.
augment() will error with matrix covariates because tibbles are more strict about their contents than data frames. More details are available below.
Deprecated tidiers still return data frames. Tidiers for mixed models also return data frames.
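As a small illustration of the subsetting difference mentioned above, the sketch below compares [ on a data frame and on a tibble. The objects df and tb are made up for this example and are not part of broom.

```r
library(tibble)

# Illustrative data only; not taken from broom
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- as_tibble(df)

# Selecting a single column with [ drops a data frame to a vector ...
class(df[, "x"])

# ... but a tibble stays a tibble
class(tb[, "x"])

# To get the underlying vector from a tibble, use [[ or $
tb[["x"]]
```

Code that relied on tidy(fit)[, "estimate"] returning a bare vector will likely need [[ or $ instead.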
New tidiers
broom 0.5.0 introduces tidiers for:
- lavaan objects from the lavaan package
- ivreg objects from the AER package
- Kendall objects from the Kendall package
- garch objects from the tseries package
- irlba lists from the irlba package
- durbinWatsonTest objects from the car package
- confusionMatrix objects from the caret package
- glmnet and cv.glmnet objects from the glmnetUtils package
- clm and clmm objects from the ordinal package
- svyolr objects from the survey package, and
- polr objects from the MASS package.
In addition to these new tidiers, this release includes fixes for a large number of bugs in existing tidiers.
New test suite
We are heavily invested in making it easier to contribute to broom, and also in making broom behavior more standardized and consistent. To this end, we’ve written new testing infrastructure. At the moment, the new tests mostly ensure tibble output. For example, tidy() output should now pass the following test:
td <- tidy(model)
check_tidy_output(td)
Similar tests exist for glance() and augment(). Stricter versions of these tests are under development for future releases.
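To give a rough feel for what "ensuring tibble output" means in practice, here is a minimal sketch. The helper is_tibble_output() is hypothetical and written only for this example; it is not how broom's own test helpers are implemented.

```r
library(broom)
library(tibble)

# Hypothetical helper for illustration; NOT broom's internal check.
# It only asserts that the result is a tibble without row names.
is_tibble_output <- function(x) {
  is_tibble(x) && !has_rownames(x)
}

fit <- lm(mpg ~ wt, mtcars)

td <- tidy(fit)
gl <- glance(fit)
au <- augment(fit)

stopifnot(
  is_tibble_output(td),
  is_tibble_output(gl),
  is_tibble_output(au)
)
```

A check along these lines can be handy in your own package if it depends on broom returning tibbles.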
Mixed models are moving to broom.mixed
As broom’s popularity has grown, broom has grown to encompass a fairly broad range of models. Dave and I have little to no experience with many of these models, and while we can fix bugs in the tidying code, we are no longer able to determine what constitutes a reasonable summary for many of these models.
Our intended solution is to split broom into several packages for tidying model objects. broom will provide tidiers for popular models (and those in base and stats), and then domain experts will manage domain-specific tidying packages. Currently we’re working on a spec for all of these sub-packages to implement. With any luck we’ll have a well-written spec to accompany the next release. We’d like all of the domain-specific tidying packages to eventually live in tidymodels, so that users can load a bunch of tools all at once with library(tidymodels). tidymodels will act as a meta-package that gathers together tidyverse-compatible tools for modelling. Max Kuhn has migrated a number of his packages to the tidymodels organization, and we plan to move broom in the near future.
Mixed-model tidiers have long been a bit of a mess in broom. A while back broom.mixed forked off to clean them up. broom.mixed is now a pilot for the larger project of splitting broom into domain-specific tidying packages. We anticipate that broom.mixed will make its way onto CRAN in the next several weeks, which will allow us to deprecate mixed model tidiers in broom 0.7.0. Although these models are not yet deprecated, there is currently no ongoing development work for them. In particular, the tidiers for:
- lme, lme4 and nlme models,
- brms models,
- rstanarm models, and
- mcmc objects
are one release away from deprecation, and effectively frozen.
New suggested workflow
When working with many models at the same time, we now recommend using list-columns and a nest()-map()-unnest() workflow. This mirrors similar moves across the rest of the tidyverse. We have updated the kmeans, broom and dplyr, and bootstrapping vignettes to reflect the new workflow. Additionally, we’ve updated the bootstrapping vignette to use rsample rather than the now-deprecated bootstrap() function. We no longer recommend the older group_by()-do() workflow.
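A minimal sketch of the nest()-map()-unnest() pattern is shown below. The choice of mtcars, the grouping by cyl and the formula mpg ~ wt are assumptions made purely for illustration; the vignettes cover fuller examples.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

# One linear model per cylinder group, stored in a list-column
fits <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%                                   # one row per group, data in a list-column
  mutate(
    model  = map(data, ~ lm(mpg ~ wt, data = .x)),
    tidied = map(model, tidy)                  # tidy each fitted model
  ) %>%
  ungroup()

# Flatten the tidied coefficients back into a regular tibble
coefs <- fits %>%
  select(cyl, tidied) %>%
  unnest(tidied)

coefs
```

Keeping the fitted models in a list-column means glance() and augment() results can be added later as further columns without refitting anything.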
New vignettes and documentation
The list of available tidiers has been moved out of the README and into the Available Methods vignette.
We also have two new vignettes that are strictly works in progress at the moment. The first covers Adding New Tidiers and seeks to make the barrier to entry for broom contributions as low as possible. The second contains a Glossary of terms we are developing for use in an upcoming release of broom. This glossary will standardize argument names across tidiers, and column names across tidy output.
We have also migrated to a new template-based documentation strategy. Repeated documentation material now lives in roxygen2 templates and can easily be added to a new tidy method. For an example of how this works, see R/aer-tidiers.R.
Deprecations
Hard deprecations
- inflate() has been removed from broom
- Matrix and vector tidiers have been deprecated in favor of tibble::as_tibble() and tibble::enframe()
- Dataframe tidiers and rowwise dataframe tidiers have been deprecated
- bootstrap() has been deprecated in favor of the rsample package
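For readers migrating away from the deprecated tidiers, the sketch below shows one way the suggested replacements might be used. The objects m and v, the formula mpg ~ wt and the number of resamples are illustrative assumptions, not code from broom.

```r
library(tibble)
library(rsample)
library(purrr)
library(broom)

# Instead of tidying a matrix: convert it with as_tibble()
m <- matrix(1:6, nrow = 2, dimnames = list(NULL, c("a", "b", "c")))
m_tbl <- as_tibble(m)

# Instead of tidying a named vector: enframe() gives name/value columns
v <- c(first = 1.5, second = 2.5)
v_tbl <- enframe(v)

# Instead of bootstrap(): draw resamples with rsample and fit on each one
set.seed(27)
boots <- bootstraps(mtcars, times = 20)
boot_fits <- map(boots$splits, ~ lm(mpg ~ wt, data = analysis(.x)))

# Tidy every refit and stack the results; .id records the replicate index
boot_coefs <- map_dfr(boot_fits, tidy, .id = "replicate")
```

The resulting boot_coefs tibble has one row per coefficient per resample, which is a convenient shape for summarising bootstrap distributions.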
Soft deprecations
The following functions will be deprecated in the next release of broom:
- sp tidying methods (in favor of sf)
- tidy.summaryDefault() (in favor of skimr::skim())
- tidy.table() (in favor of tibble::as_tibble())
- Mixed model and Bayesian tidiers
Contributors
Max Kuhn provided advice on dealing with model objects. Mara Averick provided feedback on drafts of this post.
An additional 38 fantastic contributors offered thoughtful comments on design, wrote bug reports and created PRs. The broom community has been kind, supportive and insightful, and I look forward to working with you all again!
@atyre2, @batpigandme, @bfgray3, @bmannakee, @briatte, @cawoodjm, @cimentadaj, @dan87134, @dmenne, @ekatko1, @ellessenne, @erleholgersen, @Hong-Revo, @huftis, @IndrajeetPatil, @jacob-long, @jarvisc1, @jenzopr, @jgabry, @jimhester, @josue-rodriguez, @karldw, @kfeilich, @larmarange, @lboller, @mariusbarth, @michaelweylandt, @mine-cetinkaya-rundel, @mkuehn10, @mvevans89, @nutterb, @ShreyasSingh, @stephlocke, @strengejacke, @topepo, @willbowditch, @WillemSleegers, and @wilsonfreitas
Additional details on tibbles and augment()
Data frames allow users to specify columns in a matrix, like so:
y <- rnorm(5)
x <- matrix(rnorm(10), nrow = 5)
df <- data.frame(x, y)
Tibbles do not:
library(tibble)
tibble(x, y)
## Error: Column `x` must be a 1d atomic vector or a list
Modelling functions will occasionally create a data frame like this, but since the model frame can’t be coerced to a tibble, augment() will fail:
fit <- lm(y ~ x, df)
augment(fit)
## Error: Column `x` must be a 1d atomic vector or a list
In some cases, explicitly passing the original dataset via the data argument can resolve this:
augment(fit, data = df)
## # A tibble: 5 x 10
## X1 X2 y .fitted .se.fit .resid .hat .sigma .cooksd
## * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.617 -0.720 -0.167 0.173 0.341 -0.340 0.661 0.108 1.26
## 2 -0.164 0.943 -0.389 -0.158 0.291 -0.232 0.480 0.499 0.181
## 3 -0.434 0.424 -0.339 -0.529 0.390 0.190 0.863 0.300 3.12
## 4 0.231 0.696 0.148 0.150 0.310 -0.00177 0.545 0.594 0.0000156
## 5 0.663 -0.274 0.704 0.320 0.282 0.384 0.451 0.290 0.418
## # ... with 1 more variable: .std.resid <dbl>
Support for matrix-columns is on the way in dplyr, and in a release cycle or two this won’t be an issue.
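Until then, another possible workaround is to avoid the matrix covariate altogether by naming the individual columns in the formula. The sketch below assumes the columns X1 and X2 that data.frame() creates from a two-column matrix, as in the example above; the object names are illustrative.

```r
library(broom)
library(tibble)

set.seed(1)
y <- rnorm(5)
x <- matrix(rnorm(10), nrow = 5)

# data.frame() splits the unnamed matrix into columns X1 and X2
df <- data.frame(x, y)

# Referring to the columns by name keeps the model frame free of matrix columns
fit2 <- lm(y ~ X1 + X2, data = df)
aug <- augment(fit2)
aug
```

This only helps when the covariate columns can reasonably be listed by hand; for wide matrices, passing data explicitly may be more practical.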