We’ve sent a few packages to CRAN recently. Here’s a recap of the changes (and some notes at the bottom):
recipes 0.1.6
Breaking Changes
- Since 2018, a warning has been issued when the wrong argument was used in bake(recipe, newdata). The deprecation period is over and new_data is officially required.
- Previously, if step_other() did not collapse any levels, it would still add an “other” level to the factor. This would lump new factor levels into “other” when data were baked (as step_novel() does). This no longer occurs, since it was inconsistent with ?step_other, which said that: “If no pooling is done the data are unmodified”.
New Operations:
- step_normalize() centers and scales the data (if you are, like Max, too lazy to use two separate steps).
- step_unknown() will convert missing data in categorical columns to “unknown” and update factor levels.
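As a rough illustration of what centering and scaling amounts to, here is a base R sketch that does not call recipes at all. It assumes that scaling means dividing by the training-set standard deviation; the data values and the normalize() helper are made up for the example.

```r
# Training and new data for a single numeric predictor (made-up values).
train <- c(2, 4, 6, 8)
new   <- c(5, 10)

# Estimate the statistics once, from the training set only.
center <- mean(train)
spread <- sd(train)  # assumption: scaling uses the standard deviation

# Apply the same training-set statistics to any data set.
normalize <- function(x) (x - center) / spread

train_norm <- normalize(train)
new_norm   <- normalize(new)
```

The normalized training values have mean zero and unit standard deviation, while the new values are only shifted and scaled by the training statistics.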
Other Changes:
- If the threshold argument of step_other() is greater than one, it specifies the minimum sample size before the levels of the factor are collapsed into the “other” category. (#289)
- step_knnimpute() can now pass two options to the underlying knn code, including the number of threads. (#323)
- Because of changes by CRAN, step_nnmf() only works on versions of R >= 3.6.0 due to dependency issues.
- step_dummy() and step_other() are now tolerant to cases where that step’s selectors do not capture any columns. In this case, no modifications to the data are made. (#290, #348)
- step_dummy() can now retain the original columns that are used to make the dummy variables by setting preserve = TRUE. (#328)
- step_other()'s print method only reports the variables with collapsed levels (as opposed to any column that was tested to see if it needed collapsing). (#338)
- step_pca(), step_kpca(), step_ica(), step_nnmf(), step_pls(), and step_isomap() now accept zero components. In this case, the original data are returned. Please use this with great care.
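To make the two readings of the step_other() threshold concrete, here is a base R sketch. It is not the recipes implementation: collapse_rare() is a hypothetical helper, and treating the boundary as "greater than or equal to" is an assumption. It also returns the data unmodified when nothing needs pooling, as the documentation quoted earlier describes.

```r
# Hypothetical helper, not part of recipes.
collapse_rare <- function(x, threshold, other = "other") {
  counts <- table(x)
  # Assumption: a threshold above one is a minimum count;
  # otherwise it is a minimum proportion of the training data.
  keep <- if (threshold > 1) {
    names(counts)[counts >= threshold]
  } else {
    names(counts)[counts / length(x) >= threshold]
  }
  # If no level falls below the threshold, leave the data unmodified.
  if (length(keep) == length(counts)) {
    return(factor(x))
  }
  x <- as.character(x)
  x[!(x %in% keep)] <- other
  factor(x, levels = c(keep, other))
}

x <- c("a", "a", "a", "a", "b", "b", "b", "c", "d")

by_count  <- collapse_rare(x, threshold = 3)    # levels seen fewer than 3 times are pooled
by_rate   <- collapse_rare(x, threshold = 0.2)  # levels below 20% of the rows are pooled
unchanged <- collapse_rare(x, threshold = 0.1)  # nothing is rare enough to pool
```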
embed 0.0.3
Two new steps were added:
- step_umap() was added for both supervised and unsupervised encodings.
- step_woe() creates weight of evidence encodings. Thanks to Athos Petri Damiani for this.
rsample 0.0.5
- Added three functions to compute different bootstrap confidence intervals.
- A new function (add_resample_id()) augments a data frame with columns for the resampling identifier.
- Updated initial_split(), mc_cv(), vfold_cv(), bootstraps(), and group_vfold_cv() to use tidyselect on the stratification variable.
- Updated initial_split(), mc_cv(), vfold_cv(), and bootstraps() with a new breaks parameter that specifies the number of bins to stratify by for a numeric stratification variable.
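The idea behind stratifying on a numeric variable with a given number of bins can be sketched in base R. This is only an illustration: bin_numeric() is a hypothetical helper, and cutting at quantiles is an assumption about how the bins are formed, not a statement about rsample's internals.

```r
# Hypothetical helper: bin a numeric variable into `breaks` quantile-based groups.
bin_numeric <- function(x, breaks = 4) {
  cuts <- quantile(x, probs = seq(0, 1, length.out = breaks + 1))
  cut(x, breaks = unique(cuts), include.lowest = TRUE)
}

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
strata <- bin_numeric(x, breaks = 4)

# Stratified sampling: draw the same number of rows from every bin.
set.seed(1)
in_train <- unlist(lapply(
  split(seq_along(x), strata),
  function(idx) sample(idx, size = 2)
))
```

Sampling within each bin keeps the distribution of the numeric variable roughly the same in the sampled and held-out rows.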
parsnip 0.0.3.1
Unplanned release based on CRAN requirements for Solaris.
Breaking Changes
- The method that parsnip uses to store the model information has changed. Any custom models from previous versions will need to use the new method for registering models. The methods are detailed in ?get_model_env and the package vignette for adding models.
- The mode needs to be declared for models that can be used for more than one mode prior to fitting and/or translation.
- For surv_reg(), the engine that uses the survival package is now called survival instead of survreg.
- For glmnet models, the full regularization path is always fit regardless of the value given to penalty. Previously, the model was fit by passing penalty to glmnet's lambda argument, and the model could only make predictions at those specific values. (#195)
New Features
- add_rowindex() can add a column called .row to a data frame.
- If a computational engine is not explicitly set, a default will be used. Each default is documented on the corresponding model page. A warning is issued at fit time unless verbosity is zero.
- nearest_neighbor() gained a multi_predict method. The multi_predict() documentation is a little better organized.
- A suite of internal functions was added to help with upcoming model tuning features.
- A parsnip object now always saves the name(s) of the outcome variable(s) for proper naming of the predicted values.
corrr 0.4
New features
- A new dice() function wraps focus(x, ..., mirror = TRUE)
- A new retract() function does the opposite of stretch()
- A new argument was added to stretch() called remove.dups. It removes duplicates without removing all NAs.
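For readers unfamiliar with the long and wide layouts that stretch() and retract() convert between, here is a base R sketch of the round trip. It does not use corrr and does not reproduce its output format; the column names x, y, and r and the rule for dropping duplicate pairs are assumptions made for illustration.

```r
# A small correlation matrix from a built-in data set.
m <- cor(mtcars[, c("mpg", "hp", "wt")])
vars <- colnames(m)

# "Stretched" layout: one row per (x, y) pair.
long <- data.frame(
  x = vars[row(m)],
  y = vars[col(m)],
  r = as.vector(m),
  stringsAsFactors = FALSE
)

# Dropping duplicates: keep each unordered pair of distinct variables once.
unique_pairs <- long[match(long$x, vars) < match(long$y, vars), ]

# "Retracted" layout: rebuild the square matrix from the long form.
wide <- matrix(NA_real_, length(vars), length(vars), dimnames = list(vars, vars))
wide[cbind(long$x, long$y)] <- long$r
```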
Improvements
correlate()'s interface for databases was improved. It now only calculates unique pairs, and simplifies the formula that ultimately runs in-database. We also re-added the vignette to the package, which is also available on the site as an article.
tidypredict 0.4.3
New models
The new version is now able to parse the following models:
- cubist(), from the Cubist package
- ctree(), from the partykit package
- XGBoost trained models, via the xgboost package
New features
- Integration with broom's tidy() function. It works with regression models only
- Adds support for parsnip fitted models: lm, randomForest, ranger, and earth
- Adds the as_parsed_model() function. It adds the proper class components to the list. This allows any model exported in the correct spec to be read in by tidypredict. See the Save Models and Non-R Models articles for more information
Improvements
- Now supports classification models from ranger
Website
The package’s official website has been expanded greatly. Here are some highlights:
- An article for each supported model, found under Model List
- A how-to guide for saving and reloading models
- A guide to integrating non-R models with tidypredict
yardstick 0.0.4
New Metrics
Two new metrics have been added to yardstick:
- iic() is a numeric metric for computing the index of ideality of correlation. It is a potential alternative to the traditional correlation coefficient, and has been used in QSAR models (#115).
- average_precision() is a probability metric that can be used as an alternative to pr_auc(). It has the benefit of avoiding any issues of ambiguity in the edge case where recall == 0 and the current number of false positives is 0.
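One common definition of average precision is the sum of precision values weighted by the increase in recall at each ranked observation. The base R sketch below follows that definition; it is not yardstick's code, average_precision_sketch() is a hypothetical name, and tied scores are not handled. Because the sum starts at the first ranked observation, no precision value at recall == 0 is needed.

```r
# Hypothetical helper: truth is 1 for the event class and 0 otherwise,
# score is the predicted probability of the event.
average_precision_sketch <- function(truth, score) {
  ord <- order(score, decreasing = TRUE)  # rank observations by score
  hit <- truth[ord] == 1
  tp <- cumsum(hit)                       # true positives at each cutoff
  precision <- tp / seq_along(hit)
  recall <- tp / sum(hit)
  # Weight each precision value by the gain in recall at that cutoff.
  sum(diff(c(0, recall)) * precision)
}

truth <- c(1, 0, 1, 1, 0)
score <- c(0.9, 0.8, 0.7, 0.4, 0.2)
ap <- average_precision_sketch(truth, score)
```

For these made-up values, only the three events contribute, with precisions 1, 2/3, and 3/4, each weighted by a recall gain of 1/3.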
Improvements
- pr_curve() (and by extension pr_auc()) has been greatly improved to better handle edge cases when duplicate class probability values are present. Additionally, the first precision value in the curve is now a 1, rather than an NA, which results in a more practical curve, and generates a more correct AUC value (#93).
- Each metric function now has a direction attribute, which specifies the direction required for optimization, either minimization or maximization.
- Documentation for class probability metrics has been improved with more informative examples (#100).
- mn_log_loss() now uses the min/max rule before computing the log of the estimated probabilities to avoid problematic undefined log values (#103).
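The min/max rule for log loss can be illustrated with a two-class base R sketch. This is not yardstick's implementation: mn_log_loss_sketch() is a hypothetical name, and using .Machine$double.eps as the bound is an assumption. Probabilities are clamped into [eps, 1 - eps] before the logarithm is taken.

```r
# Hypothetical two-class helper: truth is 1 for the event and 0 otherwise,
# prob is the predicted probability of the event.
mn_log_loss_sketch <- function(truth, prob, eps = .Machine$double.eps) {
  # Min/max rule: keep probabilities strictly inside (0, 1).
  prob <- pmin(pmax(prob, eps), 1 - eps)
  -mean(truth * log(prob) + (1 - truth) * log(1 - prob))
}

truth <- c(1, 0, 1)
prob  <- c(1, 0, 0)  # hard 0/1 predictions trigger log(0)

# Without clamping, 0 * log(0) produces NaN and the mean is undefined.
unclipped <- -mean(truth * log(prob) + (1 - truth) * log(1 - prob))
clipped   <- mn_log_loss_sketch(truth, prob)
```

The clamped version returns a large but finite penalty for the confidently wrong third prediction instead of an undefined value.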
Upcoming Changes and Directions
We are currently working on two general use packages: workflows and tune. The former bundles together recipes, model object, and other items so that there can be single fit() and predict() methods. tune will have tools for… um… tuning models. We are hoping to make these public in the next month or so.
There will be some changes to accommodate model tuning. The dials package has been re-factored substantially (see the current GH master branch) and there were some small interface changes to recipes too (mostly backwards compatible and also on GH). We are pretty close to the end of “Phase I” of our tidymodels work.