Today, I wanted to talk a little bit about the renewed rowwise() function that makes it easy to perform operations “row-by-row”. I’ll show how you can use rowwise() to compute summaries “by row”, talk about how rowwise() is a natural pairing with list-columns, and show a couple of use cases that I think are particularly elegant. You can learn more about all of these topics in
vignette("rowwise").
Update: as of June 1, dplyr 1.0.0 is now available on CRAN! Read
all about it or install it now with install.packages("dplyr").
Basic operation
rowwise() works like group_by() in the sense that it doesn’t change what the data looks like; it changes how dplyr verbs operate on the data. Let’s see how this works with a simple example. Here I have some imaginary test results for students in a class:
library(dplyr, warn.conflicts = FALSE)
df <- tibble(
student_id = 1:4,
test1 = 10:13,
test2 = 20:23,
test3 = 30:33,
test4 = 40:43
)
df
#> # A tibble: 4 x 5
#> student_id test1 test2 test3 test4
#> <int> <int> <int> <int> <int>
#> 1 1 10 20 30 40
#> 2 2 11 21 31 41
#> 3 3 12 22 32 42
#> 4 4 13 23 33 43
I’d like to be able to compute the mean of the test scores for each student, but mutate() and mean() don’t do what I want:
df %>% mutate(avg = mean(c(test1, test2, test3, test4)))
#> # A tibble: 4 x 6
#> student_id test1 test2 test3 test4 avg
#> <int> <int> <int> <int> <int> <dbl>
#> 1 1 10 20 30 40 26.5
#> 2 2 11 21 31 41 26.5
#> 3 3 12 22 32 42 26.5
#> 4 4 13 23 33 43 26.5
The problem is that I’m getting a mean over the whole data frame, not for each student. I can resolve this problem of getting a mean for each student by creating a “row-wise” data frame with rowwise():
rf <- rowwise(df, student_id)
rowwise() doesn’t need any additional arguments unless you have variables that identify the rows, like student_id here. Much like grouping variables, identifier variables will be automatically preserved when you summarise() the data.
rf
#> # A tibble: 4 x 5
#> # Rowwise: student_id
#> student_id test1 test2 test3 test4
#> <int> <int> <int> <int> <int>
#> 1 1 10 20 30 40
#> 2 2 11 21 31 41
#> 3 3 12 22 32 42
#> 4 4 13 23 33 43
rf looks very similar to df, but behaves very differently:
rf %>% mutate(avg = mean(c(test1, test2, test3, test4)))
#> # A tibble: 4 x 6
#> # Rowwise: student_id
#> student_id test1 test2 test3 test4 avg
#> <int> <int> <int> <int> <int> <dbl>
#> 1 1 10 20 30 40 25
#> 2 2 11 21 31 41 26
#> 3 3 12 22 32 42 27
#> 4 4 13 23 33 43 28
An additional advantage of rowwise() is that it’s paired with
c_across(), which works like c() but uses the same tidyselect syntax as across(). That makes it easy to operate on multiple variables:
rf %>% mutate(avg = mean(c_across(starts_with("test"))))
#> # A tibble: 4 x 6
#> # Rowwise: student_id
#> student_id test1 test2 test3 test4 avg
#> <int> <int> <int> <int> <int> <dbl>
#> 1 1 10 20 30 40 25
#> 2 2 11 21 31 41 26
#> 3 3 12 22 32 42 27
#> 4 4 13 23 33 43 28
Other ways of achieving the same result
Some summary functions have alternative ways of computing row-wise summaries that take advantage of built-in vectorisation. For example, if you wanted to compute the sum, you could use +:
df %>% mutate(total = test1 + test2 + test3 + test4)
#> # A tibble: 4 x 6
#> student_id test1 test2 test3 test4 total
#> <int> <int> <int> <int> <int> <int>
#> 1 1 10 20 30 40 100
#> 2 2 11 21 31 41 104
#> 3 3 12 22 32 42 108
#> 4 4 13 23 33 43 112
And you could use the same basic idea to compute the mean:
df %>% mutate(avg = (test1 + test2 + test3 + test4) / 4)
#> # A tibble: 4 x 6
#> student_id test1 test2 test3 test4 avg
#> <int> <int> <int> <int> <int> <dbl>
#> 1 1 10 20 30 40 25
#> 2 2 11 21 31 41 26
#> 3 3 12 22 32 42 27
#> 4 4 13 23 33 43 28
Another family of summary functions have “parallel” extensions where you can provide multiple variables in the arguments:
df %>% mutate(
min = pmin(test1, test2, test3, test4),
max = pmax(test1, test2, test3, test4),
string = paste(test1, test2, test3, test4, sep = "-")
)
#> # A tibble: 4 x 8
#> student_id test1 test2 test3 test4 min max string
#> <int> <int> <int> <int> <int> <int> <int> <chr>
#> 1 1 10 20 30 40 10 40 10-20-30-40
#> 2 2 11 21 31 41 11 41 11-21-31-41
#> 3 3 12 22 32 42 12 42 12-22-32-42
#> 4 4 13 23 33 43 13 43 13-23-33-43
Where these functions exist, they’ll usually be faster than rowwise(). The advantage of rowwise() is that it works with any function, not just those that are already vectorised.
List-columns
rowwise() is useful for computing simple summaries, but its real power comes when you use it with list-columns. Because lists can contain anything, you can use list-columns to keep related objects together, regardless of what type of thing they are. List-columns give you a convenient storage mechanism and rowwise() gives you a convenient computation mechanism.
Let’s make those ideas concrete by creating a data frame with a list-column. A little later, we’ll come back to how you might actually get a list-column in a more realistic situation. The following data frame uses list columns to store things that would otherwise be challenging:
xcontains vectors of different lengths.ycontains vectors of different typeszcontains functions, which can’t usually live in a data frame.
df <- tibble(
x = list(1, 2:3, 4:6),
y = list(TRUE, 1, "a"),
z = list(sum, mean, sd)
)
df
#> # A tibble: 3 x 3
#> x y z
#> <list> <list> <list>
#> 1 <dbl [1]> <lgl [1]> <fn>
#> 2 <int [2]> <dbl [1]> <fn>
#> 3 <int [3]> <chr [1]> <fn>
When you have list-columns in a row-wise data frame, you can easily compute with each element of the list:
df %>%
rowwise() %>%
summarise(
x_length = length(x),
y_type = typeof(y),
z_call = z(1:5)
)
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 3 x 3
#> x_length y_type z_call
#> <int> <chr> <dbl>
#> 1 1 logical 15
#> 2 2 double 3
#> 3 3 character 1.58
This makes a row-wise mutate() or summarise() a general vectorisation tool, in the same way as the apply family in base R or the map family in purrr do. It’s now much simpler to solve a number of problems where we previously recommended learning about map(), map2(), pmap() and friends.
Use cases
To finish up, I wanted to show off a couple of use cases where I think rowwise() provides a really elegant solution: simulations and modelling.
Simulation
The basic idea of using rowwise() to perform simulation is to store all your simulation paramters in a data frame:
df <- tribble(
~id, ~ n, ~ min, ~ max,
1, 3, 0, 1,
2, 2, 10, 100,
3, 2, 100, 1000,
)
Then you can either generate a list-column containing the simulated values with mutate():
df %>%
rowwise(id) %>%
mutate(data = list(runif(n, min, max)))
#> # A tibble: 3 x 5
#> # Rowwise: id
#> id n min max data
#> <dbl> <dbl> <dbl> <dbl> <list>
#> 1 1 3 0 1 <dbl [3]>
#> 2 2 2 10 100 <dbl [2]>
#> 3 3 2 100 1000 <dbl [2]>
Or take advantage of
summarise()'s new capabilities and return one element per row:
df %>%
rowwise(id) %>%
summarise(x = runif(n, min, max))
#> `summarise()` regrouping output by 'id' (override with `.groups` argument)
#> # A tibble: 7 x 2
#> # Groups: id [3]
#> id x
#> <dbl> <dbl>
#> 1 1 0.579
#> 2 1 0.269
#> 3 1 0.120
#> 4 2 29.0
#> 5 2 50.8
#> 6 3 451.
#> 7 3 985.
Note that id is preserved in the output here because we defined it as an identifier variable in the call to rowwise().
vignette("rowwise") expands on this idea to show how you can generate parameter grids and vary the random distribution used in each row.
Group-wise models
The new nest_by() function works similarly to group_by() but instead of storing the grouping data as metadata, visibly changes the structure. Now we have three rows (one for each group), and we have a list-col, data, that stores the data for that group. Also note that the output is a rowwise() object; this is important because it’s going to make working with that list of data frames much easier.
by_cyl <- mtcars %>% nest_by(cyl)
by_cyl
#> # A tibble: 3 x 2
#> # Rowwise: cyl
#> cyl data
#> <dbl> <list<tbl_df[,10]>>
#> 1 4 [11 × 10]
#> 2 6 [7 × 10]
#> 3 8 [14 × 10]
Now we can use mutate() to fit a model to each data frame:
by_cyl <- by_cyl %>% mutate(model = list(lm(mpg ~ wt, data = data)))
by_cyl
#> # A tibble: 3 x 3
#> # Rowwise: cyl
#> cyl data model
#> <dbl> <list<tbl_df[,10]>> <list>
#> 1 4 [11 × 10] <lm>
#> 2 6 [7 × 10] <lm>
#> 3 8 [14 × 10] <lm>
(Note that we need to wrap the output of lm() into a list; if you forget this, the error message will remind you.)
And then extract model summaries or coefficients with summarise() and
broom functions:
by_cyl %>% summarise(broom::glance(model))
#> `summarise()` regrouping output by 'cyl' (override with `.groups` argument)
#> # A tibble: 3 x 12
#> # Groups: cyl [3]
#> cyl r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 4 0.509 0.454 3.33 9.32 0.0137 2 -27.7 61.5 62.7
#> 2 6 0.465 0.357 1.17 4.34 0.0918 2 -9.83 25.7 25.5
#> 3 8 0.423 0.375 2.02 8.80 0.0118 2 -28.7 63.3 65.2
#> # … with 2 more variables: deviance <dbl>, df.residual <int>
by_cyl %>% summarise(broom::tidy(model))
#> `summarise()` regrouping output by 'cyl' (override with `.groups` argument)
#> # A tibble: 6 x 6
#> # Groups: cyl [3]
#> cyl term estimate std.error statistic p.value
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 4 (Intercept) 39.6 4.35 9.10 0.00000777
#> 2 4 wt -5.65 1.85 -3.05 0.0137
#> 3 6 (Intercept) 28.4 4.18 6.79 0.00105
#> 4 6 wt -2.78 1.33 -2.08 0.0918
#> 5 8 (Intercept) 23.9 3.01 7.94 0.00000405
#> 6 8 wt -2.19 0.739 -2.97 0.0118