The goal of strapgod is to make it easy to create virtual groups on top of tibbles for use with resampling. This means that your tibble is grouped, but the groups are not “materialized” until you actually need them. Deferring that work can make computations involving large numbers of bootstraps or resamples much more efficient.
There are two core functions that help you generate a resampled_df object.
bootstrapify() takes a data frame and bootstraps the rows of that data frame a set number of times to generate the virtual groups.
library(strapgod)
library(dplyr)

iris_boot <- bootstrapify(iris, times = 10)
nrow(iris)
#> [1] 150
nrow(iris_boot)
#> [1] 150
iris_boot
#> # A tibble: 150 x 5
#> # Groups: .bootstrap [10]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> # … with 140 more rows
What you’ll immediately notice is that:
The tibble still only has 150 rows.
The tibble is now grouped by .bootstrap, which isn’t a column in the tibble.
The invisible .bootstrap column is the virtual group. It hasn’t been materialized (there are still only 150 rows, not 150 * 10 rows), but dplyr still seems to know about it.
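A quick way to see the virtual group for yourself is to compare the column names with what dplyr's grouping helpers report. This is a minimal check, assuming strapgod and dplyr are installed; group_vars() and n_groups() are standard dplyr helpers, not strapgod functions.

```r
library(strapgod)
library(dplyr)

iris_boot <- bootstrapify(iris, times = 10)

# The virtual group is not a physical column of the tibble ...
".bootstrap" %in% names(iris_boot)

# ... but dplyr's grouping helpers still report it.
group_vars(iris_boot)
n_groups(iris_boot)
```

The first expression should be FALSE, while the helpers should report a single grouping variable with 10 groups.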
samplify() is the other function that can generate resampled tibbles. It is a slight generalization of bootstrapify() that also lets you specify the size of each resample and whether to resample with replacement.
iris_samp <- samplify(iris, times = 10, size = 20, replace = FALSE)
iris_samp
#> # A tibble: 150 x 5
#> # Groups: .sample [10]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> # … with 140 more rows
This result:
Has 10 resamples
Each one is of size 20
And the resampling was done without replacement each time
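These three properties can be checked against the row indices stored for each resample. The sketch below assumes strapgod and dplyr are installed and uses dplyr's group_data() to reach the indices; the variable name rows is only for illustration.

```r
library(strapgod)
library(dplyr)

iris_samp <- samplify(iris, times = 10, size = 20, replace = FALSE)

# Row indices backing each virtual resample (one integer vector per group)
rows <- group_data(iris_samp)$.rows

# Number of resamples
length(rows)

# Distinct resample sizes (a single value means every resample has that size)
unique(lengths(rows))

# TRUE here would mean some row was drawn twice within a resample
any(vapply(rows, anyDuplicated, integer(1)) > 0)
```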
What can you do with these neat resampled data frames? Great question! For one thing, you can summarise() the tibble to compute bootstrapped summaries quickly and efficiently.
# without the bootstrap
iris %>%
  summarise(
    mean_length = mean(Sepal.Length)
  )
#> # A tibble: 1 x 1
#> mean_length
#> <dbl>
#> 1 5.84
# with the bootstrap
iris %>%
  bootstrapify(10) %>%
  summarise(
    mean_length = mean(Sepal.Length)
  )
#> # A tibble: 10 x 2
#> .bootstrap mean_length
#> <int> <dbl>
#> 1 1 5.90
#> 2 2 5.75
#> 3 3 5.82
#> 4 4 5.94
#> 5 5 5.82
#> 6 6 5.86
#> 7 7 5.77
#> 8 8 5.86
#> 9 9 5.80
#> 10 10 5.89
This makes it easy to compute bootstrapped estimates of individual statistics, along with bootstrapped standard deviations around those estimates.
iris %>%
  bootstrapify(10) %>%
  summarise(mean_length = mean(Sepal.Length)) %>%
  summarise(
    bootstrapped_mean = mean(mean_length),
    bootstrapped_sd = sd(mean_length)
  )
#> # A tibble: 1 x 2
#> bootstrapped_mean bootstrapped_sd
#> <dbl> <dbl>
#> 1 5.86 0.0524
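For comparison, the same two quantities can be sketched in base R without strapgod. This is only an illustration of what is being estimated, not how strapgod computes it; the name boot_means is hypothetical, and the numbers will differ from the output above because the resamples are random.

```r
# Resample the column with replacement, compute the statistic on each
# resample, then summarise the 10 replicate values.
boot_means <- replicate(
  10,
  mean(sample(iris$Sepal.Length, replace = TRUE))
)

# Bootstrapped estimate of the mean
mean(boot_means)

# Spread of the estimate across resamples
sd(boot_means)
```

The standard deviation of the replicate means is what is commonly called the bootstrap standard error of the mean.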
If you want, you can take an existing grouped data frame and bootstrapify that as well, allowing you to compute bootstrapped statistics across some other variable.
iris_group_strap <- iris %>%
  group_by(Species) %>%
  bootstrapify(100)
iris_group_strap
#> # A tibble: 150 x 5
#> # Groups: Species, .bootstrap [300]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> # … with 140 more rows
Reusing the code from above, we can now compute bootstrapped estimates for the mean Sepal.Length of each Species, along with standard deviations around those estimates.
iris_group_strap %>%
  summarise(mean_length = mean(Sepal.Length)) %>%
  summarise(
    bootstrapped_mean = mean(mean_length),
    bootstrapped_sd = sd(mean_length)
  )
#> # A tibble: 3 x 3
#> Species bootstrapped_mean bootstrapped_sd
#> <fct> <dbl> <dbl>
#> 1 setosa 5.01 0.0488
#> 2 versicolor 5.95 0.0784
#> 3 virginica 6.58 0.0815
The virtual groups are stored in the group_data() metadata of the resampled_df object. Every grouped data frame has one of these, and they are used internally to power the dplyr group_by() system.
group_data(iris_boot)
#> # A tibble: 10 x 2
#> .bootstrap .rows
#> <int> <list>
#> 1 1 <int [150]>
#> 2 2 <int [150]>
#> 3 3 <int [150]>
#> 4 4 <int [150]>
#> 5 5 <int [150]>
#> 6 6 <int [150]>
#> 7 7 <int [150]>
#> 8 8 <int [150]>
#> 9 9 <int [150]>
#> 10 10 <int [150]>
The .bootstrap column contains the unique values of the groups, and the .rows column is a list column, where each element is an integer vector. That integer vector holds the rows that belong to that specific group. So, for .bootstrap == 1, there is a vector with 150 integers identifying the rows belonging to that resample.
group_data(iris_boot)$.rows[[1]]
#> [1] 14 50 118 43 14 118 90 91 91 92 137 99 72 26 7 137 78 81
#> [19] 43 103 117 76 143 32 109 7 137 74 23 53 135 53 34 69 72 76
#> [37] 63 141 97 91 38 21 41 90 60 16 116 94 6 86 86 39 118 50
#> [55] 34 4 13 69 127 52 22 89 25 35 112 30 140 121 110 64 142 67
#> [73] 122 79 85 136 51 74 106 98 74 127 17 46 54 110 94 79 24 113
#> [91] 107 135 102 135 5 70 16 24 32 21 55 75 83 39 54 137 48 77
#> [109] 83 111 39 1 30 94 16 88 54 20 104 93 52 108 22 42 59 84
#> [127] 11 121 136 46 85 109 107 77 36 142 16 125 33 40 10 125 9 7
#> [145] 135 61 63 54 26 33
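Because each element of .rows is just a vector of row positions, a per-resample statistic can be reproduced by indexing into the original data. The sketch below, assuming strapgod and dplyr are installed, computes the mean Sepal.Length of every resample by hand and compares it with summarise(); the names manual and via_summarise are only for illustration, and this is not meant to describe strapgod's internal code.

```r
library(strapgod)
library(dplyr)

iris_boot <- bootstrapify(iris, times = 10)

# One integer vector of row positions per virtual group
rows <- group_data(iris_boot)$.rows

# Index the original column with each vector and take the mean
manual <- vapply(
  rows,
  function(idx) mean(iris$Sepal.Length[idx]),
  numeric(1)
)

# The same statistic computed through the resampled tibble
via_summarise <- summarise(iris_boot, mean_length = mean(Sepal.Length))$mean_length

all.equal(manual, via_summarise)
```

If the two approaches agree, all.equal() returns TRUE, which is what you would expect when both use the same row indices.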
When a call to collect() is made, this row index information is used to construct the output. Essentially, we start with the group_data() and use the .rows info to replicate the rows of the original data frame for each group, building up the complete resampled data frame. Notice how we now have the 150 * 10 = 1500 rows from the 10 bootstraps.
collect(iris_boot)
#> # A tibble: 1,500 x 6
#> # Groups: .bootstrap [10]
#> .bootstrap Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <int> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 1 4.3 3 1.1 0.1 setosa
#> 2 1 5 3.3 1.4 0.2 setosa
#> 3 1 7.7 3.8 6.7 2.2 virginica
#> 4 1 4.4 3.2 1.3 0.2 setosa
#> 5 1 4.3 3 1.1 0.1 setosa
#> 6 1 7.7 3.8 6.7 2.2 virginica
#> 7 1 5.5 2.5 4 1.3 versicolor
#> 8 1 5.5 2.6 4.4 1.2 versicolor
#> 9 1 5.5 2.6 4.4 1.2 versicolor
#> 10 1 6.1 3 4.6 1.4 versicolor
#> # … with 1,490 more rows
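The materialization step can be imitated in base R to make the idea concrete. This is a rough sketch under stated assumptions, not strapgod's implementation: rows is a hypothetical stand-in for the .rows list column, and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
n <- nrow(iris)

# Hypothetical stand-in for .rows: 10 index vectors, each a bootstrap of 1:n
rows <- replicate(10, sample.int(n, n, replace = TRUE), simplify = FALSE)

# Replicate the original rows for every group and label each block
materialized <- cbind(
  .bootstrap = rep(seq_along(rows), lengths(rows)),
  iris[unlist(rows), ]
)
rownames(materialized) <- NULL

dim(materialized)  # 1500 rows, 6 columns
```

Until this step, only the 10 index vectors need to be stored alongside the original 150 rows, which is where the efficiency of virtual groups comes from.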
To learn more about collect() and the other supported dplyr functions in strapgod, read vignette("dplyr-support", "strapgod").