example_code/resampling_example.Rmd at main · ldp-wg-aug21/example_code · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
---
title: "Resampling example"
author: "Joey Burant"
date: "1/29/2022"
output: html_document
---

```{r setup, include = FALSE}

knitr::opts_chunk$set(echo = TRUE)

## required packages
library(tidyverse)
library(palmerpenguins)
library(patchwork)

```

This is a quick example of how to resample the data and plot the distributions of sampled traits, making use of the `penguins` dataset
from the `palmerpenguins` **R** package.

### Resampling the data using `dplyr::sample_()` and `base::replicate()`

We can easily randomly resample either a specific number of observations (using `dplyr::sample_n()`) or a set proportion of the data (using `dplyr::sample_frac()`). The `replicate()` function provides a really easy way to iterate this process a number of times.

The code chunk below uses two variables:

- `n_reps` = the number of desired replicates, which is passed to `replicate()` -- in this case we want 100 replicates.
- `samp_prop` = the proportion of the data to resample, which is passed to `sample_frac()` -- here we want to resample 50% of the data

By wrapping this in a tibble, we can create a tidy product in which the replicate samples are stored as a nested list that can be used for summarizing, analysis, and plotting.

```{r resampling}

## remove NAs from the dataset and add a row (record) id column
## (so we can trace a sample back to the dataset if needed)
penguins_c <- penguins %>%
  drop_na() %>%
  mutate(id = row_number(),
         .before = species)

## specify the number of replicates and the sample prop (or size)
n_reps <- 100
samp_prop <- 0.50

## resample the data and put it in a tibble
re_penguins <- tibble(
  rep = 1:n_reps,
  samp = replicate(
    n = n_reps,
    sample_frac(
      tbl = penguins_c,
      size = samp_prop,
      replace = FALSE),
    simplify = FALSE))

```

What does this look like now?

```{r look nested table}

re_penguins

```


### Summarizing the data

One way to summarize the dataset is to first `unnest()` it, then use the replicate identifier (`rep`) as a grouping variable.

In the chunk below, I first `unnest()` the data (stored in the `samp` column), then `group_by()` `rep` (100 levels), `species` (3 levels), and `sex` (2 levels). This example calculates the median value for each of the numeric columns in the dataset, before `nest`ing the summary data and `join`ing it back with the original data.

```{r summarizing nested data}

## calculate the median trait values from each sample
re_penguins <- re_penguins %>%
  unnest(samp) %>%
  group_by(rep, species, sex) %>%
  summarise(across(.cols = bill_length_mm:body_mass_g,
                   ~ median(.))) %>%
  group_by(rep) %>%
  nest() %>%
  rename(summ_mean = data) %>%
  right_join(re_penguins, ., by = "rep") %>%
  suppressMessages()

re_penguins

```

### Plotting

Finally, we can plot the distributions of the re-sampled data. The plot below overlays density distributions of `penguins$bill_length_mm` for each random sample of the dataset, facetting by `species` and `sex`.

```{r plotting the distribution of bill length}

## plot the distributions of bill_length_mm from the resamples
re_penguins %>%
  select(rep, samp) %>%
  unnest(samp) %>%
  ggplot(data = .,
         mapping = aes(x = bill_length_mm,
                       group = rep,
                       colour = sex)) +
  geom_line(stat = "density", alpha = 0.2) +
  scale_colour_brewer(palette = "Dark2") +
  # facet_wrap(~ species) +
  facet_grid(species ~ sex) +
  labs(x = "Bill length (mm)", y = "Density",
       title = "Distributions of bill length in resamples of the penguins dataset",
       subtitle = "50% of the data were randomly resampled 100 times",
       caption = "Source: palmerpenguins R package") +
  theme_minimal() +
  theme(legend.position = "none")

```

This plot provides a good starting point for thinking about how to calculate a 95% confidence interval on the distributions based on the distribution of resamples. Of course, the same approach could also be used to estimate confidence intervals on the mean/median difference between two distributions (e.g., males and females, or comparing species).

#### What about the distributions of all the traits?

```{r plot all distributions, echo = FALSE}

## use patchwork to build the multi-plot
## bill length
p1.bl <- re_penguins %>%
  select(rep, samp) %>%
  unnest(samp) %>%
  ggplot(data = .,
             mapping = aes(x = bill_length_mm,
                           group = interaction(rep, sex))) +
  geom_line(mapping = aes(colour = sex),
            stat = "density", alpha = 0.2) +
  # geom_histogram(mapping = aes(fill = sex),
  #                alpha = 0.5, position = "identity") +
  scale_colour_brewer(palette = "Dark2",
                      name = "", labels = c("female", "male")) +
  # facet_grid(species ~ measure,
  #            scales = "free") +
  facet_wrap(~ species, ncol = 1) +
  labs(x = NULL, y = "Density",
       subtitle = "Bill length (mm)") +
  theme_minimal() +
  theme(legend.position = "none",
        plot.subtitle = element_text(hjust = 0.5, face = "italic"),
        strip.text = element_blank(),
        strip.background = element_blank())
# p1.bl

## bill depth
p2.bd <- re_penguins %>%
  select(rep, samp) %>%
  unnest(samp) %>%
  ggplot(data = .,
         mapping = aes(x = bill_depth_mm,
                       group = interaction(rep, sex))) +
  geom_line(mapping = aes(colour = sex),
            stat = "density", alpha = 0.2) +
  # geom_histogram(mapping = aes(fill = sex),
  #                alpha = 0.5, position = "identity") +
  scale_colour_brewer(palette = "Dark2",
                      name = "", labels = c("female", "male")) +
  # facet_grid(species ~ measure,
  #            scales = "free") +
  facet_wrap(~ species, ncol = 1) +
  labs(x = NULL, y = NULL,
       subtitle = "Bill depth (mm)") +
  theme_minimal() +
  theme(legend.position = "none",
        plot.subtitle = element_text(hjust = 0.5, face = "italic"),
        strip.text = element_blank(),
        strip.background = element_blank())
# p1.bd

## flipper length
p3.fl <- re_penguins %>%
  select(rep, samp) %>%
  unnest(samp) %>%
  ggplot(data = .,
         mapping = aes(x = flipper_length_mm / 10,
                       group = interaction(rep, sex))) +
  geom_line(mapping = aes(colour = sex),
            stat = "density", alpha = 0.2) +
  # geom_histogram(mapping = aes(fill = sex),
  #                alpha = 0.5, position = "identity") +
  scale_colour_brewer(palette = "Dark2",
                      name = "", labels = c("female", "male")) +
  # facet_grid(species ~ measure,
  #            scales = "free") +
  facet_wrap(~ species, ncol = 1) +
  labs(x = NULL, y = NULL,
       subtitle = "Flipper length (cm)") +
  theme_minimal() +
  theme(legend.position = "none",
        plot.subtitle = element_text(hjust = 0.5, face = "italic"),
        strip.text = element_blank(),
        strip.background = element_blank())
# p3.fl

## body mass
p4.bm <- re_penguins %>%
  select(rep, samp) %>%
  unnest(samp) %>%
  ggplot(data = .,
         mapping = aes(x = body_mass_g / 1000,
                       group = interaction(rep, sex))) +
  geom_line(mapping = aes(colour = sex),
            stat = "density", alpha = 0.2) +
  # geom_histogram(mapping = aes(fill = sex),
  #                alpha = 0.5, position = "identity") +
  scale_colour_brewer(palette = "Dark2",
                      name = "",
                      labels = c("female", "male"),
                      guide = guide_legend(override.aes = list(alpha = 1))) +
  # facet_grid(species ~ measure,
  #            scales = "free") +
  facet_wrap(~ species, ncol = 1, strip.position = "right") +
  labs(x = NULL, y = NULL,
       subtitle = "Body mass (kg)") +
  theme_minimal() +
  theme(legend.position = "bottom",
        legend.justification = "right",
        plot.subtitle = element_text(hjust = 0.5, face = "italic"),
        strip.text = element_text(face = "bold", size = 12))
# p4.bm

p1.bl + p2.bd + p3.fl + p4.bm +
  plot_layout(ncol = 4) +
  plot_annotation(
    title = "Distributions of trait measurements in resamples of the penguins dataset",
    subtitle = "50% of the observations were randomly resampled 100 times",
    caption = "Source: palmerpenguins R package")

ggsave("trait_distributions_resamp.pdf",
       plot = last_plot(), width = 11, height = 8, units = "in", dpi = "retina")

```

**Note**: originally tried building this plot buy pivoting the data to long format at facetting by species and measure (bill length, etc.), but `facet_grid()` doesn't allow for free y axes on each pane so density doesn't scale properly...). There's a workaround with `facet_wrap(~ a*b)` but that creates bad labels and makes it hard to compare distributions across species.