Bootstrap method for data cubes

Introduction

When working with data cubes, it’s essential to understand the uncertainty surrounding derived statistics. This tutorial introduces the bootstrap_cube() function from dubicube, which uses bootstrap resampling to estimate the variability, bias, and standard error of estimates.

Bootstrapping for data cubes

Bootstrapping is a resampling method used to approximate the distribution of a statistic by repeatedly sampling from the data with replacement. In the context of biodiversity data cubes, bootstrap_cube() enables us to assess the variability of derived statistics, such as by computing confidence intervals.

A data cube and a statistic

Consider a data cube $\mathbf{X}$ from which we want to calculate a statistic $\theta$.

Original Sample Data: $\mathbf{X} = {X_1, X_2, \ldots, X_n}$
- The initial set of data points. Here, $n$ is the sample size. This corresponds to the number of cells in a data cube or the number of rows in tabular format.
Statistic of Interest: $\theta$
- The parameter or statistic being estimated, such as the mean $\bar{X}$, variance $\sigma^2$, or a biodiversity indicator. Let $\hat{\theta}$ denote the estimated value of $\theta$ calculated from the complete dataset $\mathbf{X}$.

Resampling and recalculating

From $\mathbf{X}$, multiple resampled datasets $\mathbf{X}^$ are created by sampling rows (cells) with replacement until each new dataset is equally large as $\mathbf{X}$. This process is repeated $B$ times, and each resample yields a new estimate $\hat{\theta}^_b$ of the statistic.

Bootstrap Sample: $\mathbf{X}^* = {X_1^, X_2^, \ldots, X_n^*}$
- A sample of size $n$ drawn with replacement from the original sample $\mathbf{X}$. Each $X_i^*$ is drawn independently from $\mathbf{X}$.
- A total of $B$ bootstrap samples are drawn from the original data. Common choices for $B$ are 1000 or 10,000 to ensure a good approximation of the distribution of the bootstrap replications (see further).
Bootstrap Replication: $\hat{\theta}^*_b$
- The value of the statistic of interest calculated from the $b$-th bootstrap sample $\mathbf{X}^_b$. For example, if $\theta$ is the sample mean, $\hat{\theta}^_b = \bar{X}^*_b$.

Derivation of bootstrap statistics

The set of bootstrap replications forms the bootstrap distribution, which can be used to estimate bootstrap statistics and construct confidence intervals (see the interval calculation tutorial).

Bootstrap Estimate of the Statistic: $\hat{\theta}_{\text{boot}}$
- The average of the bootstrap replications:

$$ \hat{\theta}{\text{boot}} = \frac{1}{B} \sum{b=1}^B \hat{\theta}^*_b $$

Bootstrap Bias: $\text{Bias}_{\text{boot}}$
- This bias indicates how much the bootstrap estimate deviates from the original sample estimate. It is calculated as the difference between the average bootstrap estimate and the original estimate:

$$ \text{Bias}{\text{boot}} = \frac{1}{B} \sum{b=1}^B (\hat{\theta}^*b - \hat{\theta}) = \hat{\theta}{\text{boot}} - \hat{\theta} $$

Bootstrap Standard Error: $\text{SE}_{\text{boot}}$
- The standard deviation of the bootstrap replications, which estimates the variability of the statistic.

Getting started with dubicube

Our method can be used on any dataframe from which a statistic is calculated and a grouping variable is present. For this tutorial, we focus on occurrence cubes. Therefore, we will use the b3gbi package for processing the raw data before we go over to bootstrapping.

# Load packages
library(dubicube)

# Data loading and processing
library(frictionless) # Load example datasets
library(b3gbi)        # Process occurrence cubes

# General
library(ggplot2)      # Data visualisation
library(dplyr)        # Data wrangling
library(tidyr)        # Data wrangling

Loading and processing the data

We load the bird cube data from the b3data data package using frictionless (see also here). It is an occurrence cube for birds in Belgium between 2000 en 2024 using the MGRS grid at 10 km scale.

# Read data package
b3data_package <- read_package(
  "https://zenodo.org/records/15211029/files/datapackage.json"
)

# Load bird cube data
bird_cube_belgium <- read_resource(b3data_package, "bird_cube_belgium_mgrs10")
head(bird_cube_belgium)

We process the cube with b3gbi. First, we select 2000 random rows to make the dataset smaller. This is to reduce the computation time for this tutorial. We select the data from 2011 - 2020.

set.seed(123)

# Make dataset smaller
rows <- sample(nrow(bird_cube_belgium), 2000)
bird_cube_belgium <- bird_cube_belgium[rows, ]

# Process cube
processed_cube <- process_cube(
  bird_cube_belgium,
  first_year = 2011,
  last_year = 2020,
  cols_occurrences = "n"
)
processed_cube
#>
#> Processed data cube for calculating biodiversity indicators
#>
#> Date Range: 2011 - 2020
#> Single-resolution cube with cell size 10km ^2
#> Number of cells: 242
#> Grid reference system: mgrs
#> Coordinate range:
#>      xmin      xmax      ymin      ymax
#>  2.428844  6.329718 49.538733 51.444030
#>
#> Total number of observations: 45143
#> Number of species represented: 253
#> Number of families represented: 57
#>
#> Kingdoms represented: Data not present
#>
#> First 10 rows of data (use n = to show more):

Analysis of the data

Let’s say we are interested in the mean number of observations per grid cell per year. We create a function to calculate this.

# Function to calculate statistic of interest
# Mean observations per grid cell per year
mean_obs <- function(data) {
  obs <- x <- NULL

  data %>%
    dplyr::mutate(x = mean(obs), .by = "cellCode") %>%
    dplyr::summarise(diversity_val = mean(x), .by = "year") %>%
    as.data.frame()
}

We get the following results:

mean_obs(processed_cube$data)

On their own, these values don’t reveal how much uncertainty surrounds them. To better understand their variability, we use bootstrapping to estimate the distribution of the yearly means.

Bootstrapping

We use the bootstrap_cube() function to do this. It relies on the following arguments:

data_cube:
The input data, either as a processed data cube (from b3gbi::process_cube()), or simply the dataframe inside it (i.e. processed_cube$data). For faster computation, passing just the dataframe is recommended.
fun:
A user-defined function that computes the statistic(s) of interest from data_cube. This function should return a dataframe that includes a column named diversity_val, containing the statistic to evaluate.
grouping_var:
The column(s) used for grouping the output of fun(). For example, if fun() returns one value per year, use grouping_var = "year".
samples:
The number of bootstrap samples to draw. Common values are 1000 or more for reliable estimates of variability and confidence intervals.
seed:
An optional numeric seed to ensure reproducibility of the bootstrap resampling. Set this to a fixed value to get consistent results across runs.
progress:
Logical flag to show a progress bar. Set to TRUE to enable progress reporting; default is FALSE.

bootstrap_results <- bootstrap_cube(
  data_cube = processed_cube$data,
  fun = mean_obs,
  grouping_var = "year",
  samples = 1000,
  seed = 123
)

head(bootstrap_results)

We can visualise the bootstrap distributions using a violin plot.

# Get bias vales
bias_mean_obs <- bootstrap_results %>%
  distinct(year, estimate = est_original, `bootstrap estimate` = est_boot)

# Get estimate values
estimate_mean_obs <- bias_mean_obs %>%
  pivot_longer(cols = c("estimate", "bootstrap estimate"),
               names_to = "Legend", values_to = "value") %>%
  mutate(Legend = factor(Legend, levels = c("estimate", "bootstrap estimate"),
                         ordered = TRUE))

# Visualise bootrap distributions
bootstrap_results %>%
  ggplot(aes(x = year)) +
  # Distribution
  geom_violin(aes(y = rep_boot, group = year),
              fill = alpha("cornflowerblue", 0.2)) +
  # Estimates and bias
  geom_point(data = estimate_mean_obs, aes(y = value, shape = Legend),
             colour = "firebrick", size = 2.5) +
  # Settings
  labs(y = "Mean Number of Observations\nper Grid Cell",
       x = "", shape = "Legend:") +
  scale_x_continuous(breaks = sort(unique(bootstrap_results$year))) +
  theme_minimal() +
  theme(legend.position = "bottom",
        legend.title = element_text(face = "bold"))

Bootstrap distributions for mean number of occurrences over time.

Advanced usage of `bootstrap_cube()`

Cross-validate b3gbi functions

As stated in the documentation, it is also possible to bootstrap a processed cube generated by b3gbi::process_cube() with an indicator function of that package (e.g. evenness per year with b3gbi::pielou_evenness_ts()). However, this approach increases computation time.

bootstrap_results_evenness <- bootstrap_cube(
  data_cube = processed_cube,                 # "processed_cube" object
  fun = b3gbi::pielou_evenness_ts,            # b3gbi function
  grouping_var = "year",
  samples = 1000,
  seed = 123
)

Comparison with a reference group

A particularly insightful approach is comparing indicator values to a reference group. In time series analyses, this often means comparing each year’s indicator to a baseline year (e.g., the first or last year in the series).

To do this, we perform bootstrapping over the difference between indicator values:

Resample the dataset with replacement
Calculate the indicator for each group (e.g., each year)
For each non-reference group, compute the difference between its indicator value and that of the reference group
Repeat steps 1–3 across all bootstrap iterations

This process yields bootstrap replicate distributions of differences in indicator values.

bootstrap_results_ref <- bootstrap_cube(
  data_cube = processed_cube$data,
  fun = mean_obs,
  grouping_var = "year",
  samples = 1000,
  ref_group = 2011,
  seed = 123
)

head(bootstrap_results_ref)

We see that the mean number of observations is higher in most years compared to 2011. At what point can we say these differences are significant? This will be further explored in the effect classification tutorial.

# Get bias vales
bias_mean_obs <- bootstrap_results_ref %>%
  distinct(year, estimate = est_original, `bootstrap estimate` = est_boot)

# Get estimate values
estimate_mean_obs <- bias_mean_obs %>%
  pivot_longer(cols = c("estimate", "bootstrap estimate"),
               names_to = "Legend", values_to = "value") %>%
  mutate(Legend = factor(Legend, levels = c("estimate", "bootstrap estimate"),
                         ordered = TRUE))

# Visualise bootrap distributions
bootstrap_results_ref %>%
  ggplot(aes(x = year)) +
  # Distribution
  geom_violin(aes(y = rep_boot, group = year),
              fill = alpha("cornflowerblue", 0.2)) +
  # Estimates and bias
  geom_point(data = estimate_mean_obs, aes(y = value, shape = Legend),
             colour = "firebrick", size = 2) +
  # Settings
  labs(y = "Mean Number of Observations per Grid Cell\nCompared to 2011",
       x = "", shape = "Legend:") +
  scale_x_continuous(breaks = sort(unique(bootstrap_results_ref$year))) +
  theme_minimal() +
  theme(legend.position = "bottom",
        legend.title = element_text(face = "bold"))

Bootstrap distributions for mean number of occurrences over time (ref).

Note that the choice of the reference year should be well considered. Keep in mind which comparisons should be made, and what the motivation is behind the reference period. A high or low value in the reference period relative to other periods, e.g. an exceptional bad or good year, can affect the magnitude and direction of the calculated differences. Whether this should be avoided or not, depends on the motivation behind the choice and the research question. A reference period can be determined by legislation, or by the start of a monitoring campaign. A specific research question can determine the periods that need to be compared. Furthermore, the variability of the estimate of reference period affects the width of confidence intervals for the differences. A more variable reference period will propagate greater uncertainty. In the case of GBIF data, more data will be available in recent years than in earlier years. If this is the case, it could make sense to select the last period as a reference period. In a way, this also avoids the arbitrariness of choice for the reference period. You compare previous situations with the current situation (last year), where you could repeat this comparison annually, for example. Finally, when comparing multiple indicators, we recommend using a consistent reference period to maintain comparability