Bootstrap method for data cubes
Introduction
When working with data cubes, it’s essential to understand the uncertainty surrounding derived statistics. This tutorial introduces the bootstrap_cube()
function from dubicube, which uses bootstrap resampling to estimate the variability, bias, and standard error of estimates.
Bootstrapping for data cubes
Bootstrapping is a resampling method used to approximate the distribution of a statistic by repeatedly sampling from the data with replacement. In the context of biodiversity data cubes, bootstrap_cube()
enables us to assess the variability of derived statistics, such as by computing confidence intervals.
A data cube and a statistic
Consider a data cube $\mathbf{X}$ from which we want to calculate a statistic $\theta$.
-
Original Sample Data: $\mathbf{X} = {X_1, X_2, \ldots, X_n}$
- The initial set of data points. Here, $n$ is the sample size. This corresponds to the number of cells in a data cube or the number of rows in tabular format.
-
Statistic of Interest: $\theta$
- The parameter or statistic being estimated, such as the mean $\bar{X}$, variance $\sigma^2$, or a biodiversity indicator. Let $\hat{\theta}$ denote the estimated value of $\theta$ calculated from the complete dataset $\mathbf{X}$.
Resampling and recalculating
From $\mathbf{X}$, multiple resampled datasets $\mathbf{X}^$ are created by sampling rows (cells) with replacement until each new dataset is equally large as $\mathbf{X}$. This process is repeated $B$ times, and each resample yields a new estimate $\hat{\theta}^_b$ of the statistic.
-
Bootstrap Sample: $\mathbf{X}^* = {X_1^, X_2^, \ldots, X_n^*}$
- A sample of size $n$ drawn with replacement from the original sample $\mathbf{X}$. Each $X_i^*$ is drawn independently from $\mathbf{X}$.
- A total of $B$ bootstrap samples are drawn from the original data. Common choices for $B$ are 1000 or 10,000 to ensure a good approximation of the distribution of the bootstrap replications (see further).
-
Bootstrap Replication: $\hat{\theta}^*_b$
- The value of the statistic of interest calculated from the $b$-th bootstrap sample $\mathbf{X}^_b$. For example, if $\theta$ is the sample mean, $\hat{\theta}^_b = \bar{X}^*_b$.
Derivation of bootstrap statistics
The set of bootstrap replications forms the bootstrap distribution, which can be used to estimate bootstrap statistics and construct confidence intervals (see the interval calculation tutorial).
- Bootstrap Estimate of the Statistic: $\hat{\theta}_{\text{boot}}$
- The average of the bootstrap replications:
$$ \hat{\theta}{\text{boot}} = \frac{1}{B} \sum{b=1}^B \hat{\theta}^*_b $$
- Bootstrap Bias: $\text{Bias}_{\text{boot}}$
- This bias indicates how much the bootstrap estimate deviates from the original sample estimate. It is calculated as the difference between the average bootstrap estimate and the original estimate:
$$ \text{Bias}{\text{boot}} = \frac{1}{B} \sum{b=1}^B (\hat{\theta}^*b - \hat{\theta}) = \hat{\theta}{\text{boot}} - \hat{\theta} $$
- Bootstrap Standard Error: $\text{SE}_{\text{boot}}$
- The standard deviation of the bootstrap replications, which estimates the variability of the statistic.
Getting started with dubicube
Our method can be used on any dataframe from which a statistic is calculated and a grouping variable is present. For this tutorial, we focus on occurrence cubes. Therefore, we will use the b3gbi package for processing the raw data before we go over to bootstrapping.
# Load packageslibrary(dubicube)
# Data loading and processinglibrary(frictionless) # Load example datasetslibrary(b3gbi) # Process occurrence cubes
# Generallibrary(ggplot2) # Data visualisationlibrary(dplyr) # Data wranglinglibrary(tidyr) # Data wrangling
Loading and processing the data
We load the bird cube data from the b3data data package using frictionless (see also here). It is an occurrence cube for birds in Belgium between 2000 en 2024 using the MGRS grid at 10 km scale.
# Read data packageb3data_package <- read_package( "https://zenodo.org/records/15211029/files/datapackage.json")
# Load bird cube databird_cube_belgium <- read_resource(b3data_package, "bird_cube_belgium_mgrs10")head(bird_cube_belgium)
We process the cube with b3gbi. First, we select 2000 random rows to make the dataset smaller. This is to reduce the computation time for this tutorial. We select the data from 2011 - 2020.
set.seed(123)
# Make dataset smallerrows <- sample(nrow(bird_cube_belgium), 2000)bird_cube_belgium <- bird_cube_belgium[rows, ]
# Process cubeprocessed_cube <- process_cube( bird_cube_belgium, first_year = 2011, last_year = 2020, cols_occurrences = "n")processed_cube#>#> Processed data cube for calculating biodiversity indicators#>#> Date Range: 2011 - 2020#> Single-resolution cube with cell size 10km ^2#> Number of cells: 242#> Grid reference system: mgrs#> Coordinate range:#> xmin xmax ymin ymax#> 2.428844 6.329718 49.538733 51.444030#>#> Total number of observations: 45143#> Number of species represented: 253#> Number of families represented: 57#>#> Kingdoms represented: Data not present#>#> First 10 rows of data (use n = to show more):
Analysis of the data
Let’s say we are interested in the mean number of observations per grid cell per year. We create a function to calculate this.
# Function to calculate statistic of interest# Mean observations per grid cell per yearmean_obs <- function(data) { obs <- x <- NULL
data %>% dplyr::mutate(x = mean(obs), .by = "cellCode") %>% dplyr::summarise(diversity_val = mean(x), .by = "year") %>% as.data.frame()}
We get the following results:
mean_obs(processed_cube$data)
On their own, these values don’t reveal how much uncertainty surrounds them. To better understand their variability, we use bootstrapping to estimate the distribution of the yearly means.
Bootstrapping
We use the bootstrap_cube()
function to do this. It relies on the following arguments:
-
data_cube
:
The input data, either as a processed data cube (fromb3gbi::process_cube()
), or simply the dataframe inside it (i.e.processed_cube$data
). For faster computation, passing just the dataframe is recommended. -
fun
:
A user-defined function that computes the statistic(s) of interest fromdata_cube
. This function should return a dataframe that includes a column nameddiversity_val
, containing the statistic to evaluate. -
grouping_var
:
The column(s) used for grouping the output offun()
. For example, iffun()
returns one value per year, usegrouping_var = "year"
. -
samples
:
The number of bootstrap samples to draw. Common values are 1000 or more for reliable estimates of variability and confidence intervals. -
seed
:
An optional numeric seed to ensure reproducibility of the bootstrap resampling. Set this to a fixed value to get consistent results across runs. -
progress
:
Logical flag to show a progress bar. Set toTRUE
to enable progress reporting; default isFALSE
.
bootstrap_results <- bootstrap_cube( data_cube = processed_cube$data, fun = mean_obs, grouping_var = "year", samples = 1000, seed = 123)
head(bootstrap_results)
We can visualise the bootstrap distributions using a violin plot.
# Get bias valesbias_mean_obs <- bootstrap_results %>% distinct(year, estimate = est_original, `bootstrap estimate` = est_boot)
# Get estimate valuesestimate_mean_obs <- bias_mean_obs %>% pivot_longer(cols = c("estimate", "bootstrap estimate"), names_to = "Legend", values_to = "value") %>% mutate(Legend = factor(Legend, levels = c("estimate", "bootstrap estimate"), ordered = TRUE))
# Visualise bootrap distributionsbootstrap_results %>% ggplot(aes(x = year)) + # Distribution geom_violin(aes(y = rep_boot, group = year), fill = alpha("cornflowerblue", 0.2)) + # Estimates and bias geom_point(data = estimate_mean_obs, aes(y = value, shape = Legend), colour = "firebrick", size = 2.5) + # Settings labs(y = "Mean Number of Observations\nper Grid Cell", x = "", shape = "Legend:") + scale_x_continuous(breaks = sort(unique(bootstrap_results$year))) + theme_minimal() + theme(legend.position = "bottom", legend.title = element_text(face = "bold"))

Advanced usage of bootstrap_cube()
Cross-validate b3gbi functions
As stated in the documentation, it is also possible to bootstrap a processed cube generated by b3gbi::process_cube()
with an indicator function of that package (e.g. evenness per year with b3gbi::pielou_evenness_ts()
).
However, this approach increases computation time.
bootstrap_results_evenness <- bootstrap_cube( data_cube = processed_cube, # "processed_cube" object fun = b3gbi::pielou_evenness_ts, # b3gbi function grouping_var = "year", samples = 1000, seed = 123)
Comparison with a reference group
A particularly insightful approach is comparing indicator values to a reference group. In time series analyses, this often means comparing each year’s indicator to a baseline year (e.g., the first or last year in the series).
To do this, we perform bootstrapping over the difference between indicator values:
- Resample the dataset with replacement
- Calculate the indicator for each group (e.g., each year)
- For each non-reference group, compute the difference between its indicator value and that of the reference group
- Repeat steps 1–3 across all bootstrap iterations
This process yields bootstrap replicate distributions of differences in indicator values.
bootstrap_results_ref <- bootstrap_cube( data_cube = processed_cube$data, fun = mean_obs, grouping_var = "year", samples = 1000, ref_group = 2011, seed = 123)
head(bootstrap_results_ref)
We see that the mean number of observations is higher in most years compared to 2011. At what point can we say these differences are significant? This will be further explored in the effect classification tutorial.
# Get bias valesbias_mean_obs <- bootstrap_results_ref %>% distinct(year, estimate = est_original, `bootstrap estimate` = est_boot)
# Get estimate valuesestimate_mean_obs <- bias_mean_obs %>% pivot_longer(cols = c("estimate", "bootstrap estimate"), names_to = "Legend", values_to = "value") %>% mutate(Legend = factor(Legend, levels = c("estimate", "bootstrap estimate"), ordered = TRUE))
# Visualise bootrap distributionsbootstrap_results_ref %>% ggplot(aes(x = year)) + # Distribution geom_violin(aes(y = rep_boot, group = year), fill = alpha("cornflowerblue", 0.2)) + # Estimates and bias geom_point(data = estimate_mean_obs, aes(y = value, shape = Legend), colour = "firebrick", size = 2) + # Settings labs(y = "Mean Number of Observations per Grid Cell\nCompared to 2011", x = "", shape = "Legend:") + scale_x_continuous(breaks = sort(unique(bootstrap_results_ref$year))) + theme_minimal() + theme(legend.position = "bottom", legend.title = element_text(face = "bold"))

Note that the choice of the reference year should be well considered. Keep in mind which comparisons should be made, and what the motivation is behind the reference period. A high or low value in the reference period relative to other periods, e.g. an exceptional bad or good year, can affect the magnitude and direction of the calculated differences. Whether this should be avoided or not, depends on the motivation behind the choice and the research question. A reference period can be determined by legislation, or by the start of a monitoring campaign. A specific research question can determine the periods that need to be compared. Furthermore, the variability of the estimate of reference period affects the width of confidence intervals for the differences. A more variable reference period will propagate greater uncertainty. In the case of GBIF data, more data will be available in recent years than in earlier years. If this is the case, it could make sense to select the last period as a reference period. In a way, this also avoids the arbitrariness of choice for the reference period. You compare previous situations with the current situation (last year), where you could repeat this comparison annually, for example. Finally, when comparing multiple indicators, we recommend using a consistent reference period to maintain comparability