Skip to content

Whole-cube bootstrap versus group-specific bootstrap

Introduction

When calculating biodiversity indicators from a cube, we often want confidence intervals (CIs) using bootstrapping. In dubicube, bootstrapping can be done in two ways:

  • Whole-cube bootstrapping: resampling all rows in the cube, regardless of grouping.
  • Group-specific bootstrapping: resampling rows only within a group of interest (e.g., a species, year, or habitat).

The choice between these two methods directly affects how confidence intervals should be interpreted:

  • Other indicators combine information across groups (e.g., community richness, turnover, or multi-species metrics). These require whole-cube bootstrapping to preserve correlations.
  • Some indicators are calculated independently per group (e.g., species-specific or year-specific metrics). For these, group-specific bootstrapping is usually more appropriate.

In this tutorial, we explain the differences, discuss the strengths and limitations of each method, and provide a worked example where group-specific bootstrapping is the correct choice.

Whole-cube bootstrap

Definition: Resample all rows in the cube, regardless of species, year, or other grouping.

Advantages:

  • Preserves correlations between groups (e.g., species co-occurrence, temporal dependencies).
  • Appropriate for indicators that depend on multiple groups together (community-level metrics, multi-species diversity).

Disadvantages:

  • Rare groups may end up with zero rows in some bootstrap replicates, leading to wider or undefined CIs.
  • Variance for small groups may be inflated.

Use case examples:

  • Community richness per site or habitat.
  • Multi-species indicators (e.g., average occupancy across species).
  • Temporal turnover indicators that rely on multiple years.

Implementation in dubicube

  • Default use like tutorials and examples.

Group-specific bootstrap

Definition: Subset the cube by the group of interest (e.g., species or year), then resample rows only within that group.

Advantages:

  • Guarantees each replicate has rows for the group → stable CIs.
  • Reflects within-group variability only.

Disadvantages:

  • Ignores correlations with other groups.
  • Variance may be slightly underestimated if the group’s presence is correlated with other groups.

Use case examples:

  • Species-specific occupancy or habitat preference metrics.
  • Year-specific indicators (e.g., annual richness).
  • Small or rare groups where zero-row replicates would be problematic.

Implementation in dubicube

  • Perform bootstrapping and interval calculation per group (e.g. using a for loop or lapply()).
  • See further.

An example of a group-specific analysis

We reuse the example introduced in bootstrap confidence interval calculation tutorial where we calculate confidence limits for the mean number of observations per grid cell per year for birds in Belgium between 2011 en 2020 using the MGRS grid at 10 km scale.

# Load packages
library(ggplot2) # Data visualisation
library(dplyr) # Data wrangling
# Data loading and processing
library(frictionless) # Load example datasets
library(b3gbi) # Process occurrence cubes
library(dubicube) # Analysis of data quality & indicator uncertainty

Loading and processing the data

We load the bird cube data from the b3data data package using frictionless (see also here).

# Read data package
b3data_package <- read_package(
"https://zenodo.org/records/15211029/files/datapackage.json"
)
# Load bird cube data
bird_cube_belgium <- read_resource(b3data_package, "bird_cube_belgium_mgrs10")
head(bird_cube_belgium)
#> # A tibble: 6 × 8
#> year mgrscode specieskey species family n mincoordinateuncertain…¹ familycount
#> <dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 2000 31UDS65 2473958 Perdix perdix Phasianidae 1 3536 261414
#> 2 2000 31UDS65 2474156 Coturnix coturnix Phasianidae 1 3536 261414
#> 3 2000 31UDS65 2474377 Fulica atra Rallidae 5 1000 507437
#> 4 2000 31UDS65 2475443 Merops apiaster Meropidae 6 1000 1655
#> 5 2000 31UDS65 2480242 Vanellus vanellus Charadriidae 1 3536 294808
#> 6 2000 31UDS65 2480637 Accipiter nisus Accipitridae 1 3536 855924
#> # ℹ abbreviated name: ¹​mincoordinateuncertaintyinmeters

We process the cube with b3gbi. First, we select 2000 random rows to make the dataset smaller. This is to reduce the computation time for this tutorial. We select the data from 2011 - 2020.

set.seed(123)
# Make dataset smaller
rows <- sample(nrow(bird_cube_belgium), 2000)
bird_cube_belgium <- bird_cube_belgium[rows, ]
# Process cube
processed_cube <- process_cube(
bird_cube_belgium,
first_year = 2011,
last_year = 2020,
cols_occurrences = "n"
)
processed_cube
#>
#> Processed data cube for calculating biodiversity indicators
#>
#> Date Range: 2011 - 2020
#> Single-resolution cube with cell size 10km ^2
#> Number of cells: 242
#> Grid reference system: mgrs
#> Coordinate range:
#> xmin xmax ymin ymax
#> 280000 710000 5490000 5700000
#>
#> Total number of observations: 45143
#> Number of species represented: 253
#> Number of families represented: 57
#>
#> Kingdoms represented: Data not present
#>
#> First 10 rows of data (use n = to show more):
#>
#> # A tibble: 957 × 13
#> year cellCode taxonKey scientificName family obs minCoordinateUncerta…¹ familyCount xcoord
#> <dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 2011 31UFS56 5231918 Cuculus canorus Cucul… 11 3536 67486 650000
#> 2 2011 31UES28 5739317 Phoenicurus phoeni… Musci… 6 3536 610513 520000
#> 3 2011 31UFS64 6065824 Chroicocephalus ri… Larid… 143 1000 2612978 660000
#> 4 2011 31UFS96 2492576 Muscicapa striata Musci… 3 3536 610513 690000
#> 5 2011 31UES04 5231198 Passer montanus Passe… 1 3536 175872 500000
#> 6 2011 31UES85 5229493 Garrulus glandarius Corvi… 23 707 816442 580000
#> 7 2011 31UES88 10124612 Anser anser x Bran… Anati… 1 100 2709975 580000
#> 8 2011 31UES22 2481172 Larus marinus Larid… 8 1000 2612978 520000
#> 9 2011 31UFS43 2481139 Larus argentatus Larid… 10 3536 2612978 640000
#> 10 2011 31UFT00 9274012 Spatula querquedula Anati… 8 3536 2709975 600000
#> # ℹ 947 more rows
#> # ℹ abbreviated name: ¹​minCoordinateUncertaintyInMeters
#> # ℹ 4 more variables: ycoord <dbl>, utmzone <int>, hemisphere <chr>, resolution <chr>

Analysis of the data

Let’s say we are interested in the mean number of observations per grid cell per year. We create a function to calculate this.

In contrast to the other tutorials, we now calculate this exclusively per year. Therefore, the group-specific bootstrap is more appropriate here. This is because our indicator (mean observations per year) depends only on variation within each year, and not on correlations across years.

# Function to calculate statistic of interest
# Mean observations per grid cell per year
mean_obs <- function(data) {
data %>%
group_by(year, cellCode) %>%
dplyr::mutate(x = mean(obs)) %>%
ungroup() %>%
dplyr::summarise(diversity_val = mean(x), .by = "year") %>%
as.data.frame()
}

We get the following results:

mean_obs(processed_cube$data)
#> year diversity_val
#> 1 2011 31.28846
#> 2 2012 30.96842
#> 3 2013 36.65049
#> 4 2014 42.17143
#> 5 2015 45.76471
#> 6 2016 39.76068
#> 7 2017 119.70833
#> 8 2018 49.87963
#> 9 2019 21.06780
#> 10 2020 19.78689

On their own, these values don’t reveal how much uncertainty surrounds them. To better understand their variability, we use bootstrapping to estimate the distribution of the yearly means. From this distribution, we can calculate bootstrap confidence intervals.

Group-specific bootstrapping

We use the bootstrap_cube() function to perform bootstrapping (see also the bootstrap tutorial). However, we now split the data per year and perform bootstrapping per year using lapply(). We store the data also in the output list to be used for interval calculation further on.

bootstrap_results_group <- processed_cube$data %>%
split(processed_cube$data$year) %>%
lapply(function(cube) {
bootstrap_results <- bootstrap_cube(
data_cube = cube,
fun = mean_obs,
grouping_var = "year",
samples = 1000,
seed = 123,
processed_cube = FALSE
)
return(list(bootstrap_results = bootstrap_results, data = cube))
})
head(bootstrap_results_group[[1]]$bootstrap_results)
#> sample year est_original rep_boot est_boot se_boot bias_boot
#> 1 1 2011 31.28846 34.61538 31.40296 5.830663 0.1145
#> 2 2 2011 31.28846 20.95192 31.40296 5.830663 0.1145
#> 3 3 2011 31.28846 20.69231 31.40296 5.830663 0.1145
#> 4 4 2011 31.28846 29.25000 31.40296 5.830663 0.1145
#> 5 5 2011 31.28846 34.47115 31.40296 5.830663 0.1145
#> 6 6 2011 31.28846 32.90385 31.40296 5.830663 0.1145

Group-specific interval calculation

Now we can use the calculate_bootstrap_ci() function to calculate confidence limits (see also the bootstrap confidence interval calculation tutorial). Because BCa interval calculation relies on jackknifing, we also need to do this per group using lapply().

ci_mean_obs_group_list <- bootstrap_results_group %>%
lapply(function(list) {
calculate_bootstrap_ci(
list$bootstrap_results,
grouping_var = "year",
type = c("perc", "bca", "norm", "basic"),
conf = 0.95,
data_cube = list$data, # Required for Bca
fun = mean_obs # Required for Bca
)
})
# Make interval type factor
ci_mean_obs_group <- bind_rows(ci_mean_obs_group_list) %>%
mutate(
int_type = factor(
int_type, levels = c("perc", "bca", "norm", "basic")
)
)
head(ci_mean_obs_group)
#> year est_original est_boot se_boot bias_boot int_type conf ll ul
#> 1 2011 31.28846 31.40296 5.830663 0.11450000 perc 0.95 20.83010 43.83092
#> 2 2011 31.28846 31.40296 5.830663 0.11450000 bca 0.95 22.28181 46.74330
#> 3 2011 31.28846 31.40296 5.830663 0.11450000 norm 0.95 19.74607 42.60185
#> 4 2011 31.28846 31.40296 5.830663 0.11450000 basic 0.95 18.74600 41.74682
#> 5 2012 30.96842 31.02205 4.566094 0.05363158 perc 0.95 22.80027 40.21998
#> 6 2012 30.96842 31.02205 4.566094 0.05363158 bca 0.95 23.28024 41.32107

We visualise the results.

ci_mean_obs_group %>%
ggplot(aes(x = year, y = est_original)) +
# Intervals
geom_errorbar(aes(ymin = ll, ymax = ul),
position = position_dodge(0.8), linewidth = 0.8) +
# Estimates
geom_point(colour = "firebrick", size = 2) +
# Settings
labs(y = "Mean Number of Observations\nper Grid Cell") +
scale_x_continuous(breaks = sort(unique(ci_mean_obs_group$year))) +
theme_minimal() +
facet_wrap(~int_type)
Confidence intervals for mean number of occurrences over time.