Simulating the occurrence process
The workflow for simulating a biodiversity data cube with gcube can be divided into three steps or processes:
- Occurrence process
- Detection process
- Grid designation process
This tutorial documents the first part of the gcube simulation workflow, viz. the occurrence process.
# Load packages
library(gcube)

library(sf)        # work with spatial objects
library(dplyr)     # data wrangling
library(ggplot2)   # data visualisation
library(tidyterra) # visualisation of SpatRaster objects
Input
The functions are set up such that a single polygon as input is enough to go through this workflow using default arguments. The user can change these arguments to allow for more flexibility. In this tutorial we will demonstrate the different options.
As input, we create a polygon in which we want to simulate occurrences. It represents the spatial extent of the species.
polygon <- st_polygon(list(cbind(
  c(500, 1000, 1000, 600, 200, 100, 500),
  c(200, 100, 700, 1000, 900, 500, 200)
)))
The polygon looks like this.
ggplot() +
  geom_sf(data = polygon) +
  theme_minimal()

Simulate occurrences
We generate occurrence points within the polygon using the simulate_occurrences() function.
Default arguments ensure that an sf object with POLYGON geometry is sufficient to simulate occurrences.
?simulate_occurrences
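For instance, a minimal call could look like the sketch below: only the polygon is provided (the seed is optional and added just for reproducibility), and all other arguments keep their defaults.

# Minimal sketch: only the species range is required, defaults do the rest
occ_default <- simulate_occurrences(
  species_range = polygon,
  seed = 123
)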
The options for user defined arguments are demonstrated in the next subsections.
Changing number of occurrences over time
Say we want an average of 100 occurrences in our plot over 10 years. We can change the trend in the average number of occurrences over time. We visualise this with the supporting functions used in simulate_occurrences(). The number of occurrences is always drawn from a Poisson distribution.
?simulate_timeseries
Option 1
If we do not specify a temporal function, we draw from a Poisson distribution for each time point with average (lambda parameter) initial_average_occurrences.
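Conceptually, this boils down to independent Poisson draws, as in this base R sketch (an illustration of the idea, not the internal code of simulate_timeseries()):

# Conceptual sketch: one independent Poisson draw per time point
set.seed(123)
rpois(n = 10, lambda = 100)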
n_occurrences_indep <- simulate_timeseries(
  initial_average_occurrences = 100,
  n_time_points = 10,
  temporal_function = NA,
  seed = 123
)
We plot the simulated number of occurrences over time. We see that the average is close to 100 over time as expected. Using a different seed will result in different numbers but the average will be (close to) 100 over time.
tibble(
  n_occurrences = n_occurrences_indep,
  time_point = seq_along(n_occurrences_indep)
) %>%
  ggplot(aes(x = time_point, y = n_occurrences)) +
  geom_point() +
  geom_smooth(method = "lm", formula = "y ~ x", se = FALSE) +
  theme_minimal()

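We can also verify this numerically; the exact value depends on the seed, but the mean should be close to 100.

# Numerical check: the mean of the simulated counts should be close to 100
mean(n_occurrences_indep)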
Option 2
We can specify a function ourselves, e.g. the internal function simulate_random_walk(), to have a random walk over time. A random walk is a mathematical concept where each step is determined randomly. The sd_step parameter refers to the standard deviation of these random steps (drawn from a Normal distribution). A higher value leads to larger steps and potentially greater variability in the path of the random walk.
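As an illustration of the mechanism (not the internal code of simulate_random_walk()), a random walk on the average number of occurrences can be sketched in base R as follows:

# Conceptual sketch of a random walk on the average (lambda) over time
set.seed(123)
n_time_points <- 10
sd_step <- 1
lambdas <- numeric(n_time_points)
lambdas[1] <- 100 # initial average number of occurrences
for (i in 2:n_time_points) {
  # each step adds Normal noise with standard deviation sd_step
  lambdas[i] <- lambdas[i - 1] + rnorm(1, mean = 0, sd = sd_step)
}
lambdas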
n_occurrences_walk <- simulate_timeseries(
  initial_average_occurrences = 100,
  n_time_points = 10,
  temporal_function = simulate_random_walk,
  sd_step = 1,
  seed = 123
)
We plot the simulated number of occurrences over time, which follows a random walk. Using a different seed will result in a different random pattern.
tibble(
  n_occurrences = n_occurrences_walk,
  time_point = seq_along(n_occurrences_walk)
) %>%
  ggplot(aes(x = time_point, y = n_occurrences)) +
  geom_point() +
  theme_minimal()

Option 3
We can specify a function ourselves that determines the average trend in number of occurrences over time. Here we provide an example for a linear trend.
my_own_linear_function <- function(
    initial_average_occurrences = initial_average_occurrences,
    n_time_points = n_time_points,
    coef) {
  # Calculate new average abundances over time
  time <- seq_len(n_time_points) - 1
  lambdas <- initial_average_occurrences + (coef * time)

  # Identify where the lambda values become 0 or lower
  zero_or_lower_index <- which(lambdas <= 0)

  # If any lambda becomes 0 or lower, set all subsequent lambdas to 0
  if (length(zero_or_lower_index) > 0) {
    zero_or_lower_indices <- zero_or_lower_index[1]:n_time_points
    lambdas[zero_or_lower_indices] <- 0
  }

  # Return average abundances
  return(lambdas)
}
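The function returns the average (lambda) per time point; simulate_timeseries() then draws the Poisson counts from these averages. Because the averages are truncated at zero, a strongly negative slope flattens out, which we can check directly:

# With a strongly negative slope the averages are truncated at 0
my_own_linear_function(
  initial_average_occurrences = 100,
  n_time_points = 10,
  coef = -30
)
#> [1] 100  70  40  10   0   0   0   0   0   0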
We try out a linear trend with slope equal to 1.
n_occurrences_linear <- simulate_timeseries(
  initial_average_occurrences = 100,
  n_time_points = 10,
  temporal_function = my_own_linear_function,
  coef = 1,
  seed = 123
)
We plot the simulated number of occurrences over time. We see that the average slope is indeed close to 1. Using a different seed will result in different numbers but the average slope will be (close to) 1.
tibble(
  n_occurrences = n_occurrences_linear,
  time_point = seq_along(n_occurrences_linear)
) %>%
  ggplot(aes(x = time_point, y = n_occurrences)) +
  geom_point() +
  geom_smooth(method = "lm", formula = "y ~ x", se = FALSE) +
  theme_minimal()

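A quick numerical check with base R's lm() confirms this (the fitted slope will vary slightly with the seed):

# Numerical check: the fitted slope should be close to the coef value of 1
coef(lm(n_occurrences_linear ~ seq_along(n_occurrences_linear)))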
Changing the degree of spatial clustering
We can also choose the amount of spatial clustering.
We visualise this with the supporting functions used in simulate_occurrences().
?create_spatial_pattern
Option 1
There are defaults for random and clustered patterns. Let’s look at the default where we have no clustering.
rs_pattern_random <- create_spatial_pattern(
  polygon = polygon,
  resolution = 10,
  spatial_pattern = "random",
  seed = 123
)
#> [using unconditional Gaussian simulation]
We see values of high sampling probability randomly distributed.
ggplot() +
  geom_spatraster(data = rs_pattern_random) +
  scale_fill_continuous(type = "viridis") +
  theme_minimal()

Option 2
Let’s look at the default where we have clustering (same as spatial_pattern = 10, see below).
rs_pattern_clustered <- create_spatial_pattern(
  polygon = polygon,
  resolution = 10,
  spatial_pattern = "clustered",
  seed = 123
)
#> [using unconditional Gaussian simulation]
We see values of high sampling probability clustered together.
ggplot() +
  geom_spatraster(data = rs_pattern_clustered) +
  scale_fill_continuous(type = "viridis") +
  theme_minimal()

Option 3
We can also set the degree of clustering ourselves. A larger value for spatial_pattern results in broader clusters. Let’s look at a low value first.
Let’s look at a low value for clustering.
rs_pattern_clustered2 <- create_spatial_pattern(
  polygon = polygon,
  resolution = 10,
  spatial_pattern = 5,
  seed = 123
)
#> [using unconditional Gaussian simulation]
We see values of high sampling probability in multiple, smaller clusters.
ggplot() +
  geom_spatraster(data = rs_pattern_clustered2) +
  scale_fill_continuous(type = "viridis") +
  theme_minimal()

Let’s look at a high value for clustering.
rs_pattern_clustered3 <- create_spatial_pattern(
  polygon = polygon,
  resolution = 10,
  spatial_pattern = 100,
  seed = 123
)
#> [using unconditional Gaussian simulation]
We see values of high sampling probability in fewer, larger clusters.
ggplot() +
  geom_spatraster(data = rs_pattern_clustered3) +
  scale_fill_continuous(type = "viridis") +
  theme_minimal()

The patterns generated above are then used to sample occurrences with a different supporting function.
?sample_occurrences_from_raster
If we, for example, sample 500 occurrences from the last raster, we see that the sampling follows the expected pattern.
pts_occ_clustered3 <- sample_occurrences_from_raster(
  raster = rs_pattern_clustered3,
  time_series = 500,
  seed = 123
)
ggplot() +
  geom_spatraster(data = rs_pattern_clustered3) +
  geom_sf(data = pts_occ_clustered3) +
  scale_fill_continuous(type = "viridis") +
  theme_minimal()

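As a rough numerical check, the sampling probability at the sampled points should on average exceed the mean over the whole raster. Below is a sketch using terra (it assumes terra::extract() returns an ID column followed by the value column):

library(terra) # extract raster values at point locations

# Probability values at the sampled points (drop the ID column)
vals_at_points <- extract(rs_pattern_clustered3, vect(pts_occ_clustered3))[, 2]

# Compare with the overall mean of the raster
mean(vals_at_points)
global(rs_pattern_clustered3, "mean", na.rm = TRUE)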
Example
Now that we know how the supporting functions work, we can generate occurrence points within the polygon using the simulate_occurrences() function. We can, for example, sample randomly within the polygon over 6 time points, where we use a random walk over time with an initial average number of occurrences equal to 100.
occurrences_df <- simulate_occurrences(
  species_range = polygon,
  initial_average_occurrences = 100,
  n_time_points = 6,
  temporal_function = simulate_random_walk,
  sd_step = 1,
  spatial_pattern = "random",
  seed = 123
)
#> [using unconditional Gaussian simulation]
This is the number of occurrences we have for each time point.
occurrences_df %>%
  st_drop_geometry() %>%
  count(time_point) %>%
  ggplot(aes(x = time_point, y = n)) +
  geom_point() +
  theme_minimal()

This is the spatial distribution of the occurrences for each time point.
ggplot() +
  geom_sf(data = polygon) +
  geom_sf(data = occurrences_df) +
  facet_wrap(~time_point, nrow = 2) +
  ggtitle("Distribution of occurrences for each time point") +
  theme_minimal()
