The Theory of Island Biogeography by MacArthur and Wilson (1967) is one of the most influential and successful theories in ecology, a rich source of inspiration for new research and ideas since it was first suggested. MacArthur and Wilson originally proposed that the number of species in an island is determined by the size of the island and its distance from the mainland. These factors result in a dynamic equilibrium between colonization from the mainland and extinction of species once they arrive on the island. The theory initially presented by MacArthur and Wilson was called ETIB (Simberloff 1974). Wilson and Simberloff’s (1969) study of the experimental defaunation of mangrove islands in the Florida Keys was the first validation of the theory, which explicitly emphasizes its dynamic aspects, and it showed how different mangrove islands reached an equilibrium with a species richness equivalent to the number of species present before the experimental defaunation.

The package `island`

follows the stochastic implementation of Simberloff’s model (1969) developed by Alonso *et al.* (2015), that uses a likelihood approach to estimate colonization and extinction rates for communities that have been repeatedly sampled through time.

This package allows:

1. Estimating colonization and extinction rates with regular and irregular sampling schemes, for whole communities or groups within those communities, taking into account or not imperfect detectability.

2. Converting those rates into transition probabilities.

3. Performing Akaike Information Criterion (AIC) -based model selection.

4. Estimating the influence of environmental variables on colonization and extinction dynamics.

5. Simulating the dynamics of the Equilibrium Theory of Island Biogeography, as well as three other population models.

6. Evaluating model error and R².

In order to estimate colonization and extinction rates with package `island`

, we need a dataframe of repeatedly sampled communities, organised in time and with at least one species (or the relevant taxonomic unit for our analysis). In this package we adopt the convention of indicating sampling times in columns and species in rows of the dataframe. As an example, take a look at the dataframe `data(alonso15)`

, that uses a regular sampling scheme for 4 years (2000 to 2003).

Species | Trophic.Level | Kd.2000 | Kd.2001 | Kd.2002 | Kd.2003 | Kd.2010 | Kd.2011 | Guild |
---|---|---|---|---|---|---|---|---|

Acanthurus auranticavus | NA | 1 | 1 | 1 | 1 | 1 | 1 | Algal Feeder |

Acanthurus leucosternon | 2 | 1 | 1 | 1 | 1 | 1 | 1 | Algal Feeder |

Acanthurus lineatus | 2 | 1 | 1 | 1 | 1 | 1 | 1 | Algal Feeder |

Acanthurus nigrofuscus | 2 | 1 | 1 | 1 | 1 | 1 | 1 | Algal Feeder |

Acanthurus triostegus | 2,78 | 1 | 1 | 1 | 1 | 1 | 1 | Algal Feeder |

Acanthurus xanthopterus | 2,87 | 0 | 0 | 0 | 1 | 1 | 1 | Algal Feeder |

In the case of irregular sampling schemes, we adopt a convention of recording subsequent time intervals in the column header. For example, if our sampling started on day 17, and the next sample was taken 20 days after, that is, on day 37, the column heads for those samples should read 17 and 37. These conventions have been followed in `data(simberloff)`

, extracted here:

Taxa | PRE | 31 | 51 | 69 | 258 | 276 | 295 | 371 | Tax. Unit 1 | Tax. Unit 2 | Genera | Island |
---|---|---|---|---|---|---|---|---|---|---|---|---|

Gen. Sp. | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Embioptera | Teratembiidae | Unknown | E2 |

Aglaopteryx sp. | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | Orthoptera | Blattidae | Aglaopteryx | E2 |

Latiblattella n. sp. | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | Orthoptera | Blattidae | Latiblattella | E2 |

Latiblattella rehni | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | Orthoptera | Blattidae | Latiblattella | E2 |

Cycloptilum sp. | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Orthoptera | Gryllidae | Cycloptilum | E2 |

Cyrtoxipha sp. | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | Orthoptera | Gryllidae | Cyrtoxipha | E2 |

In both examples, we can see that we have columns for the taxa (or species) studied, some columns with additional information (such as Guild or Taxonomic Unit considered), and consecutive columns with data on presence (1) or absence (0).

Studying ecological communities can be messy. Temporal studies that track changes over many sites may often result in sampling schemes that are irregular both in time and space. In cases where we have different sampling schemes in our data, we require the different sampling schemes to be treated as different dataframes inside a list. An example would be the data set `simberloff`

.

The seminal work of Wilson & Simberloff (1969) on the experimental defaunation of mangrove islands in the Florida Keys was the first experimental validation of the ETIB, that predicted that colonization and extinction on an island would balance out and reach a dynamic equilibrium. The basic model underlying the theory can be expressed as follows: \[ \frac{dS_s}{dt} = c(S_P - S_s) - eS_s \] where \(S_s\) is the number of species present at a site, \(S_P\) is the number of species in the regional pool, and \(c\) and \(e\) are colonization and extinction rates respectively. This model states that the temporal variation in the number of species in a site over time is equal to the total rate of arrival of species from the pool that are not already at the site minus the total rate of species loss from the site. The equation can be easily solved for a single species and we can compute the probability of a species being present or absent at any time (see Alonso *et al.* 2015). With these probabilities we can construct a matrix of transition probabilities for the associated Markov model: \[\begin{equation}
\label{T_00}
\left\{
\begin{array}{ccc}
T_{00} &=& 1 - \frac{c}{e+c} ( 1 - \exp(-(e+c)\Delta t))\\
T_{10} &=& \frac{c}{e+c} ( 1 - \exp(-(e+c)\Delta t) )\\
T_{01} &=& \frac{e}{e+c} ( 1 - \exp(-(e+c)\Delta t) ) \\
T_{11} &=& 1 - \frac{e}{e+c} ( 1 - \exp(-(e+c)\Delta t))
\end{array}
\right.
\end{equation}\] where \(T_{01}\) is the probability of extinction, \(T_{10}\) the probability of colonization, \(T_{11}\) of repeated presence and \(T_{00}\) of repeated absence. Two key assumptions, the assumption of *species equivalence* (the same parameters apply for a group of species), and the assumption of *species independence* (the presence of other species does not affect the dynamics of a focal species), allow to generalize this Markov process to groups of species or even communities. We can therefore calculate the likelihood of a given data set under the colonization-extinction model with the following expression for a regular sampling scheme given specific colonization and extinction rates, *c* and *e*: \[\begin{equation}
P( M | c, e) = (1-T_{10})^{N_{00}} T_{10}^{N_{10}} T_{01}^{N_{01}} (1-T_{01})^{N_{11}}
\end{equation}\] where \(N_{00}\) is the number of events of repeated absence, \(N_{10}\) events of colonization, \(N_{01}\) events of extinction and \(N_{11}\) events of repeated presence. In the case of an irregular sampling scheme we need to calculate the transition probabilities and the matching likelihood for each transition.

As an example, imagine that we have sampled arthropods on an island for three seasons and we have the following data:

Sp. | 1 | 2 | 3 |
---|---|---|---|

A | 0 | 0 | 1 |

B | 0 | 1 | 1 |

C | 1 | 1 | 1 |

D | 0 | 0 | 0 |

E | 0 | 1 | 0 |

F | 0 | 0 | 0 |

G | 1 | 0 | 0 |

H | 0 | 0 | 0 |

I | 1 | 1 | 0 |

J | 0 | 0 | 0 |

Here we can see that between samples 1 and 2 we have \(N_{01} = 1\) (G), \(N_{11} = 2\) (C, I), \(N_{00} = 5\) (A, D, F, H, J), and \(N_{10} = 2\) (B, E); we can similarly calculate transitions between samples 2 and 3. Assuming we already know the true value of transition probabilities (\(T_{01} = 0.4\), \(T_{11} = 0.6\), \(T_{00} = 0.7\) and \(T_{10} = 0.3\)), we can easily calculate the likelihood of the observed dataframe above as:

```
### LIKELIHOOD
0.4^3 * 0.6^4 * 0.7^10 * 0.3^3
#> [1] 6.325999e-06
# This is a pretty small likelihood. We can easily transform this likelihood into log-likelihood.
### LOG-LIKELIHOOD
3 * log(0.4) + 4 * log(0.6) + 10 * log(0.7) + 3 * log(0.3)
#> [1] -11.97084
# This is the log-likelihood associated with the given set of transition probabilities. Notice that we obtain the same value with both approaches.
log(0.4^3 * 0.6^4 * 0.7^10 * 0.3^3)
#> [1] -11.97084
```

The method above allows us to calculate the likelihood of our data set from known transition probabilities. However, say that we do not know the transition probabilities and want to estimate them. We can do this easily with the following code:

```
<- regular_sampling_scheme(x = df, vector = 2:4)
c_and_e
c_and_e#> c c_up c_low e e_up e_low N NLL
#> 1 0.3769053 NA NA 0.699967 NA NA 10 11.80301
```

This gives us the rates \(c\) and \(e\), but where are the transition probabilities? We obtain them with the function `cetotrans`

:

```
cetotrans(c = c_and_e[1], e = c_and_e[4])
#> T01 T10
#> [1,] 0.4285714 0.2307692
```

The equations \(T_{11}= 1 - T_{01}\) and \(T_{00} = 1 - T_{10}\) will give us a complete set of transition probabilities estimated from the data.

There are two main differences between transition probabilities and rates. First, transition probabilities are dimensionless numbers, while rates have dimensions of time\(^{-1}\). Second, with rates we can directly estimate the transition probabilities for any time interval, while transition probabilities are associated with a specific time interval, so if we have \(T_{10}=0.6\) for a specific time interval and we need the transition probability for twice that interval \(T_{10}' \ne 2 \times 0.6\) but if a colonization rate per week is \(c = 0.6 \ week^{-1}\) the colonization rate after 2 weeks would be \(c' = 2 \ weeks \times 0.6 \ week^{-1}= 1.2\). This property allows us to work with irregular sampling schemes —much more common in ecology than regular ones— naturally, simply using a pair of colonization and extinction rates.

The ideal scenario for ecological studies is one in which we sample our subject of study (say the community of fishes on an coral reef) regularly and frequently enough to observe changes in the community. Even though this is rare in ecology, this facilitates the estimation of rates and is less computationally intense. The estimation of rates for these regular sampling schemes has an analytically exact expression, while for irregular sampling schemes we need to rely on heuristic methods or numerical solvers (see section Irregular Sampling Schemes).

Throughout this section, we will use `data(alonso15)`

, a data set of three lists that have information on presence-absence of the community of coral reef fish communities in atolls in the Lakshadweep Archipelago (India). We will use data only from one of the atolls, Kadmat. The table below is an extract of the data:

```
data("alonso15")
<- alonso15[[2]]
df ::kable(head(df)) knitr
```

Species | Trophic.Level | Kd.2000 | Kd.2001 | Kd.2002 | Kd.2003 | Kd.2010 | Kd.2011 | Guild |
---|---|---|---|---|---|---|---|---|

Acanthurus auranticavus | NA | 1 | 1 | 1 | 1 | 1 | 1 | Algal Feeder |

Acanthurus leucosternon | 2 | 1 | 1 | 1 | 1 | 1 | 1 | Algal Feeder |

Acanthurus lineatus | 2 | 1 | 1 | 1 | 1 | 1 | 1 | Algal Feeder |

Acanthurus nigrofuscus | 2 | 1 | 1 | 1 | 1 | 1 | 1 | Algal Feeder |

Acanthurus triostegus | 2,78 | 1 | 1 | 1 | 1 | 1 | 1 | Algal Feeder |

Acanthurus xanthopterus | 2,87 | 0 | 0 | 0 | 1 | 1 | 1 | Algal Feeder |

We have several columns of data. The first column of the data.frame lists the species studied, while the second, its associated trophic level. The third to the eighth columns contain presence-absence data of temporal samples collected in consecutive years between 2000 to 2003 and once again in 2010-2011. Finally, the last column has data on the guild of the species studied. For example, the third entry in the data.frame has data for *Acanthurus lineatus*, an algal feeder present in all the years studied.

From here, we can start estimating colonization and extinction rates for the entire community using the function `regular_sampling_scheme`

. First, we have to specify a single data.frame in the function using argument `x =`

and the name of our data.frame. Next, we need to tell the function in which columns the presence-absence data located with the argument `vector =`

, in this case, columns 3 to 6. We will not use the data from 2010 and 2011. So let’s estimate the colonization and extinction rates:

```
<- regular_sampling_scheme(x = df, vector = 3:6)
rates
rates#> c c_up c_low e e_up e_low N NLL
#> 1 0.6034534 NA NA 0.3505777 NA NA 156 276.576
```

As we can see the colonization rate is *c = 0.603* and the extinction rate *e = 0.351*. In addition, the output of the function tells us that we have examined 156 species and that the Negative Log-Likelihood of this model is *NLL = 276.576*.

What if we want to know the rates for each guild? We just have to add to the call the argument `level`

, as in `level = "Guild"`

, indicating the name of the column that divides the data into groups, and `n`

, as in `n = 5`

indicating the minimum number of species that a group needs in order to estimate its rates.

```
<- regular_sampling_scheme(x = df, vector = 3:6, level = "Guild", n = 5)
rates_g
rates_g#> Group c c_up c_low e e_up e_low N NLL
#> 1 Algal Feeder 0.7685355 NA NA 0.16302269 NA NA 30 38.95667
#> 2 Corallivore 0.4427195 NA NA 0.39352848 NA NA 20 35.72338
#> 3 Macroinvertivore 1.0151676 NA NA 0.56465265 NA NA 41 78.09438
#> 4 Microinvertivore 0.4960388 NA NA 0.51204007 NA NA 21 39.36755
#> 5 Omnivore 0.5090504 NA NA 0.40724033 NA NA 9 16.33690
#> 6 Piscivore 0.5263375 NA NA 0.60002472 NA NA 21 40.03434
#> 7 Zooplanktivore 0.5124767 NA NA 0.09189237 NA NA 14 15.93931
```

This calculates the colonization and extinction rates for each guild separately, e.g. corallivores or piscivores, as well as the number of species in each guild and the negative log-likelihood associated with these rates and data.

The estimates of colonization and extinction rates can not be regarded as fixed values, but have an associated uncertainty due, at least in part, to the amount of data we use to calculate the estimates. In interpreting these results we recommend to calculate confidence intervals for the rates we have just estimated. Confidence intervals can be calculated using function `regular_sampling_scheme`

with the argument `CI = T`

. This option provides confidence intervals whose limits have a difference in log-likelihood of 1.96 units around the estimator.

We employ two separate methods to estimate confidence intervals: a sequential method, which is computationally expensive, and a binary search seeded with the Hessian of the likelihood function. The rationale of the sequential method is the following: we start from one of the parameters, say *c*, and we sequentially add a small amount `step`

to the parameter, obtaining the log-likelihood of the new value of the parameter, *c’*, until the difference in log-likelihood between *c* and *c’* climbs above 1.96. This gives us the upper boundary of *c*. The lower boundary is calculated similarly, but with sequential subtraction. This is a straightforward method, but it has to balance computational time and accuracy with the size of the interval `step`

.

The alternative method uses the Hessian of the likelihood function as a seed for a binary search to determine the intervals. After some algebra, the Hessian provides an asymptotic estimator of the confidence interval, thus being a good starting point to look for the actual interval. The method works by calculating the likelihood of this asymptotic estimator and expanding the search (if the difference with our estimate is less than 1.96) or narrowing it (if it is greater than 1.96). This method typically finds the confidence interval in about 10 steps, and is significantly less computationally intense than the sequential method. However, in some cases, the value of the Hessian reduces almost to zero or produces a non-invertible matrix. In these cases, we recommend using argument `step`

.

Coming back to our example, we can calculate confidence intervals either for the entire community or for each studied guilds:

```
<- regular_sampling_scheme(x = df, vector = 3:6, CI = T)
rates2a
rates2a#> c c_up c_low e e_up e_low N NLL
#> 1 0.6034534 0.7527623 0.4763309 0.3505777 0.4483115 0.2686129 156 276.576
# Sequential method
<- regular_sampling_scheme(x = df, vector = 3:6, level = "Guild", n = 1, step = 0.005, CI = T)
rates2b ::kable(rates2b) knitr
```

Group | c | c_up | c_low | e | e_up | e_low | N | NLL |
---|---|---|---|---|---|---|---|---|

Algal Feeder | 0.7685355 | 1.3135355 | 0.4035355 | 0.1630227 | 0.3230227 | 0.0680227 | 30 | 38.95667 |

Corallivore | 0.4427195 | 0.8127195 | 0.2077195 | 0.3935285 | 0.7485285 | 0.1735285 | 20 | 35.72338 |

Macroinvertivore | 1.0151676 | 1.5001676 | 0.6601676 | 0.5646527 | 0.8546527 | 0.3496527 | 41 | 78.09438 |

Microinvertivore | 0.4960388 | 0.8910388 | 0.2410388 | 0.5120401 | 0.9170401 | 0.2520401 | 21 | 39.36755 |

Omnivore | 0.5090504 | 1.2240504 | 0.1540504 | 0.4072403 | 0.9772403 | 0.1222403 | 9 | 16.33690 |

Piscivore | 0.5263375 | 0.9063375 | 0.2763375 | 0.6000247 | 1.1000247 | 0.2850247 | 21 | 40.03434 |

Zooplanktivore | 0.5124767 | 1.1274767 | 0.1774767 | 0.0918924 | 0.2918924 | 0.0118924 | 14 | 15.93931 |

```
# Hessian-based method
<- regular_sampling_scheme(x = df, vector = 3:6, level = "Guild", n = 1, CI = T)
rates2c ::kable(rates2c) knitr
```

Group | c | c_up | c_low | e | e_up | e_low | N | NLL |
---|---|---|---|---|---|---|---|---|

Algal Feeder | 0.7685355 | 1.3122641 | 0.4078379 | 0.1630227 | 0.3195759 | 0.0690967 | 30 | 38.95667 |

Corallivore | 0.4427195 | 0.8123286 | 0.2101798 | 0.3935285 | 0.7439781 | 0.1778156 | 20 | 35.72338 |

Macroinvertivore | 1.0151676 | 1.4990009 | 0.6626292 | 0.5646527 | 0.8532916 | 0.3539581 | 41 | 78.09438 |

Microinvertivore | 0.4960388 | 0.8864932 | 0.2458565 | 0.5120401 | 0.9149993 | 0.2538674 | 21 | 39.36755 |

Omnivore | 0.5090504 | 1.2212077 | 0.1557325 | 0.4072403 | 0.9772242 | 0.1240222 | 9 | 16.33690 |

Piscivore | 0.5263375 | 0.9051379 | 0.2774242 | 0.6000247 | 1.0951876 | 0.2879977 | 21 | 40.03434 |

Zooplanktivore | 0.5124767 | 1.1233754 | 0.1808364 | 0.0918924 | 0.2884887 | 0.0148875 | 14 | 15.93931 |

As we can see, when the number of species in a group is low, uncertainty increases and expands the confidence intervals. In addition, the differences between the two methods are negligible in their output, except for the computational time needed.

A likelihood profile is a simple way to bound the confidence interval of a parameter. Our sequential procedure is clearly based on likelihood profiles, and we will illustrate it using the function `NLL_rss`

, that calculates the NLL of a model given its parameters. There are equivalent functions, `NLL_isd`

, `NLL_imd`

and `NLL_env`

that produce the equivalent output with different sampling schemes or environmental influences on the parameters.

Next, for our case study, we will calculate the likelihood profile for colonization in Kadmat atoll. Function `NLL_rss`

needs a dataframe `x`

, a `vector`

with the number of the columns to analyze, and given `c`

and `e`

rates. The procedure iteratively calculates the likelihood of the parameter given the data around the estimated value. The shape of the likelihood profile gives us an indication on how peaked around the maximum our likelihood functions are.

```
rates#> c c_up c_low e e_up e_low N NLL
#> 1 0.6034534 NA NA 0.3505777 NA NA 156 276.576
<- seq.int(5, 200, by = 5) / 100
seqs <- NLL_rss(x = df, vector = 3:6, c = seqs * rates$c, e = rates$e)
NLLs plot(seqs * rates$c, NLLs, type = "l", xlab = "Colonization rate (year⁻¹)", ylab = "Negative Log-Likelihood", main = "Likelihood profile")
points(rates$c, rates$NLL, pch = 4, col = "magenta")
```

Model selection identifies the best model that fits a given data set. In this package we propose the use of the *Akaike Information Criterion* (AIC) and the associated measure of *weight of evidence* to choose the model that best fits the data and compares it against several alternative models. The functions in package `island`

provide Negative Log-Likelihoods (NLLs) that can easily be transformed to AIC with functions `akaikeic`

or `akaikeicc`

. We can then calculate the `weight_of_evidence`

for each competing model that potentially explains our observations. As a reminder, AICs are only comparable when we try to explain the same data with different models, because it makes no sense comparing the AIC of two different data sets with the same model, since the data is different in both cases.

Let us compare two competing models that try to explain the same data. We will use the data for Kadmat once again, to check if the model that treats each guild’s dynamics separately better explains the data than the model that works with a single colonization and extinction pair of rates for the entire assemblage. For this, we will extract the NLL of both models, calculate their AIC using function `akaikeic`

, and compare the results with function `weight_of_evidence`

.

```
<- sum(rates2c$NLL)
guild_NLL
guild_NLL#> [1] 264.4525
<- rates$NLL
comm_NLL
comm_NLL#> [1] 276.576
<- akaikeic(NLL = guild_NLL, k = 14)
aic_g <- akaikeic(NLL = comm_NLL, k = 2)
aic_c
aic_g#> [1] 556.905
aic_c#> [1] 557.1519
<- data.frame(Model = c("Guilds", "Community"), AIC = c(aic_g, aic_c))
ms_kadmat weight_of_evidence(data = ms_kadmat)
#> Model IncAIC w
#> 1 Guilds 0.000000 0.5308212
#> 2 Community 0.246883 0.4691788
```

The output indicates that the log-likelihood of the model with different dynamics for each guild is over 12 points better than the model with a single dynamic group, which means that considering the guilds separately better explains the data than lumping together all guilds in a single group. However, the “guilds” model has 14 parameters while the “community” model has only 2. It is unsurprising that the more parameters the model has, the better it explains the data. For this reason, we calculate the AIC of each model using function `akaikeic`

, that, to preserve parsimony, accounts for the number of parameters with the argument `k`

. AIC values suggest that the “guilds” model is only marginally better than the “community” model, with a difference of 0.25. This is a tiny difference, and we cannot conclusively say that one of the models has more support than the other. This analysis is backed by the function `weight_of_evidence`

, that provides a measure of the support that each of the competing models has. We found that the “guilds” model has a weight of evidence of 53% while the “community” model is 47% and, as already noted, we cannot conclude that any model is significantly better. We will explore this dataset in some more detail in the section *A short example*.

The package `island`

allows to simulate ecological data driven by colonization and extinction dynamics. Functions `data_generation`

and `PA_simulation`

allow us to generate species richness or presence-absence data from an initial vector of presence-absence for specific transition probabilities that can vary between different simulated transitions. As an example, we will simulate the temporal evolution in richness for Kadmat atoll.

```
### First, take a look at the rates for Kadmat
<- cetotrans(c = rates$c, e = rates$e)
tp
set.seed(10110111)
<- PA_simulation(x = df, column = 3, transitions = tp, times = 11)
sim_1 colSums(sim_1)
#> [1] 88 88 100 106 100 103 98 97 102 99 101
<- data_generation(x = df, column = 3, transitions = tp, iter = 999, times = 11)
sims 1]
sims[, #> [1] 85 98 91 97 102 97 92 93 96 100 102
<- apply(X = sims, MARGIN = 1, FUN = quantile, c(0.025, 0.975))
ic
ic#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
#> 2.5% 80 83.95 85 86 87 86 87 87 87 87 86
#> 97.5% 102 107.00 109 109 109 110 109 109 110 109 110
rates #> c c_up c_low e e_up e_low N NLL
#> 1 0.6034534 NA NA 0.3505777 NA NA 156 276.576
```

In the simulation above, we use colonization and extinction rates estimated with the first 4 samples to simulate community dynamics from 2000 to 2011. We can use this to evaluate how reliably our parameter estimates of colonization-extinction dynamics fit observed data. The plot above shows the actual observations, the result of a simulation (as a black dotted line) and the 95% distribution of the simulations of the dynamics (the area enclosed within the two red dashed lines). The actual data falls well within the boundaries delimited by the simulation except for the last sample.

Finally, we will estimate goodness-of-fit calculating model R^2. Any estimation of R^2 requires a null model to compare with. R^2 measures how well our model performs in relation to a null model of choice. As a null model, here we assume that at any given time we can have a number of species present from 0 to \(S_P\), the number of species in the pool, which is drawn from a uniform probability distribution. In order to estimate goodness-of-fit, we use the function `r_squared`

, which requires us to specify the `observed`

species richness, `simulated`

species richness and the number of species in the pool `sp`

.

```
<- apply(sims, 1, quantile, 0.5)
simulated
simulated #> [1] 91 96 97 98 99 98 99 98 98 99 99
<- simulated[c(1:3, 10, 11)]
simulated
<- colSums(df[, 3:8])
observed
observed#> Kd.2000 Kd.2001 Kd.2002 Kd.2003 Kd.2010 Kd.2011
#> 79 91 100 95 103 120
# We skip the first observation since we have to use it as the starting point of the simulations.
<- observed[-1]
observed
r_squared(observed = observed, simulated = simulated, sp = 156)
#> R-Squared
#> 0.9651524
```

The output indicates that we have a good fit compared to our null model. However, changes in the null model would lead to different estimates of R². For this reason, we also include function `simulated_model`

that produces the quadratic error of a model given some data. We can estimate a new R² using a different null model and estimating its error with the previous function using the equation below: \[ R^2 = 1 - \frac{\epsilon^2}{\epsilon^2 _0 }\]

Given the complexities and challenges inherent to collecting real-world ecological data, temporal monitoring is often patchy, making irregular sampling schemes common in ecology. Data typically includes one or even several data sets with different sampling schemes. To accommodate the messiness of real-world data, ‘island’ has two functions `irregular_single_dataset`

and `irregular_multiple_datasets`

. Irregular sampling schemes force to alter the way in which we calculate colonization and extinction rates, precluding the use of the algebraic (exact) method for estimating rates. We can still calculate the likelihood of the dynamic model using two methods with different approaches: a *heuristic* method and a *semianalytical* method.

The *heuristic* or *semianalytical* approaches of calculating rates are implemented in functions `irregular_single_dataset`

and `irregular_multiple_dataset`

, and we can switch between approaches using the argument `jacobian`

.

The *heuristic* method uses R’s built-in optimization routine, `optim`

, to obtain estimates for the colonization and extinction rates. The function `optim`

uses an implementation of the Nelder-Mead algorithm to find the optimum of the likelihood function. However, *heuristic* methods do not guarantee finding the global optima of the objective function, and they do not have a mechanism to evaluate how good the optima are for the selected point.

In contrast, the *semianalytical* method guarantees that the optima calculated is a root of the determinant of the Jacobian matrix of the likelihood function when the algorithm converges. In this package, we call routine `multiroot`

from package `rootSolve`

in order to find the semianalytical optimum value for our likelihood function.

Even though we encourage the use of the Jacobian-based method, it may not always converge. The function will notify possible problems, and we recommend using the heuristic method if difficulties are encountered. One way to find good initial values for the Jacobian-based method is to use the estimates of the heuristic method as a starting point.

The function that deals with an irregular sampling scheme in a single data set is called `irregular_single_dataset`

. An example, partially shown below (some samples have been skipped) and similar to the one reproduced earlier, is data of island ST2 in `data(simberloff)`

:

Taxa | PRE | 21 | 40 | 59 | 231 | 250 | 322 | Tax. Unit 1 | Tax. Unit 2 | Genera | Island |
---|---|---|---|---|---|---|---|---|---|---|---|

Diradius caribbeana | 1 | 0 | 0 | 0 | 0 | 0 | 0 | Embioptera | Teratembiidae | Diradius | ST2 |

Latiblattella n. sp. | 1 | 0 | 1 | 1 | 1 | 1 | 1 | Orthoptera | Blattidae | Latiblattella | ST2 |

Cycloptilum sp. | 1 | 0 | 0 | 0 | 0 | 0 | 0 | Orthoptera | Gryllidae | Cycloptilum | ST2 |

Cyrtoxipha sp. | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Orthoptera | Gryllidae | Cyrtoxipha | ST2 |

Tafalisca lurida | 1 | 0 | 0 | 0 | 1 | 1 | 1 | Orthoptera | Gryllidae | Tafalisca | ST2 |

Kalotermes joutelli | 0 | 0 | 0 | 1 | 0 | 0 | 0 | Isoptera | Kalotermitidae | Kalotermes | ST2 |

Functions `irregular_single_dataset`

as well as `irregular_multiple_datasets`

reproduce all the functionalities of `regular_sampling_schemes`

. Please note that `irregular_single_dataset`

requires numbers for the names of the columns with presence-absence data. The only other difference with the function for the regular sampling schemes is that we need to enter priors for *c* and *e*, our colonization and extinction rates, using arguments `c =`

and `e =`

.

Below we illustrate the use of this function with the data for island ST2 of data set `simberloff`

. This data set was published by Simberloff and Wilson (1969). We start by estimating colonization and extinction rates for the whole island.

```
<- simberloff[[6]]
st2
# Before we begin, it is useful to look at the data for obvious errors.
head(st2)
#> Taxa PRE 21 40 59 77 94 113 134 149 171 191 211 231 250 322
#> 1 Diradius caribbeana 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0
#> 2 Latiblattella n. sp. 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1
#> 3 Cycloptilum sp. 1 0 0 0 0 0 1 1 1 1 0 0 0 0 0
#> 4 Cyrtoxipha sp. 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0
#> 5 Tafalisca lurida 1 0 0 0 0 1 1 1 1 1 1 1 1 1 1
#> 6 Kalotermes joutelli 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
#> Tax. Unit 1 Tax. Unit 2 Genera Island
#> 1 Embioptera Teratembiidae Diradius ST2
#> 2 Orthoptera Blattidae Latiblattella ST2
#> 3 Orthoptera Gryllidae Cycloptilum ST2
#> 4 Orthoptera Gryllidae Cyrtoxipha ST2
#> 5 Orthoptera Gryllidae Tafalisca ST2
#> 6 Isoptera Kalotermitidae Kalotermes ST2
# Presence-absence data is in columns 3 to 16.
<- irregular_single_dataset(dataframe = st2, vector = 3:16, c = 0.001, e = 0.001, assembly = T, jacobian = T)
rates.st2
rates.st2#> c c_up c_low e e_up e_low N NLL
#> 1 0.00670301 NA NA 0.0127161 NA NA 80 457.6403
# We will now estimate rates for different groups in "Tax. Unit 1" with more than 5 species.
<- irregular_single_dataset(dataframe = st2, vector = 3:16, c = 0.001, e = 0.001, column = "Tax. Unit 1", n = 5, assembly = T, jacobian = T)
rates.groups
rates.groups#> Group c c_up c_low e e_up e_low N NLL
#> 1 Acarina 0.004820490 NA NA 0.008007749 NA NA 8 35.63007
#> 2 Araneae 0.008426681 NA NA 0.017747947 NA NA 11 70.27852
#> 3 Coleoptera 0.006089566 NA NA 0.014810771 NA NA 8 43.01984
#> 4 Hymenoptera 0.007776273 NA NA 0.010184661 NA NA 20 122.89718
#> 5 Lepidoptera 0.007806117 NA NA 0.012146876 NA NA 8 49.23273
#> 6 Psocoptera 0.008107215 NA NA 0.017269065 NA NA 5 30.16782
# Plotting the results.
plot(rates.groups[, 2], rates.groups[, 5], xlab = "Colonization rate (day⁻¹)", ylab = "Extinction rate (day⁻¹)", main = "ST2")
points(rates.st2[, 1], rates.st2[, 4], pch = 4)
<- list(x = c(0.004902829, 0.008421634, 0.006021524, 0.007679227, 0.007892796, 0.008025005, 0.006692741), y = c(0.008578302, 0.017145964, 0.014431104, 0.009668973, 0.011560750, 0.016878929, 0.011553715))
loc
text(loc, c(substr(rates.groups[, 1], 1, 3), "Whole \ncommunity"))
```

We can also produce transition probabilities for different time intervals and simulate these time intervals, allowing a more efficient use of processor time and memory. Functions `cetotrans`

and `data_generation`

are optimized to use with irregular time intervals. In the case of function `cetotrans`

, arguments `c`

, `e`

and `dt`

must be vectors of the same length, while for function `data_generation`

arguments `transitions`

and `times`

should have the same number of rows and the same value, respectively. We simulate colonization and extinction dynamics for island ST2 300 times, and we observe that species richness reaches a steady-state after a while.

```
<- as.numeric(colnames(st2)[4:16]) - as.numeric(colnames(st2)[3:15])
dts <- c(21, dts)
dts
<- cetotrans(c = rep(rates.st2[[1]], length(dts)), e = rep(rates.st2[[4]], length(dts)), dt = dts)
tps
tps#> T01 T10
#> [1,] 0.2192933 0.11559563
#> [2,] 0.2020453 0.10650373
#> [3,] 0.2020453 0.10650373
#> [4,] 0.1931669 0.10182363
#> [5,] 0.1841143 0.09705176
#> [6,] 0.2020453 0.10650373
#> [7,] 0.2192933 0.11559563
#> [8,] 0.1654731 0.08722548
#> [9,] 0.2276694 0.12001087
#> [10,] 0.2107531 0.11109382
#> [11,] 0.2107531 0.11109382
#> [12,] 0.2107531 0.11109382
#> [13,] 0.2020453 0.10650373
#> [14,] 0.4930516 0.25990123
<- data_generation(x = matrix(0, 80, 1), column = 1, transitions = tps, iter = 300, times = length(dts))
sims <- apply(sims, 1, quantile, probs = c(0.025, 0.975))
ic
ic#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
#> 2.5% 4 8.000 11 14.000 15 18.000 17 18 18 19.000 19 20
#> 97.5% 15 21.525 25 29.525 31 33.525 35 35 36 35.525 35 36
#> [,13] [,14]
#> 2.5% 19 20
#> 97.5% 36 37
<- apply(sims, 1, median)
med.st2
<- c(0, colnames(st2)[3:16])
days
plot(days, c(0, colSums(st2[, 3:16])), ylim = c(0, 40), xlab = "Days since defaunation", ylab = "Species richness", main = "ST2", type = "b")
lines(days, c(0, med.st2), col = "green", lty = 2)
lines(days, c(0, ic[1, ]), col = "magenta", lty = 3)
lines(days, c(0, ic[2, ]), col = "magenta", lty = 3)
```

Sampling schemes for similar communities can be very different due to multiple reasons usually associated to the challenges of ecological research. The function `irregular_multiple_datasets`

allows us to use data from different sampling schemes and estimate joint parameters for these data sets. In order to use these data sets, they need to comply with the general structure for irregular sampling schemes and they should be combined in a list. The argument `list`

will collate the data sets we want to study jointly and the argument `vectorlist`

must contain the vectors (ordered in time) that indicate where the presence-absence data is located. The remaining arguments work as discussed in earlier sections. Next we estimate colonization and extinction rates for each island and each taxonomic group in the whole dataset `simberloff`

.

```
<- irregular_multiple_datasets(list = simberloff, vectorlist = list(3:17, 3:18, 3:17, 3:19, 3:17, 3:16), c = 0.001, e = 0.001, jacobian = T)
rates.simberloff <- irregular_multiple_datasets(list = simberloff, vectorlist = list(3:17, 3:18, 3:17, 3:19, 3:17, 3:16), c = 0.001, e = 0.001, column = "Island", n = 5, jacobian = T)
rates.islands <- irregular_multiple_datasets(list = simberloff, vectorlist = list(3:17, 3:18, 3:17, 3:19, 3:17, 3:16), c = 0.0001, e = 0.0001, column = "Tax. Unit 1", n = 20, jacobian = T)
rates.taxonomy
rates.simberloff#> c c_up c_low e e_up e_low N NLL
#> 1 0.005102348 NA NA 0.01175197 NA NA 549 2938.645
rates.islands#> Group c c_up c_low e e_up e_low N NLL
#> 1 E1 0.003849172 NA NA 0.008577194 NA NA 54 226.9079
#> 2 E2 0.005353969 NA NA 0.013019739 NA NA 119 654.2403
#> 3 E3 0.005637932 NA NA 0.011374938 NA NA 92 486.9761
#> 4 E7 0.004133691 NA NA 0.011507033 NA NA 99 600.8253
#> 5 E9 0.005261473 NA NA 0.011595805 NA NA 105 526.3934
#> 6 ST2 0.006901374 NA NA 0.012774564 NA NA 80 433.5353
rates.taxonomy#> Group c c_up c_low e e_up e_low N NLL
#> 1 Acarina 0.005007757 NA NA 0.010790814 NA NA 48 240.7149
#> 2 Araneae 0.005658833 NA NA 0.016369729 NA NA 85 488.8871
#> 3 Coleoptera 0.004985066 NA NA 0.011884045 NA NA 65 349.9689
#> 4 Hemiptera 0.006196764 NA NA 0.019892102 NA NA 34 185.8024
#> 5 Hymenoptera 0.004933051 NA NA 0.011527283 NA NA 131 689.1720
#> 6 Lepidoptera 0.005971425 NA NA 0.007636927 NA NA 44 241.2960
#> 7 Orthoptera 0.005429715 NA NA 0.005220060 NA NA 24 122.8997
#> 8 Psocoptera 0.005809177 NA NA 0.016437311 NA NA 46 267.1539
plot(rates.taxonomy[, 2], rates.taxonomy[, 5], xlim = c(.0034, .0074), ylim = c(.0025, .0225), pch = 20, main = "Simberloff dataset", xlab = "Colonization rate (day⁻¹)", ylab = "Extinction rate (day⁻¹)")
points(rates.islands[, 2], rates.islands[, 5], pch = 4)
points(rates.simberloff[, 1], rates.simberloff[, 4], pch = 3)
<- list(x = c(0.005136858, 0.003858476, 0.005435459, 0.005724728, 0.004026439, 0.005379471, 0.006919129, 0.005015552, 0.005594090, 0.004912908, 0.006303266, 0.004722939, 0.006069984, 0.005435459, 0.005892691), y = c(0.01074532, 0.007403081, 0.014011605, 0.010289563, 0.012644324, 0.010593403, 0.011353003, 0.009985722, 0.017429807, 0.012872204, 0.018645168, 0.011201083, 0.006415600, 0.004212758, 0.015454846))
locs text(locs, c("TIB", as.character(rates.islands$Group), substr(rates.taxonomy[, 1], 1, 3)),
col = c("black", rep("magenta", 6), rep("green", 8)))
```

The rates obtained indicate differences among islands, specially in colonization, while the different taxonomic groups show different rates.

In this section, we will reproduce, step-by-step, some of the results presented in the article by Alonso *et al.* (2015), on the recolonization process of a coral reef fish community in the Lakshadweep Archipelago after a coral mass mortality event. This article shows that higher trophic guilds suffer increased rates of extinction even in the absence of targeted fishing.

The article estimated colonization and extinction rates for 4 competing models to explain the dynamics of the coral fish community, performed model selection among those models and simulated the dynamics of different trophic guilds in each island. The four models were: A, the different guilds in the three sampled islands behaved as a single community; B, each island had a different dynamics; C, each guild had a different dynamics, but did not vary between islands; and D, each guild had a different dynamics in each island. Below, we reproduce part of Table 1 of the article, showing the model selection procedure, Figure 3, showing the transition probabilities found by the best model, and a plot for the temporal evolution of corallivorous fishes in Agatti atoll, included in the original article in Figure 2.

```
data(alonso15)
head(alonso15[[2]]) # Examine the data
#> Species Trophic.Level Kd.2000 Kd.2001 Kd.2002 Kd.2003 Kd.2010
#> 1 Acanthurus auranticavus <NA> 1 1 1 1 1
#> 2 Acanthurus leucosternon 2 1 1 1 1 1
#> 3 Acanthurus lineatus 2 1 1 1 1 1
#> 4 Acanthurus nigrofuscus 2 1 1 1 1 1
#> 5 Acanthurus triostegus 2,78 1 1 1 1 1
#> 6 Acanthurus xanthopterus 2,87 0 0 0 1 1
#> Kd.2011 Guild
#> 1 1 Algal Feeder
#> 2 1 Algal Feeder
#> 3 1 Algal Feeder
#> 4 1 Algal Feeder
#> 5 1 Algal Feeder
#> 6 1 Algal Feeder
# To begin, we calculate rates for each island (model B), and for each group in each island (model D).
<- regular_sampling_scheme(x = alonso15[[1]], vector = 3:6)
rates.aga <- regular_sampling_scheme(x = alonso15[[2]], vector = 3:6)
rates.kad <- regular_sampling_scheme(x = alonso15[[3]], vector = 3:5)
rates.kav
<- rates.aga$NLL + rates.kad$NLL + rates.kav$NLL
ModelB
<- regular_sampling_scheme(x = alonso15[[1]], vector = 3:6, level = "Guild", n = 1)
rates.aga.guild <- regular_sampling_scheme(x = alonso15[[2]], vector = 3:6, level = "Guild", n = 1)
rates.kad.guild <- regular_sampling_scheme(x = alonso15[[3]], vector = 3:5, level = "Guild", n = 1)
rates.kav.guild
<- sum(rates.aga.guild$NLL) + sum(rates.kad.guild$NLL) + sum(rates.kav.guild$NLL)
ModelD
# Next, we calculate rates using the function irregular_multiple_datasets. We need to first change the colnames of the columns with data.
colnames(alonso15[[1]])[3:6] <- 2000:2003
colnames(alonso15[[2]])[3:6] <- 2000:2003
colnames(alonso15[[3]])[3:5] <- 2001:2003
<- irregular_multiple_datasets(list = alonso15, vectorlist = list(3:6, 3:6, 3:5), c = 0.1, e = 0.1, jacobian = T, CI = T)
rates.lak
<- irregular_multiple_datasets(list = alonso15, vectorlist = list(3:6, 3:6, 3:5), c = 0.1, e = 0.1, jacobian = T, column = "Guild", n = 1, CI = T)
rates.guild
# We can now follow the model selection procedure used in the article.
<- rates.lak$NLL
ModelA <- sum(rates.guild$NLL)
ModelC
<- c(ModelA, ModelB, ModelC, ModelD)
NLL
NLL#> [1] 742.3037 741.1601 725.4704 710.7865
<- c(2, 6, 14, 42)
Parameters <- rep(1248, 4)
N
<- akaikeic(NLL = NLL, k = Parameters)
AIC <- akaikeicc(NLL = NLL, k = Parameters, n = 1248)
AICc
<- weight_of_evidence(data = data.frame(Model = c("A", "B", "C", "D"), AICc = AICc))
Table1 <- cbind(Table1, AIC, AICc, NLL, Parameters, N)
Table1 ::kable(Table1) knitr
```

Model | IncAIC | w | AIC | AICc | NLL | Parameters | N |
---|---|---|---|---|---|---|---|

A | 9.335557 | 0.0093009 | 1488.607 | 1488.617 | 742.3037 | 2 | 1248 |

B | 15.106378 | 0.0005193 | 1494.320 | 1494.388 | 741.1601 | 6 | 1248 |

C | 0.000000 | 0.9901794 | 1478.941 | 1479.282 | 725.4704 | 14 | 1248 |

D | 29.289027 | 0.0000004 | 1505.573 | 1508.571 | 710.7865 | 42 | 1248 |

```
# The following steps are used to reproduce Figure 3 in Alonso *et al* (2015).
rates.guild#> Group c c_up c_low e e_up e_low
#> 1 Algal Feeder 0.6314581 0.8750604 0.4396526 0.1913754 0.2873993 0.1198905
#> 2 Corallivore 0.4505997 0.6573846 0.2938969 0.2743163 0.4345424 0.1598409
#> 3 Macroinvertivore 0.5286944 0.6873453 0.3977289 0.4329315 0.5721428 0.3191255
#> 4 Microinvertivore 0.4823432 0.6950086 0.3201004 0.5565499 0.8014445 0.3698255
#> 5 Omnivore 0.5107695 0.8562513 0.2773004 0.2907457 0.5674376 0.1238472
#> 6 Piscivore 0.6428574 0.8878366 0.4518301 0.6607665 0.9631957 0.4321006
#> 7 Zooplanktivore 0.7924913 1.1952252 0.4994883 0.2661351 0.4598341 0.1371437
#> N NLL
#> 1 90 116.33150
#> 2 60 89.39827
#> 3 123 201.79017
#> 4 63 105.29037
#> 5 27 41.50041
#> 6 63 110.06134
#> 7 42 61.09837
# transform rates into transition probabilities.
<- cetotrans(c = rates.guild$c, e = rates.guild$e)
tps
plot(tps[, 2], tps[, 1], xlim = c(0, 0.7), ylim = c(0, 0.5), type = "n", xlab = "Colonization probability", ylab = "Extinction probability", main = "Lakshadweep guilds")
text(tps[, 2], tps[, 1], c("A", "C", "M", "m", "O", "P", "Z"))
text(0.07, 0.25, "Trophic level", srt = 90, col = "magenta")
arrows(0.1, 0.1, 0.1, 0.4, code = 2, col = "green")
```

```
rates.aga.guild#> Group c c_up c_low e e_up e_low N NLL
#> 1 Algal Feeder 0.5058664 NA NA 0.2360710 NA NA 30 49.00071
#> 2 Corallivore 0.5902927 NA NA 0.2459553 NA NA 20 33.70719
#> 3 Macroinvertivore 0.3438327 NA NA 0.3625524 NA NA 41 69.42637
#> 4 Microinvertivore 0.5865863 NA NA 0.7502420 NA NA 21 41.07112
#> 5 Omnivore 0.9037842 NA NA 0.2397795 NA NA 9 14.71404
#> 6 Piscivore 0.6161308 NA NA 0.7701635 NA NA 21 41.12469
#> 7 Zooplanktivore 0.8724753 NA NA 0.4800701 NA NA 14 26.87099
<- cetotrans(c = rates.aga.guild[2, 2], e = rates.aga.guild[2, 5])
tps
<- alonso15[[1]][alonso15[[1]]$Guild == "Corallivore", ]
cor.agatti <- colSums(cor.agatti[, 3:8])
richness
richness#> 2000 2001 2002 2003 A.2010 A.2011
#> 7 11 12 14 14 18
<- data_generation(x = cor.agatti, column = 3, transitions = tps, iter = 300, times = 11)
sims <- apply(X = sims, MARGIN = 1, FUN = quantile, probs = c(0.025, 0.975))
sims.ic
plot(c(2000:2003, 2010, 2011), richness, type = "b", ylim = c(0, 21), col = "red", xlab = "Time (Year)", ylab = "Species", main = "Corallivore/AGATTI")
lines(c(2000:2011), c(7, sims.ic[1, ]), lty = 2)
lines(c(2000:2011), c(7, sims.ic[2, ]), lty = 2)
```

Apart from several biotic factors (e.g.. the presence of top predators or the abundance of primary producers), the number of species in a community can also be influenced by several abiotic factors (e.g.. temperature or rainfall) - referred to as environmental covariates. The package `island`

includes functions `all_environmental_fit`

, `greedy_environmental_fit`

, `custom_environmental_fit`

and `rates_calculator`

to analyze the influence of environmental covariates on the colonization and extinction dynamics of ecological communities. The functions assume a linear functional response for environmental covariates, \[ c_t = \alpha_0 + \sum_{i=1}^F \alpha_{i} Y_{it}, \quad e_t = \beta_0 + \sum_{i=1}^F \beta_{i} Y_{it},\] where \(Y_{it}\) is the value of the \(i\)-th environmental covariate (\(i=1,\dots,F\)) at time \(t\), since it is the simplest way to build this dependency into colonization and extinction rates.

We employ R expressions in these functions. An `expression`

is an R object, frequently a call, symbol or constant, that has to be evaluated prior to its use. We have minimized the use of expressions to make certain function calls easier to understand. However we make use of them internally and provide function `custom_environmental_fit`

that needs two expressions in order to hone the results from `all_environmental_fit`

and `greedy_environmental_fit`

as well as developing custom dependencies with environmental covariates that can be specified by advanced users.

All three functions for environmental fit need a single argument `dataset`

with the same structure as the one for irregular schemes and an argument `vector`

indicating the columns containing presence absence data. In addition, all three functions need another `data.frame`

with related environmental covariates in columns. The names of these columns have to be specified to functions `all_environmental_fit`

and `greedy_environmental_fit`

with the argument `env`

. These two functions also need arguments `c`

, `e`

and `aic`

, this is, tentative values for colonization and extinction rates and a tentative AIC value for the model. In practice, this AIC value should be chosen very large (of the order of 10⁸ but this depends on the size of the data set). The difference between `all_environmental_fit`

and `greedy_environmental_fit`

is that the latter employs a greedy algorithm that sequentially adds environmental covariates, one at a time, to the previously best set (using AIC), i. e., the algorithm finds first the best environmental covariate and then the best combination with two covariates including the previously selected, and so forth until AIC does not justify the addition of new covariates. In contrast `all_environmental_fit`

tries all combinations of environmental variables, starting with the minimum number of environmental covariates to the maximum. Since the number of combinations rapidly becomes unmanageable, this method is not recommended except when we have few environmental variables.

We illustrate the use of these functions using `data(idaho)`

. This is a long-term time series of a mapped plant community in Dubois, Idaho (USA). The plant community of a sagebrush steppe was mapped and identified from 1923 to 1973, using permanent 1-m² quadrats, and precipitation, mean temperature and snowfall were measured monthly (we added the annual mean for each one). The data included in this package is a sample of the full data set corresponding to several quadrats surveyed from 1932 to 1955 with some gaps. The following code explores the relation of this plant community with total annual precipitation and mean monthly temperature.

```
data(idaho)
<- idaho[[1]]
df <- idaho[[2]]
env
# Examine the colnames.
colnames(df)
#> [1] "quad" "species" "32" "33" "34" "35" "36"
#> [8] "37" "38" "39" "40" "41" "42" "45"
#> [15] "46" "47" "49" "50" "51" "52" "54"
#> [22] "55" "56"
colnames(env)
#> [1] "YEAR" "JAN.ppt" "FEB.ppt" "MAR.ppt" "APR.ppt"
#> [6] "MAY.ppt" "JUN.ppt" "JUL.ppt" "AUG.ppt" "SEP.ppt"
#> [11] "OCT.ppt" "NOV.ppt" "DEC.ppt" "TOTAL.ppt" "JAN.temp"
#> [16] "FEB.temp" "MAR.temp" "APR.temp" "MAY.temp" "JUN.temp"
#> [21] "JUL.temp" "AUG.temp" "SEP.temp" "OCT.temp" "NOV.temp"
#> [26] "DEC.temp" "ANNUAL.temp" "JAN.snow" "FEB.snow" "MAR.snow"
#> [31] "APR.snow" "MAY.snow" "JUN.snow" "JUL.snow" "AUG.snow"
#> [36] "SEP.snow" "OCT.snow" "NOV.snow" "DEC.snow" "TOTAL.snow"
# Estimate the influence of temperature and precipitation on colonization and extinction dynamics. We first include the call to all_environmental_fit but we do not run it because of computation limits. We then use the greedy algorithm to find a good dependency.
<- greedy_environmental_fit(dataset = df, vector = 3:23, env = c("env$TOTAL.ppt", "env$ANNUAL.temp"), c = 0.13, e = 0.19, aic = 100000)
envfit
envfit#> $Colonization
#> expression(params[1] * env$TOTAL.ppt[i] + params[3])
#>
#> $Extinction
#> expression(params[2] * env$ANNUAL.temp[i] + params[4])
#>
#> $Results
#> $Results$par
#> [1] -0.00497925 -0.01729602 0.19006501 0.93486956
#>
#> $Results$value
#> [1] -1874.859
#>
#> $Results$counts
#> function gradient
#> 215 NA
#>
#> $Results$convergence
#> [1] 0
#>
#> $Results$message
#> NULL
all_environmental_fit(dataset = df, vector = 3:23, env = c("env$TOTAL.ppt", "env$ANNUAL.temp"), c = 0.13, e = 0.19, aic = 100000)
#> $Colonization
#> expression(params[1] * env$TOTAL.ppt[i] + params[3])
#>
#> $Extinction
#> expression(params[2] * env$ANNUAL.temp[i] + params[4])
#>
#> $Results
#> $Results$par
#> [1] -0.00497925 -0.01729602 0.19006501 0.93486956
#>
#> $Results$value
#> [1] -1874.859
#>
#> $Results$counts
#> function gradient
#> 215 NA
#>
#> $Results$convergence
#> [1] 0
#>
#> $Results$message
#> NULL
```

The greedy algorithm identified a model that included precipitation in the expression for colonization and temperature in the expression for extinction. The output of the function is a list that indicates that its first component is the expression for the colonization, the second one is the expression for the extinction, and the third one is the result of the optimizer for these expressions; in this case, the output of the function indicates that the colonization rate depends on precipitation (with a coefficient of -0.00497925 and an intercept of 0.19006501), the extinction rate depends on temperature (with a coefficient of -0.01729602 and an intercept of 0.93486956), and it has a log-likelihood of 1874.859 that was found after 215 calls to the corresponding likelihood function. This result was backed by function `all_environmental_fit`

, that found the same dependency with the environmental covariates.

We will now use the function `custom_environmental_fit`

to demonstrate its use and improve the result we found earlier. Besides arguments `dataset`

and `vector`

, this function requires an argument `params`

that corresponds to the parameters estimated for expressions `c_expression`

and `e_expression`

, for colonization and extinction. Next, we use `rates_calculator`

, that uses arguments `params`

, `c_expression`

and `e_expression`

as in the previous function plus argument `t`

that indicates the number of colonization and extinction rates needed. This function calculates the actual rates at each time in order to be able to simulate the dynamics of the colonization and extinction process resulting from these parameters. Please note that these rates have dimensions of time⁻¹, so when converting rates to transition probabilities we have to indicate the interval of time between samples for each rate.

```
custom_environmental_fit(dataset = df, vector = 3:23, params = envfit$Results$par, c_expression = envfit$Colonization, e_expression = envfit$Extinction)
#> $par
#> [1] -0.004973345 -0.017281488 0.190021313 0.934197043
#>
#> $value
#> [1] -1874.859
#>
#> $counts
#> function gradient
#> 97 NA
#>
#> $convergence
#> [1] 0
#>
#> $message
#> NULL
<- rates_calculator(params = envfit$Results$par, c_expression = envfit$Colonization, e_expression = envfit$Extinction, t = 21)
rates.env head(rates.env)
#> c e
#> [1,] 0.1398742 0.23293924
#> [2,] 0.1448036 0.19041984
#> [3,] 0.1491356 0.08866157
#> [4,] 0.1445547 0.18681650
#> [5,] 0.1462476 0.17917743
#> [6,] 0.1460484 0.21189574
<- as.numeric(colnames(df)[4:23]) - as.numeric(colnames(df)[3:22])
dts
dts#> [1] 1 1 1 1 1 1 1 1 1 1 3 1 1 2 1 1 1 2 1 1
<- cetotrans(c = rates.env[-21, 1], e = rates.env[-21, 2], dt = dts)
tps <- data_generation(x = df, column = 3, transitions = tps, iter = 100, times = 20)
sims <- apply(sims, 1, quantile, probs = c(0.025, 0.975))
sims.ic <- apply(sims, 1, quantile, probs = .5)
med.ida
sims.ic#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> 2.5% 52.950 60.950 73.475 75.475 79.95 80.000 74.475 79 80.000 76.95
#> 97.5% 73.525 84.525 98.000 101.000 106.05 104.525 101.050 105 102.525 106.00
#> [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
#> 2.5% 74.475 70.000 71.475 75.425 75.95 75.00 72 71.00 74.475 73.00
#> 97.5% 100.575 94.525 97.000 101.525 104.05 99.05 98 100.05 106.525 100.05
<- 1900 + as.numeric(colnames(df)[3:23])
days
plot(days, colSums(df[, 3:23]), type = "b", ylim = c(50, 115), pch = 4, ylab = "Accumulated species richness", xlab = "Year", main = "Sagebrush steppe Dubois, ID, USA")
lines(days, c(57, sims.ic[1, ]), col = "magenta", lty = 3)
lines(days, c(57, sims.ic[2, ]), col = "magenta", lty = 3)
lines(days, c(57, med.ida), col = "green", lty = 2)
```

This vignette comprehensively illustrates most of the functionality of the package `island`

. This package allows the estimation of colonization and extinction rates for whole communities or groups of species under two basic assumptions: *species independence* and *species equivalence*. It allows the user to employ typically messy ecological monitoring data with a variety of sampling schemes. It also enables explorations of the influence of environmental variables on colonization and extinction rates, and helps identify the best model explaining data using a model selection procedure. Finally, it simulates the dynamics of estimated parameters for simple stochastic models of island biogeography. For more details, take a look at the other vignettes of the package.