Hostname: page-component-848d4c4894-p2v8j Total loading time: 0 Render date: 2024-06-11T03:46:11.793Z Has data issue: false hasContentIssue false

Statistical analysis of species association indices

Published online by Cambridge University Press:  17 April 2024

Manuel Mendoza
Affiliation:
Departamento de Estadística, Instituto Tecnológico Autónomo de México, Ciudad de México, México
Eduardo Mendoza*
Affiliation:
Instituto de Investigaciones sobre los Recursos Naturales, Universidad Michoacana de San Nicolás de Hidalgo, Morelia, Michoacán, México
E. Gutiérrez-Peña
Affiliation:
Departamento de Probabilidad y Estadística, Universidad Nacional Autónoma de México, Ciudad de México, México
*
Corresponding author: Eduardo Mendoza; Email: eduardo.mendoza@umich.mx
Rights & Permissions [Opens in a new window]

Abstract

The study of species association is of great interest in ecology due to its role in understanding key issues such as patterns of habitat use by animals, species coexistence, biotic interactions, and in general factors affecting community structure and assembly. There are many indices that ecologists commonly use, all based on the observed frequencies of organism occurrences, to evaluate the association between a pair of species. However, few of these indices correspond to proper statistical measures of association, and the inferential aspects of their analysis are often overlooked. In this paper, we propose a Bayesian approach based on a simple multinomial-Dirichlet structure to provide a comprehensive inferential framework for any set of association indices. Our approach provides a full statistical analysis for any association index of interest, free of special requirements on the sample size. We illustrate our procedure with a camera-trapping real-dataset, but the analysis of any other dataset of the same type can be readily produced using the R package basa that accompanies this paper.

Type
Method
Copyright
© The Author(s), 2024. Published by Cambridge University Press

Introduction

The question of whether two species tend to occur together, or finding one of them precludes the presence of the other, has occupied the attention of ecologists since the beginning of the past century (Dice Reference Dice1945). The study of species association patterns constitutes a central issue in ecology due to its role in understanding how species interact and how these interactions are influenced by changes in biotic and abiotic factors, ultimately shaping community characteristics (Lavender et al. Reference Lavender, Schamp, Arnott and Rusak2019). Because of this interest, a variety of approaches have been proposed to analyze the relevant information, each usually introducing a different numerical measure of association (Calvert Reference Calvert1922, Dice Reference Dice1945, Forbes Reference Forbes1907, Michael Reference Michael1920, Pielou Reference Pielou1967).

Hubálek (Reference Hubálek1982) reviewed 43 association coefficients and concluded that only 6 of them were admissible according to a set of properties that he determined. Legendre and Legendre (Reference Legendre and Legendre1998) reviewed several measures of association and proposed a classification of them (i.e., similarity, distance, and dependence), distinguishing between two types of input information: abundance and presence–absence of the species. Their list included several well-known statistical measures of dependence such as the Pearson, Kendall, and Spearman correlation coefficients. For each of these cases, they discussed the corresponding inferential procedure.

De Cáceres et al. (Reference de Cáceres, Font and Oliva2008) used two association measures – the Pearson correlation coefficient and the Ochiai index – as fidelity measures to identify diagnostic species of vegetation communities. These authors showed that, under certain conditions, the Ochiai index can be seen as an asymptotic approximation to the Pearson coefficient. Moreover, they noted that only for the Pearson coefficient there exists an associated test of independence (which in this context is known as a non-faithfulness test). They suggested that, instead of a test of hypothesis, a confidence interval could be used to evaluate the strength of the fidelity measured by an index. In their paper, approximate intervals were obtained using bootstrap techniques. In a related study, De Cáceres and Legendre (Reference de Cáceres and Legendre2009) dealt with the problem of assessing the association between species and groups of sites. They addressed the problem of selecting an appropriate association index and emphasized the importance of inferential analysis in this setting. To determine if some species are associated with site groups, they proposed a significance test for different indices (e.g., the Pearson coefficient and the Indicator Value index). Due to the lack of suitable distributional theory, the authors proposed the use of permutation tests; they recognized, however, that in some instances this approach can lead to poor results. As in the previous case, they relied on bootstrap techniques to produce confidence intervals.

In practice, even though association measures play an important role in modern ecological research, there do not seem to be clear guidelines as to which index is more appropriate for a given study. Furthermore, once an index is chosen, little attention is paid to the range of values it could take if computed from similar data, not to mention the plausible values of other indices for the same data. From a statistical point of view, an assessment of this uncertainty must be provided, and a common suggestion is to compute interval estimates. For traditional frequentist methods, it is often the case that the distributional theory – exact or asymptotic – necessary to obtain confidence intervals is not readily available. When comparing two indices, the use of joint confidence regions is even less common.

Hereafter, we will refer to the above-mentioned coefficients, measures, and indices simply as “indices” and will work only in the setting of presence–absence information. Also, we will use “association” instead of “joint occurrence,” “co-occurrence,” “coincidence,” or any other similar term. Regarding the analysis, we will make use of Bayesian statistical theory and show that a probability distribution can always be obtained that adequately describes the available information concerning a population association index. This so-called posterior distribution combines the observed data with a prior distribution that encapsulates the knowledge available before the study is conducted. The prior distribution could be used, for example, to suggest independence in a statistical sense or lack of association according to a specific index. The Bayesian analysis produces inferences – in particular, interval estimations – that do not require analytical approximations. Moreover, this approach allows us to deal simultaneously with several indices, and thus to gain insights into their joint behavior. Admittedly, in recent years, different options to analyze patterns of association of pairs of species have emerged, most of them associated with the increased suite of packages of the open-source program R. Some of them are mainly focused on providing new tools for assessing species association patterns such as the package cooccur (Griffith et al. Reference Griffith, Veech and Marsh2016), whereas others provide miscellaneous functions for analyzing species association and niche overlap, such as spaa (Zhang and Ma Reference Zhang and Ma2014). However, these packages rely on the use of frequentist statistical methods – with the limitations mentioned above – and do not always offer the possibility to calculate some of the most common indices. The purpose of this paper is to introduce a statistical procedure that can be used to obtain probability intervals for any association index. This procedure allows the researcher to produce other inferences regarding the index of interest, such as pointwise estimates and tests of hypotheses.

The paper is organized as follows. In the next section, we first discuss the structure of the data and describe a Bayesian approach to the analysis of the multinomial model. We also discuss the difference between ecological measures of association and statistical measures of stochastic dependence. We close the section by applying these ideas to produce inferences on association indices. In “Association indices,” we illustrate our method in the context of the analysis of a real dataset concerning camera-trap data of two species inhabiting the Montes Azules Biosphere Reserve in Southern Mexico. In “Discussion,” we discuss these results and introduce a complementary analysis. Finally, “Concluding remarks” contains some concluding remarks.

Methodology

Data structure

In a general setting, if a number r of species is considered and presence–absence observations are collected at c different sites, the information can be organized in a r x c table which is a particular instance of the ecological data matrix as discussed in Legendre and Legendre (Reference Legendre and Legendre1998, Chapter 7). Although some authors have addressed the study of association when more than two species are simultaneously considered (see Pielou Reference Pielou1972, for example), most of the association indices are designed to measure the association between two species. Hence, we will concentrate on the case $r\; = \;2$ , where only two species, ${E_1}$ and ${E_2}$ , are studied. In this case, if a sample of N observations (presence–absence) is available at a selected collection of sites, the data are usually summarized in a 2 x 2 array as shown in Table 1.

Table 1. Presence–absence of species E 1 and E 2.

In this table, the cells contain the frequencies of the only four possible cases, where: both species were observed ( ${n_{11}})$ ; species ${E_1}$ was observed but species ${E_2}$ was absent ( ${n_{10}}$ ); species ${E_1}$ was absent but the species ${E_2}$ was observed ( ${n_{01}}$ ); and finally, when neither of the species was observed ( ${n_{00}}$ ); here, ${n_{11}} + {n_{10}} + {n_{01}} + {n_{00}} = N$ . In the margins of the table are also reported: ${n_1}. = {\rm{\;}}{n_{11}} + {n_{10}}$ , the total number of cases where ${E_1}$ was present; ${n_0}. = {\rm{\;}}{n_{01}} + {n_{00}}$ , the total number of cases where ${E_1}$ was absent, and, in a similar fashion, $n{._{{\rm{\;}}1}}$ and $n{._{{\rm{\;}}0}}$ , the total number of cases where ${E_2}$ was present and absent, respectively. An equivalent summary is obtained if the observed frequencies are transformed into proportions, as shown in Table 2.

Table 2. Observed proportions. Presence–absence of species E 1 and E 2.

Here, we have ${p_{ij}} = {n_{ij}}/N$ for $i,j = 0,1;\;$ the observed proportions in the sample for the four possible cases. In any case, these sample data are used to describe the population they come from. For this purpose, statistical methods assume that $\left( {{n_{11}},{\rm{\;}}{n_{10}},{\rm{\;}}{n_{01}},{\rm{\;}}{n_{00}}} \right)$ is generated by a probability model. To this end, each observation is recorded as a vector of dimension $4$ . If the observation is allocated to the cell $\left( {i,\;j} \right)$ of Table 1, the associated vector $X$ has three entries equal to zero, and one entry equal to one where the only nonzero entry corresponds to the position $4 - 2i + j$ . Thus, the observed data are a collection of vectors ${X_1},\;.\;.\;.\;,\;{X_N}$ . These vectors are usually assumed to be randomly and independently sampled from the same population, so the statistic ${\boldsymbol X} = {{\Sigma}}_{k = 1}^N{X_k}$ can also be written as ${\boldsymbol X} = \left( {{n_{11}},{\rm{\;}}{n_{10}},{\rm{\;}}{n_{10}},{\rm{\;}}{n_{11}}} \right){\rm{\;}}$ and follows a multinomial distribution with parameters $\theta $ and $N$ , where $\theta = \left( {{\theta _{11}},\;{\theta _{10}},\;{\theta _{01}},\;{\theta _{00}}} \right)$ with ${\theta _{ij}} \gt 0$ , $\left( {i,j = 0,1} \right)$ and ${\theta _{11}} + {\theta _{10}} + {\theta _{01}} + {\theta _{11}} = 1$ .

Thus:

$$p\left( {{\boldsymbol x}{\rm{|}}\theta } \right){\rm{\;\;}} = {{{N!}}\over{{{\rm{\Pi }}_{i \,=\, 0}^1{\rm{\Pi }}_{j \,=\, 0}^1{\rm{\;}}{n_{ij}}!}}}{\rm{\;}} \times {\rm{\;\Pi }}_{i \,= \,0}^1{\rm{\Pi }}_{j \,=\, 0}^1\theta _{ij}^{{n_{ij}}}{\rm{\;\;\;}}$$

For every nonnegative integer ${n_{11}},\;{n_{10}},\;{n_{01}},\;{n_{00}}$ such that ${n_{11}} + {n_{10}} + {n_{01}} + {n_{11}} = N$ . In this model, ${\theta _{ij}}$ represents the probability that an observation be allocated to the $\left( {i,\;j} \right)$ -cell. Within this framework, the analysis aims to provide inferences about the population probabilities ${\theta _{11}},\;{\theta _{10}},\;{\theta _{01}},\;{\theta _{00}}$ using the observed frequencies in Table 1 or, equivalently, the observed proportions in Table 2.

Bayesian analysis for our multinomial model

Bayesian inference requires a prior distribution for $\theta $ , the only unknown parameter of the model. This prior distribution is then updated with the sample information to obtain the posterior distribution, given by $p(\theta \,|{\boldsymbol x}) \propto p\left( \theta \right)p({\boldsymbol x}\,|\,\theta )$ . In the specific case of our multinomial model, we have

$$p{\rm{(}}\;\theta \;{\rm{|}}\;{\boldsymbol x}) \propto p\left( \theta \right)\;\;{\rm{\Pi }}_{i = 0}^1{\rm{\Pi }}_{j = 0}^1\theta _{ij}^{{n_{ij}}}.$$

We can use any prior $p\left( \theta \right)$ in this expression if it describes the state of knowledge of the researcher before the sample information is available. A convenient class of models from which this prior can be chosen is the family of Dirichlet distributions,

$$p\left( \theta \right)\; \propto \;\;{\rm{\Pi }}_{i = 0}^1{\rm{\Pi }}_{j = 0}^1\;\theta _{ij}^{{\alpha _{ij}} - 1}\;,$$

where $\left\{ {{\alpha _{ij}} \gt 0;\;i,j = 0,\;1.} \right\}$ is a set of fixed constants. In this model, it can be verified that $E\left( {{\theta _{ij}}} \right) = {\alpha _{ij}}/\alpha $ and $Var\left( {{\theta _{ij}}} \right) =$ $\left[ {{\alpha _{ij}}\left( {1 - {\alpha _{ij}}} \right)} \right]/\left( {\alpha + 1} \right)$ , where $\alpha = {\rm{\Sigma }}_{i = 0}^1{\rm{\Sigma }}_{j = 0}^1{\alpha _{ij}}$ . Among other convenient features of the Dirichlet family of distributions, it is well known that it is conjugate to the multinomial sampling model. This means that, if the prior distribution is Dirichlet and the sample is distributed as multinomial, then the posterior distribution is also Dirichlet. Thus, in our case, we have the Dirichlet posterior:

$$p\left( {\;\theta \;|\,{\boldsymbol x}} \right){\rm{\;}} \propto {\rm{\;\;\Pi }}_{i = 0}^1{\rm{\Pi }}_{j = 0}^1{\rm{\;}}\theta _{ij}^{\left( {{\alpha _{ij}} + {n_{ij}}} \right) - 1}.$$

Of particular interest is the situation where the researcher has little prior information or is not willing or allowed to make inferences based on his/her prior beliefs. This can be addressed using a reference prior within the Dirichlet family. A common choice is to take α ij = 1/2 (see Box and Tiao Reference Box and Tiao1973, Section 1.3, and Berger and Bernardo Reference Berger and Bernardo1992). The corresponding posterior distribution is then:

$$p\left( {\;\theta \;|\;x\;} \right)\; \propto \;\;{\rm{\Pi }}_{i = 0}^1{\rm{\Pi }}_{j = 0}^1\;\theta _{ij}^{\left( {{n_{ij}} - 1/2} \right)},$$

and thus $E{\rm{(}}{\theta _{ij}}{\rm{|}}\;{\boldsymbol x}) = \left( {{n_{ij}} + 1/2} \right)/\left( {N + 2} \right)$ . This conjugate analysis in the general setting of contingency tables has been extensively discussed in the Bayesian literature. A recent review can be found in Gutiérrez-Peña and Mendoza (Reference Gutiérrez-Peña and Mendoza2017). Once the posterior distribution $p\left( {\theta \;|\;{\boldsymbol x}} \right)\;{\rm{is\;obtained}}$ ,inferences about $\theta $ describing some of its characteristics (location measures as point estimates or probability intervals, for example) can be produced. These features can be obtained analytically, as in the case of the mean and variance, or computed via numerical or simulation-based methods. For the interested reader, a brief account of the general Bayesian approach to inference is included in Section S.1 of the Supplementary Material.

Association indices

For the sake of simplicity, in this section, we will focus on the Ochiai and Pearson indices to illustrate our ideas, although it must be clear that the proposed analysis can be applied to any index. First, we will recall how the Ochiai and Pearson indices are calculated. For details, see Hubálek (Reference Hubálek1982).

  1. (i) Ochiai: $O = {n_{11}}/\sqrt {\left( {{n_{10}} + {n_{11}}} \right)\left( {{n_{01}} + {n_{11}}} \right)} $ ; $O \in \left[ {0,{\rm{\;}}1} \right].$

  2. (ii) Pearson: $R = ({n_{00}}{n_{11}} - {n_{10}}{n_{01}})/\sqrt {({n_0}.)({n_1}.)(n{._0})(n{._1})} $ ; $R \in \left[ { - 1,{\rm{\;}}1} \right]$ .

Each one of these indices provides a measure of the association between two species which can be used to assess the strength of such association in the observed sample. In the case of the Ochiai index, the closer the value of $O$ is to one, the stronger the association. In fact, $O$ reaches this maximum if and only if ${n_{10}} = {n_{01}} = 0$ (no discordant sites), whereas the minimum (zero) is attained when ${n_{11}} = 0$ if there is a positive number of discordant sites (otherwise it is not defined). The Pearson index also reaches its respective maximum value when there are no discordant sites. On the other hand, when the Ochiai index takes the value zero, and there are enough discordant sites, the Pearson coefficient can take a value close to $ - 1$ , suggesting a negative relationship. Conversely, if the Pearson coefficient is zero (suggesting no association), then the Ochiai index equals $\sqrt {{n_{11}}/N\;} $ which might be close to one. These results suggest that the assessment of the strength of the association of the species in the population, based on an observed value of a particular index in a specific sample, must be made with caution. The Supplementary Material (Section S.2.1) includes some numerical examples that illustrate these differences among indices.

In any case, it is easy to see that the indices considered in this section, as well as all others we reviewed in the literature, can be calculated in terms of the observed proportions instead of using the absolute frequencies. Take, as an example, the Ochiai index. According to the definition, $O = {n_{11}}/\sqrt {\left( {{n_{10}} + {n_{11}}} \right)\left( {{n_{01}} + {n_{11}}} \right)} $ . However, if we divide by $N$ both the numerator and denominator of this expression, we get $O = {p_{11}}/\sqrt {\left( {{p_{10}} + {p_{11}}} \right)\left( {{p_{01}} + {p_{11}}} \right)} $ .

The most noticeable consequence of this result stems from the fact that ${p_{ij}}$ is the observed proportion of cases in the $\left( {i,j} \right)$ cell in the sample and, thus, can be seen as an estimate of the true proportion of cases in that cell for the entire population. If we were able to fully observe the populations of each of these two species, we would know the true population proportions (probabilities) ${\theta _{11}},\;\;{\theta _{10}},\;\;{\theta _{01}}$ and ${\theta _{00}}$ . However, since the investigation is only conducted through a sample of size $N$ , instead of observing each ${\theta _{ij}}$ we obtain ${p_{ij}}$ such that ${p_{ij}} = {\hat \theta _{ij}}\;;\;i,j = 0,1.$ More importantly, in the same way, since every index $I$ can be calculated as $I = g\left( {{{\hat \theta }_{11}},\;{{\hat \theta }_{10}},\;{{\hat \theta }_{01}},\;{{\hat \theta }_{00}}} \right)$ , for some known function $g$ , it can be regarded as an estimate of the true value of the index in the population $\psi = g\left( {{\theta _{11}},\;{\theta _{10}},\;{\theta _{01}},\;{\theta _{00}}} \right)$ . If the purpose of the study is to investigate the pattern of association in the population, then the relevant question is: What can be said about $\psi $ , given the information in the sample? The usual answer is to produce the pointwise estimate $\hat \psi = I$ and then to say that $\psi $ has an approximate value equal to $I$ . However, at this point, when a frequentist approach to statistical inference is adopted, the problems described in the introduction appear. It is necessary to evaluate how far $\hat \psi $ is from $\psi $ and which other values of $\psi $ are compatible with the sample, in addition to $I$ itself. As discussed above, the procedures used to answer these questions usually make use of interval estimates for $\psi $ . The frequentist techniques rely on the sampling distribution of the estimate $I$ , which is only available in a few cases. So, for many indices, the uncertainty regarding the true strength of the association in the population cannot be properly assessed, although some approximations can be used, such as those based on bootstrap methods. As an alternative, in the following subsection, we introduce a Bayesian solution for this problem, whose calculations are straightforward and provide a general mechanism to produce interval estimates for $\psi $ and thus measure the uncertainty of interest.

Bayesian analysis of association indices

Consider an association index $\psi $ . As we discussed previously, $\psi = g\left( {{\theta _{11}},\;{\theta _{10}},\;{\theta _{01}},\;{\theta _{00}}} \right)$ is a known function of the (unknown) population parameter $\theta = \left( {{\theta _{11}},\;{\theta _{10}},\;{\theta _{01}},\;{\theta _{00}}} \right)$ . On the other hand, we have the sampling data $x = \left( {{n_{11}},\;{n_{10}},\;{n_{01}},\;{n_{00}}} \right)$ . To produce a Bayesian analysis of $\psi $ , the quantity of interest, we can proceed via a two-step procedure. First, we can get, as described in Section 2.2, the posterior distribution of $\theta $ . Then, in the second step, we can use the function $\psi = g\left( \theta \right)$ to get the posterior distribution $p{\rm{(\;}}\psi {\rm{\;|\;}}x{\rm{\;}}),$ via the transformation technique. Recall that if only vague prior information is available regarding $\theta $ , the recommendation is to use the reference prior:

$$p\left( {\;\theta \;} \right)\; \propto \;\;{\rm{\Pi }}_{i = 0}^1{\rm{\Pi }}_{j = 0}^1\;\theta _{ij}^{ - 1/2}\;.$$

This prior leads to the posterior:

$$p\left( {\theta \;|\;{\boldsymbol x}} \right)\; \propto \;\;{\rm{\Pi }}_{i = 0}^1{\rm{\Pi }}_{j = 0}^1\;\theta _{ij}^{\left( {{n_{ij}} - 1/2} \right)},$$

a Dirichlet distribution with parameter ${\alpha ^*} = ( {n_{11}} + {{1}\over{2}},\;{n_{10}} + {{1}\over{2}},$ ${n_{01}} + {{1}\over{2}},\;{n_{00}} + {{1}\over{2}} )$ . Before discussing the calculation of the posterior distribution of interest, $p\left( {\;\psi \;|\;{\boldsymbol x}} \right)$ , let us note some characteristics of this reference prior. It can be shown that $E({\theta _{ij}}) = {1 \over 4},\,\,E({\theta _{i.}}) = {1 \over 2},\,\,\,E(\theta {._j}) = {1 \over 2};\,\,i,j\, \in \{ 0,1\} $ so that $E({\theta _{ij}}) = \,E({\theta _{i.}}) \times \,E(\theta {._j})$ for every pair $\left( {i,j} \right)$ . Thus, a priori the expected value of $\theta $ satisfies the condition for stochastic independence, which can be regarded as noninformative since any other condition would imply some particular kind of association and, hence, some specific prior information. If we use this reference prior for $\theta $ , any posterior evidence against independence can be attributed solely to the information provided by the sample. In fact, examination of the corresponding prior distribution of $\psi $ may be useful to see the type of results that the index can produce under this scenario of independence. Going back to the posterior distribution of the index, since $\psi = g\left( \theta \right)$ we can describe the updated knowledge about the value of $\psi $ just by calculating $p{\rm{(}}\psi \;{\rm{|}}\;{\boldsymbol x})$ . Once we have this posterior distribution, and following De Cáceres et al. (Reference de Cáceres, Font and Oliva2008), we can produce a probability interval for $\psi $ . There are some indexes for which $p(\psi \;{\rm{|}}\;{\boldsymbol x})$ can be obtained from $p\left( {\theta {\rm{\;}}|{\rm{\;}}{\boldsymbol x}} \right)$ very easily using probability calculus. However, in some cases, the required analytic calculations may be cumbersome or may not be possible in closed form. Alternatively, we can use a Monte Carlo approach, simulating a sample $\left\{ {{\theta _k}} \right\}_{k = 1}^M$ from the Dirichlet model $p(\theta {\rm{\;|\;}}{\boldsymbol x})$ and then obtaining a sample from $p(\psi {\rm{\,|\,}}{\boldsymbol x})$ by applying the transformation directly to the simulated values of $({\psi _k} = g\left( {{\theta _k}} \right)$ , for each $k = 1, \ldots, {\rm{\;}}M)$ . For $M$ large enough, any characteristic of the distribution $p{\rm{(\;}}\psi {\rm{\;|\;}}{\boldsymbol x}{\rm{\;}})$ can be approximated with an arbitrary level of precision using the simulated sample $\left\{ {{\psi _k}} \right\}_{k = 1}^M$ . Simulation from $p{\rm{(}}\;\theta \;{\rm{|}}\;{\boldsymbol x}\;)$ is rather easy using ratios of Gamma random variables (see Kotz et al. Reference Kotz, Balakrishnan and Johnson2000, chapter 49, for example). Note that, when we simulate from this distribution, we will get a set of scenarios for the true population proportions that are compatible with the observed data. Moreover, the most probable scenarios, according to the data, will appear more frequently in the sequence $\left\{ {{\theta _k}} \right\}_{k = 1}^M$ . In a similar way, if we calculate ${\psi _k} = g\left( {{\theta _k}} \right)$ , for $k = 1, \ldots, \;M$ the set $\left\{ \psi{_k} \right\}_{k = 1}^M$ is a collection of values of the true population index that are compatible with the information in the sample $x$ and the most probable values will appear more frequently in this set. To produce interval estimates, if $M$ is large enough (say several thousand) and we find that $95\% $ of the simulated values lie, for example, in the interval $\left[ {0.78,\;0.84} \right]$ , we will be able to say that the true value of the population index $\psi $ lies in the interval $\left[ {0.78,\;0.84} \right]$ with probability $0.95$ . This sort of assertion clarifies the uncertainty that remains a posteriori about the value of $\psi $ . It must be clear that the simulation algorithm in our proposal is just a tool to calculate, exactly, the posterior distribution $p(\psi |{\boldsymbol x})$ . We are not using any bootstrap method; we are not resampling the observed data nor applying any form of a null model technique.

Our proposed Bayesian approach produces inferences that are, for all practical purposes, numerically exact for any sample size. These ideas will be illustrated in the following section. To conduct the corresponding analyses, we have developed the R package basa (Bayesian Analysis of Association Indices). This package automatically takes care of the simulation step and allows the user to calculate the posterior distribution and the relevant summaries for any association index, not just for those discussed in this paper.

Application to field data

As part of a project conducted in the Montes Azules biosphere reserve (Southern Mexico) in 2015, a camera-trap dataset recording the presence–absence of several species was generated. We will only be concerned with a specific pair of mammal species, namely the Central American agouti (Dasyprocta punctata) and the white-nosed coati (Nasua narica), for which we have a sample of $N = 40$ records (see Table 3). Also, we will focus on two association indices (Ochiai and Pearson); however, the analysis can be replicated for any other index. In all cases, we used a reference prior and a simulated sample of size $M\; = \;10,\;000$ from the posterior distribution of $\theta $ to obtain the results. For the sake of comparison, we searched for a computation resource to produce a frequentist counterpart of our results. In the case of the Ochiai and Pearson indexes, we only could find the R package spaa (Zhang and Ma Reference Zhang and Ma2014), which only produces pointwise estimates. The comparison is then based on these estimates.

Table 3. Dasyprocta punctata and Nasua narica camera trap data.

We compare the results obtained a priori with those obtained a posteriori. In this way, we can assess the impact of the data when we initially assume that there is no association.

  1. (i) Ochiai index.

The prior distribution for this index is shown in Fig. 1(a) (left panel). Apart from a mode at zero, the density is rather flat over the interval $\left( {0,\;1} \right)$ . The prior mean and median are given by $0.453$ and $0.440$ , respectively. A $0.95$ interval estimate is given by $\left( {0,\;0.944} \right)$ , showing that the Ochiai index can be practically anywhere even though there is a clear mode at zero. Once the $40$ observations are considered, the posterior distribution is that shown on the right panel of Fig. 1(a). The distribution is now bell-shaped, with a mean of $0.291$ and a median of $0.288$ . Any one of these two values could be used as a point-wise estimate of the true value of the index. If the usual frequentist based Ochiai index is computed with the observed frequencies, we get $0.298$ , a quite similar value. In comparison, with our method we can say that a posteriori, the true value of the index lies in the interval $\left( {0.100,\;0.498} \right)$ with probability $0.95$ , thus providing more information and suggesting a moderate level of association.

  1. (ii) Pearson index.

Figure 1. Prior and posterior distributions of the unknown population value of: (a) Ochiai index and (b) Pearson index. Dasyprocta punctata and Nasua narica data.

The prior and posterior distributions for the Pearson index are shown in Fig. 1(b). On the left-hand side, we have the prior distribution. It is symmetric around zero, with mode, mean, and median all equal to this value. The $0.95$ prior interval estimate is given by $\left( { - 0.903,\;0.906} \right)$ . On the other hand, the posterior distribution (right-hand side) is a more concentrated bell-shaped curve with a mean of $0.026$ and a median of $0.031$ , both quite like the value of the usual frequentist coefficient $\left( {0.038} \right)$ . Again, our method provides more information. We can say that a posteriori, the true correlation coefficient lies in the interval $\left( { - 0.263,\;0.285} \right)$ with a probability of $0.95$ . This suggests that the two species have a low level of association and, since the value zero belongs to this interval, it can be reasonable to say that the posterior distribution suggests that there is no association between these two species.

Discussion

Ecological indices are designed to measure association (or lack thereof) differently compared to statistical indices, whose purpose is to measure stochastic dependence. The results obtained for the Dasyprocta punctata and the Nasua narica data make this point very clear. The posterior distribution of the Pearson statistical index shows a mode close to zero and is rather concentrated around this value, thus suggesting stochastic independence. This is confirmed by the Bayesian test of independence mentioned at the end of Section S.2.2 of the Supplementary Material.

On the other hand, the prior-posterior analysis of the Ochiai index shows that the prior mode at zero has been replaced by a positive value in the posterior distribution, and, for example, the probability of the interval $\left( {0,\;0.1} \right)$ changes from $0.188$ a priori to only $0.012$ a posteriori. Thus, it can be said that global analysis of these two indices points to a moderate level of association.

The posterior distribution of an association index can also be used to answer other questions of interest. For example, if the idea that $\psi \in A$ for a given set $A$ has been entertained before the data are collected, the validity of the hypothesis ${H_0}:\;\psi \in A$ can be evaluated using its posterior probability, $P{\rm{(}}\;\psi \in A\;{\rm{|}}\;x\;)$ . Specifically, in the case of our application regarding the Ochiai index, let us denote the population version of O as ${\rm{\Omega }}$ . Then, if the hypothesis ${H_0}:\;{\rm{\Omega }} \ge 0.2$ was of interest, we could decide on its validity by considering that $P{\rm{(}}\;{\rm{\Omega }} \ge 0.2\;{\rm{|}}\;x\;) = 0.8.$

Furthermore, if we have two indices that are not functionally related and measure different kinds of association, it could be useful to analyze their joint posterior distribution. This type of analysis might shed some light on the complementary nature of the indices and is essentially unfeasible with the frequentist approach. This information can be obtained using the simulated sample $\left\{ {{\theta _k}} \right\}_{k = 1}^M$ from the posterior distribution $\theta $ . More specifically, if we have two indices, ${\psi _1}$ and ${\psi _2}$ , and for each ${\theta _k}$ in the sample we calculate $\left( {{\psi _1}\left( {{\theta _k}} \right),\;{\psi _2}\left( {{\theta _k}} \right)} \right)$ , then the set $\left\{ {\left( {{\psi _1}\left( {{\theta _k}} \right),\;{\psi _2}\left( {{\theta _k}} \right)} \right)} \right\}_{k = 1}^M$ is a random sample from the joint distribution of $\left( {{\psi _1},\;{\psi _2}} \right)$ . To illustrate this, Fig. 2 shows the joint posterior distribution of the population version of $\left( {R,\;O} \right)$ for the Dasyprocta punctata and Nasua narica data.

Figure 2. Joint posterior distribution of the unknown population values of the R (Pearson) and O (Ochiai) indices. Dasyprocta punctata and Nasua narica data.

This graph shows that when the Pearson index takes relatively high values (suggesting a positive association among species), the Ochiai behaves similarly. On the other hand, when the Pearson index takes low values (evidence of a negative association), the Ochiai index takes values near zero which suggests no association. Finally, if the Pearson index takes values very close to zero, suggesting no association, the Ochiai index can take values between 0.2 and 0.4, which may be interpreted as evidence of a moderate positive association. It is interesting to compare this joint posterior distribution with those obtained for the two pairs of tree species discussed in Section S.2.2 of the Supplementary Material. There we examine the posterior distribution of the Ochiai index and the Pearson index corresponding to the pairs Bur oak and Black oak (Fig. 3a) and Red oak and American elm (Fig. 3b), respectively. The most relevant issue here is that there is more uncertainty regarding the relationship between the Ochiai index and the Pearson index in the case of the two tree datasets compared with our dataset concerning mammal species. This result is not surprising if we note that the tree datasets involve smaller sample sizes (10 observations against 40 in our data set). In any case, the relevant lesson here is that the single value of an index cannot be properly evaluated and interpreted without regard to the corresponding uncertainty which, among other factors, is related to the sample size.

Figure 3. (a) Joint posterior distribution of the unknown population values of the R (Pearson) and O (Ochiai) indices. Bur oak and Black oak data; (b) joint posterior distribution of the unknown population values of the R (Pearson) and O (Ochiai) indices. Red oak and American elm data.

We can say that ecological association and statistical association are not synonymous. In fact, some ecological indices suggest a high degree of association between species when, in statistical terms, there is independence. Similarly, for some datasets, the degree of ecological association can be low, but the corresponding statistical measures can indicate strong dependence. This fact is extremely interesting from a statistical point of view. It is also important because some of these ecological indices are widely used. See Kalgaotra et al. (Reference Kalgarota, Sharda and Luse2020), for example, where the Ochiai index is used to define edges between the nodes of a network in the context of medical research. A careful analysis of each ecological index would allow us to fully understand what kind of patterns it can detect. In any case, and regardless of the type of association that it describes, an ecological association index calculated from a sample is intended to shed light on the relationship between the species under study and is thus part of an inferential process. We believe this study can help promote a better understanding of the statistical basis of species association indices, thus facilitating a better interpretation of their ecological implications.

Concluding remarks

We have shown that a Dirichlet prior distribution for the probabilities of a multinomial model provides a suitable framework to easily obtain an adequate description of the uncertainty about any of the association indices. Our proposal uses a reference prior that does not depend on the index under study, but informative priors can be easily accommodated. In addition, the method proposed here also allows the researcher to obtain the joint posterior distribution of any set of indices. This is relevant because examination of the distribution for a pair of indices can be useful to understand what the indices are measuring, both individually and together. For example, the joint posterior distribution allows us to compute conditional interval estimates for one index given a fixed value of another index. In any case, the researchers who wish to explore the advantages of our proposal can do so using the R package basa available from: http://lacb-inirena.mx/bayesian-analysis-of-species-associations/. A tutorial on the use of this package is included in the Supplementary Material (Section S.2.3).

Supplementary material

To view supplementary material for this article, please visit https://doi.org/10.1017/S0266467424000105

Availability of data and material

All data generated or analyzed during this study are included in this article.

Acknowledgments

The authors wish to thank the Sistema Nacional de Investigadores, México. The first author is also grateful to the Asociación Mexicana de Cultura, A.C.

Funding statement

This work was supported by Project IN106114–3 of the Programa de Apoyo a Proyectos de Investigación e Innovación Tecnológica (DGAPA-UNAM, México).

Competing interests

The authors declare no conflict of interest.

Code availability

The R package basa is available from https://lacb-inirena.mx/bayesian-analysis-of-species-associations/

References

Berger, JO and Bernardo, JM (1992) Ordered group reference priors, with applications to multinomial problems. Biometrika 79, 2537.CrossRefGoogle Scholar
Box, GEP and Tiao, GC (1973) Bayesian Inference in Statistical Analysis. Reading, MI: Addison-Wesley.Google Scholar
Calvert, PP (1922) Methods for expressing the associations of different species. Ecology 3, 163165.CrossRefGoogle Scholar
de Cáceres, M, Font, X and Oliva, F (2008) Assessing species diagnostic value in large data sets: a comparison between Phi-coefficient and Ochiai index. Journal of Vegetation Science 19, 779788.CrossRefGoogle Scholar
de Cáceres, M and Legendre, P (2009) Associations between species and groups of sites: indices and statistical inference. Ecology 90, 35663574.CrossRefGoogle ScholarPubMed
Dice, LR (1945) Measures of the amount of ecologic association between species. Ecology 26, 297302.CrossRefGoogle Scholar
Forbes, SA (1907) On the local distribution of certain Illinois fishes: an essay on Statistical Ecology. Bulletin of the Illinois State Laboratory of Natural History VII, 273351.CrossRefGoogle Scholar
Griffith, DM, Veech, JA and Marsh, CJ (2016) Cooccur: probabilistic species co-occurrence analysis in R. Journal of Statistical Software 69, 117. https://doi.org/10.18637/jss.v069.c02 CrossRefGoogle Scholar
Gutiérrez-Peña, E and Mendoza, M (2017) Bayesian methods for categorical data. In Wiley StatsRef: Statistics Reference Online. https://doi.org/10.1002/9781118445112.stat06134.pub2 Google Scholar
Hubálek, Z (1982) Coefficients of association and similarity, based on binary (presence-absence) data: an evaluation. Biological Reviews 57, 669689.CrossRefGoogle Scholar
Kalgarota, P, Sharda, R and Luse, A (2020) Which similarity measure to use in network analysis: impact of sample size on phi correlation and Ochiai index. International Journal of Information Management 55, 102229.Google Scholar
Kotz, S, Balakrishnan, N and Johnson, NL (2000) Continuous Multivariate Distributions. Vol 1. Models and Applications. New York: Wiley.CrossRefGoogle Scholar
Lavender, TM, Schamp, BS, Arnott, SE and Rusak, JA (2019) A comparative evaluation of five common pairwise tests of species association. Ecology 100, 18.CrossRefGoogle ScholarPubMed
Legendre, P and Legendre, L (1998) Numerical Ecology. Amsterdam: Elsevier.Google Scholar
Michael, EL (1920) Marine ecology and the coefficient of association: a plea in behalf of quantitative biology. Journal of Ecology 8, 5459.CrossRefGoogle Scholar
Pielou, EC (1967) The detection of different degrees of coexistence. Journal of Theoretical Biology 16, 427437.CrossRefGoogle ScholarPubMed
Pielou, EC (1972) 2 k Contingency tables in ecology. Journal of Theoretical Biology 34, 337352.CrossRefGoogle ScholarPubMed
Zhang, JL and Ma, KP (2014) spaa: an R package for computing species association and niche overlap. Research Progress of Biodiversity Conservation in China X, 165174 (in Chinese).Google Scholar
Figure 0

Table 1. Presence–absence of species E1 and E2.

Figure 1

Table 2. Observed proportions. Presence–absence of species E1 and E2.

Figure 2

Table 3. Dasyprocta punctata and Nasua narica camera trap data.

Figure 3

Figure 1. Prior and posterior distributions of the unknown population value of: (a) Ochiai index and (b) Pearson index. Dasyprocta punctata and Nasua narica data.

Figure 4

Figure 2. Joint posterior distribution of the unknown population values of the R (Pearson) and O (Ochiai) indices. Dasyprocta punctata and Nasua narica data.

Figure 5

Figure 3. (a) Joint posterior distribution of the unknown population values of the R (Pearson) and O (Ochiai) indices. Bur oak and Black oak data; (b) joint posterior distribution of the unknown population values of the R (Pearson) and O (Ochiai) indices. Red oak and American elm data.

Supplementary material: File

Mendoza et al. supplementary material

Mendoza et al. supplementary material
Download Mendoza et al. supplementary material(File)
File 1.2 MB