In this vignette, we explain how one can estimate the marginal likelihood using the LoRaD method (Wang et al. 2023).

# Data

Two DNA sequences, each of length 200 sites, were simulated under the K80 substitution model (Kimura 1980). This model has two parameters:

• transition/transversion rate ratio, $$\kappa$$: the ratio of the instantaneous rate of transition-type substitutions (A $$\leftrightarrow$$G, C$$\leftrightarrow$$T) to the instantaneous rate of transversion-type substitutions (A$$\leftrightarrow$$C, A$$\leftrightarrow$$T, G$$\leftrightarrow$$C, G$$\leftrightarrow$$T).

• edge length, $$v$$ : the evolutionary distance between the two sequences measured in expected number of substitutions per site.

# Model

A continuous-time, 4-state Markov model was used for both simulation and analysis. The instantaneous rate matrix $${\bf Q}$$ that defines the K80 substitution model is

\begin{align} {\bf Q} &= \left[ \begin{array}{cccc} -\beta(\kappa + 2) & \beta & \kappa \beta & \beta \\ \beta & -\beta(\kappa + 2) & \beta & \kappa \beta \\ \kappa \beta & \beta & -\beta(\kappa + 2) & \beta \\ \beta & \kappa \beta & \beta & -\beta(\kappa + 2) \\ \end{array} \right], \end{align}

where the order of (e.g. DNA) states for both rows and columns is A, C, G, T. The matrix of transition probabilities can be obtained by exponentiating the product of $${\bf Q}$$ and time $$t$$:

\begin{align} {\bf P} &= e^{{\bf Q}t}. \end{align}

The product of the base rate $$\beta$$ and time $$t$$ can be determined from the edge length parameter $$v$$ and transition/transversion rate ratio parameter $$\kappa$$. The edge length is the product of the overall substitution rate ($$0.25 (8 \beta + 4 \kappa \beta)$$) and time ($$t$$), yielding

\begin{align} v &= 2 \beta t + \kappa \beta t \\ \beta t &= \frac{v}{2 + \kappa} \end{align}

The transition probabilities may be used to simulate data for one of the two sequences given the other sequence. The state for each site in the starting sequence is drawn from the stationary distribution $$\pi_A=\pi_C=\pi_G=\pi_T=0.25$$ implied by $${\bf Q}$$.

Given simulated sequence data $${\bf D}$$, the joint posterior distribution $$p(v, \kappa|{\bf D})$$ was sampled using MCMC, with likelihoods computed using the K80 model and using Exponential priors with mean equal to 50 for both $$v$$ and $$\kappa$$. The posterior sample (10,000 points sampled 100 iterations apart) after a burn-in of 1000 iterations was saved tab-delimited text file k80-samples.txt. In the file there are four columns, iter (MCMC-iteration), log-kernel (log posterior kernel), edgelen (sampled edge length), and kappa (sampled kappa parameter).

In this example we will compute an estimate of the marginal likelihood under this model using the LoRaD method.

library(lorad)

Read in the file containing the posterior sample.

# Dimensions of k80samples
dim(k80samples)
#> [1] 10000     4

# Column names of k80samples
colnames(k80samples)
#> [1] "iter"       "log.kernel" "edgelen"    "kappa"

# Column Specification

Next, we must create a named vector to associate each column name in k80samples with a column specification. Here are the possible column specifications:

• iteration: The MCMC iteration
• posterior: The values in this column are components of the posterior (log scale)
• nonnegative: The parameter has non-negative support $$[0,\infty)$$
• unconstrained: The parameter has support $$(-\infty,\infty)$$
• ignore: The column should be ignored

All columns labeled posterior will be summed to create the log posterior kernel (sum of log likelihood and log joint prior).

Here are the column specifications appropriate for this example.

Column Specification
iter iteration
log.kernel posterior
edgelen positive
kappa positive
# Create a named vector holding the column specifications
colspec <- c("iter"="iteration", "log.kernel"="posterior", "edgelen"="positive", "kappa"="positive")

The LoRaD requires all parameters to be unconstrained in their support. Thus, the positive column specification for both edgelen and kappa results in a log transformation prior to application of the LoRaD method.

# Computing an Estimate of the Marginal Likelihood

To run the LoRaD function we need to specify the training fraction, the training sample selection method, and the coverage fraction. The training fraction and the coverage fraction must be between 0 and 1. The training sample selection method can be left (the first part of the sample), right (the second part of the sample), or random (a random part of the sample).

For this example we used 0.5 for the training fraction, 0.1 for the coverage fraction, and “left” for the training sample selection method.

results <- lorad_estimate(k80samples, colspec, 0.5, "left", 0.1)
#>     Parameter sample comprises 10000 sampled points
#>     Each sample point is a vector of 2 parameter values
#>
#>    Training fraction is 0.5
#>     Coverage specified is 0.1
#>
#> Partitioning samples into training and estimation:
#>     Sample size is 10000
#>     Training sample size is 5000
#>     Estimation sample size 5000
#>
#> Processing training sample...
#>     Lowest radial distance is 0.464997772
#>     Log Delta -2.278161285
#>
#> Processing estimation sample...
#>     Number of samples used is 510
#>     Nominal coverage specified is 0.100000
#>     Realized coverage is 0.102000
#>     Log marginal likelihood is -460.822385
#> 

For comparison, the two log marginal likelihood estimates reported by Wang et al. (2023) were -460.82239 (LoRaD method) and -460.86154 (Steppingstone method).

## Literature Cited

Kimura, Motoo. 1980. “A Simple Method for Estimating Evolutionary Rates of Base Substitutions Through Comparative Studies of Nucleotide Sequences.” Journal of Molecular Evolution 16 (2): 111–20. https://doi.org/10.1007/BF01731581.
Wang, Yu-Bo, Analisa Milkey, Aolan Li, Ming-Hui Chen, Lynn Kuo, and Paul O. Lewis. 2023. “LoRaD: Marginal Likelihood Estimation with Haste (but No Waste).” Systematic Biology 72 (3): 639–48. https://doi.org/10.1093/sysbio/syad007.