GTR and nested models

2 minute read

The most commonly used nucleotide substitution models for phylogenetic reconstruction belong to the general time reversible (GTR) family. The main reason for their popularity is that they are computationally convenient, owing to their major properties, which are:

  1. Reversibility (the process is identical forward and backwards in time)
  2. Homogeneity (the process is identical across an entire tree)
  3. Stationarity (the process converges on a stationary distribution)

The “process” here refers to the model of substitutions along branches of a tree, which for GTR models are specified by a set of substitution rates and base frequencies. The simplest GTR model is Jukes–Cantor (1969) or JC, where the substitution rates are all equal and base frequencies are all equal. Therefore for the JC model there is only a single rate \(\mu\) of substitutions per site per time.

We can use these models to calculate the likelihood of a branch length given the nucleotides observed at either end. To do this we need an instantaneous rate matrix or \(Q\) matrix. For JC this is:

  A C G T
A \(-3/4\mu\) \(1/4\mu\) \(1/4\mu\) \(1/4\mu\)
C \(1/4\mu\) \(-3/4\mu\) \(1/4\mu\) \(1/4\mu\)
G \(1/4\mu\) \(1/4\mu\) \(-3/4\mu\) \(1/4\mu\)
T \(1/4\mu\) \(1/4\mu\) \(1/4\mu\) \(-3/4\mu\)

Notice that the diagonal elements are chosen to make each row sum to zero. The probability of observing the nucleotide pairs in the matrix given a branch length of time \(t\) is \(P(t) = \exp\left(Qt\right)\) that is the matrix exponential after multiplying each element of the \(Q\) matrix by \(t\).

For Jukes–Cantor, after matrix exponentiation, the diagonal likelihoods (the probabilities that the state at the end of a branch will be the same as at the beginning, given the branch length \(\mu t\)) of the new matrix are always:

\[\frac{1}{4}\left(1 + 3e^{-\frac{4}{3}\mu t}\right)\]

The off-diagonal likelihoods (the probabilities that the state at the end of a branch will be different from the beginning, given the branch length \(\mu t\)) are always:

\[\frac{1}{4}\left(1 - e^{-\frac{4}{3}\mu t}\right)\]

These formulae are derived from equation 6.8 of The Phylogenetic Handbook.

The branch length likelihood for a pairwise sequence of many sites is proportional to the product of the above across all sites. We can also solve this to get the maximum likelihood branch length \(d = \mu t\) from the \(p\) distance between two sequences (derived from equation 6.10 of the Phylogenetic Handbook):

\[d = -\frac{3}{4} \log\left(1 - \frac{4}{3} p\right)\]

Consider what happens as \(p\) goes to 0.75, why it cannot exceed 0.75, and why if JC is the true model it should never exceed 0.75 by much, even for extremely long branches!

One common GTR family model besides JC is HKY, where the base frequencies are unequal, and the rate of transitions can be different from the rate of transversions. HKY is parameterized by the base frequency parameters and \(\kappa\), where \(\kappa\) is double the ratio of transitions to transversions. When \(\kappa = 1\), the next mutation being a transition is half as likely as the next mutation being a transversion, and therefore all possible substitutions will have equal rates.

The GTR model itself has unequal base frequencies and six substitution rates. There are only six rates as the forward rate (e.g. A to T) is identical to the reverse rate (e.g. T to A). As mentioned before, this reversibility makes GTR and nested models (like HKY and JC) computationally relatively easy.

Updated December 5, 2019 to use formulae from Schmidt and von Haeseler