# GTR and nested models

The most commonly used nucleotide substitution models for phylogenetic reconstruction belong to the general time reversible (GTR) family. The main reason for their popularity is that they are computationally convenient, owing to their major properties, which are:

- Reversibility (the process is identical forward and backwards in time)
- Homogeneity (the process is identical across an entire tree)
- Stationarity (the process converges on a stationary distribution)

The “process” here refers to substitution rates and base frequencies. The simplest GTR model is Jukes–Cantor (1969) or JC, where the substitution rates are all equal and base frequencies are all equal. Therefore for the JC model there is only a single rate “μ” of substitutions per time.

We can use these models to calculate the likelihood of a branch length given the nucleotides observed at either end. To do this we need an instantaneous rate matrix or Q matrix. For JC this is:

A | C | G | T | |
---|---|---|---|---|

A | -3/4μ | 1/4μ | 1/4μ | 1/4μ |

C | 1/4μ | -3/4μ | 1/4μ | 1/4μ |

G | 1/4μ | 1/4μ | -3/4μ | 1/4μ |

T | 1/4μ | 1/4μ | 1/4μ | -3/4μ |

Notice that the diagonal elements are chosen to make the sum of each row and
column zero. The probability of observing the nucleotides pairs in the matrix
given a branch length of time *t* is P(*t*) = exp(Q*t*), that is the matrix
exponential after multiplying each element of the Q matrix by *t*. The
probability of observing

For Jukes–Cantor, after matrix exponentiation, the diagonal likelihoods (the
probability that a sequence site is identical at
both ends given the branch length μ*t*) of the new matrix are always:

1/4 + 3/4 × exp(-4μ*t*)

The off-diagonal likelihoods (the probability of some mismatch given the
branch length μ*t*) are always:

1/4 - 1/4 × exp(-4μ*t*)

These formulae are derived from Tuffley and Steel (1997).

The likelihood for a pairwise sequence of many sites is simply the product of
the above across all sites. We can also solve this to get the maximum
likelihood branch length *d* from the *p*-distance between two sequences:

*d* = -3/4 × log(1 - 4/3*p*)

Consider what happens as *p* goes to 0.75, why it cannot exceed 0.75, and why
if JC is the true model it should never exceed 0.75 by much, even for
extremely long branches!

One common GTR family model besides JC is HKY, where the base frequencies are unequal, and the rate of transitions can be different from the rate of transversions. HKY is parameterized by the base frequency parameters and “κ”, where κ is double the ratio of transitions to transversions. When κ = 1, the next mutation being a transition is half as likely as the next mutation being a transversion, and therefore all possible substitutions will have equal rates.

The GTR model itself has unequal base frequencies and six substitution rates. There are only six rates as the forward rate (e.g. A to T) is identical to the reverse rate (e.g. T to A). As mentioned before, this reversibility makes GTR and nested models (like HKY and JC) computationally relatively easy.