The most commonly used nucleotide substitution models for phylogenetic reconstruction belong to the general time reversible (GTR) family. The main reason for their popularity is that they are computationally convenient, owing to their major properties, which are:

  1. Reversibility (the process is identical forward and backwards in time)
  2. Homogeneity (the process is identical across an entire tree)
  3. Stationarity (the process converges on a stationary distribution)

The “process” here refers to substitution rates and base frequencies. The simplest GTR model is Jukes–Cantor (1969) or JC, where the substitution rates are all equal and base frequencies are all equal. Therefore for the JC model there is only a single rate “μ” of substitutions per time.

We can use these models to calculate the likelihood of a branch length given the nucleotides observed at either end. To do this we need an instantaneous rate matrix or Q matrix. For JC this is:

  A C G T
A -3/4μ 1/4μ 1/4μ 1/4μ
C 1/4μ -3/4μ 1/4μ 1/4μ
G 1/4μ 1/4μ -3/4μ 1/4μ
T 1/4μ 1/4μ 1/4μ -3/4μ

Notice that the diagonal elements are chosen to make the sum of each row and column zero. The probability of observing the nucleotides pairs in the matrix given a branch length of time t is P(t) = exp(Qt), that is the matrix exponential after multiplying each element of the Q matrix by t. The probability of observing

For Jukes–Cantor, after matrix exponentiation, the diagonal likelihoods (the probability that a sequence site is identical at both ends given the branch length μt) of the new matrix are always:

1/4 + 3/4 × exp(-4μt)

The off-diagonal likelihoods (the probability of some mismatch given the branch length μt) are always:

1/4 - 1/4 × exp(-4μt)

These formulae are derived from Tuffley and Steel (1997).

The likelihood for a pairwise sequence of many sites is simply the product of the above across all sites. We can also solve this to get the maximum likelihood branch length d from the p-distance between two sequences:

d = -3/4 × log(1 - 4/3p)

Consider what happens as p goes to 0.75, why it cannot exceed 0.75, and why if JC is the true model it should never exceed 0.75 by much, even for extremely long branches!

One common GTR family model besides JC is HKY, where the base frequencies are unequal, and the rate of transitions can be different from the rate of transversions. HKY is parameterized by the base frequency parameters and “κ”, where κ is double the ratio of transitions to transversions. When κ = 1, the next mutation being a transition is half as likely as the next mutation being a transversion, and therefore all possible substitutions will have equal rates.

The GTR model itself has unequal base frequencies and six substitution rates. There are only six rates as the forward rate (e.g. A to T) is identical to the reverse rate (e.g. T to A). As mentioned before, this reversibility makes GTR and nested models (like HKY and JC) computationally relatively easy.