GTR and nested models
The most commonly used nucleotide substitution models for phylogenetic reconstruction belong to the general time reversible (GTR) family. The main reason for their popularity is that they are computationally convenient, owing to their major properties, which are:
- Reversibility (the process is identical forwards and backwards in time)
- Homogeneity (the process is identical across an entire tree)
- Stationarity (the base frequencies sit at the process's stationary distribution, so they do not change over time)
The “process” here refers to the model of substitutions along branches of a
tree, which for GTR models is specified by a set of substitution rates and
base frequencies. The simplest GTR model is Jukes–Cantor (1969), or JC, where
the substitution rates are all equal and the base frequencies are all equal.
The JC model therefore has only a single rate parameter, which we will call $\mu$.
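We can use these models to calculate the likelihood of a branch length given
the nucleotides observed at either end. To do this we need an instantaneous
rate matrix, or $Q$ matrix. For JC with a single rate $\mu$, and using the
convention that every off-diagonal rate equals $\mu/4$, it is:

| | A | C | G | T |
|---|---|---|---|---|
| A | $-\frac{3}{4}\mu$ | $\frac{1}{4}\mu$ | $\frac{1}{4}\mu$ | $\frac{1}{4}\mu$ |
| C | $\frac{1}{4}\mu$ | $-\frac{3}{4}\mu$ | $\frac{1}{4}\mu$ | $\frac{1}{4}\mu$ |
| G | $\frac{1}{4}\mu$ | $\frac{1}{4}\mu$ | $-\frac{3}{4}\mu$ | $\frac{1}{4}\mu$ |
| T | $\frac{1}{4}\mu$ | $\frac{1}{4}\mu$ | $\frac{1}{4}\mu$ | $-\frac{3}{4}\mu$ |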
Notice that the diagonal elements are chosen to make each row sum to zero.
The probability of observing each nucleotide pair in the matrix given a
branch length of time $t$ is the corresponding element of $P(t) = e^{Qt}$,
the matrix exponential of $Qt$.
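The matrix exponential can be checked numerically. The following is a minimal sketch in plain Python, assuming the JC rate matrix in which every off-diagonal rate equals `mu / 4`; the function names, the Taylor-series approach and the values of `mu` and `t` are illustrative choices, not part of the original notes.

```python
import math

def jc_rate_matrix(mu):
    """JC rate matrix, assuming every off-diagonal rate equals mu / 4."""
    off = mu / 4.0
    # Diagonal entries are minus the sum of the three off-diagonal entries.
    return [[-3.0 * off if i == j else off for j in range(4)] for i in range(4)]

def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(q, t, terms=40):
    """Approximate exp(Q * t) with the first `terms` terms of its Taylor series."""
    n = len(q)
    qt = [[q[i][j] * t for j in range(n)] for i in range(n)]
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    result = [row[:] for row in identity]
    term = [row[:] for row in identity]
    for k in range(1, terms):
        # After this update, term holds (Q t)^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, qt)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

Q = jc_rate_matrix(mu=1.0)   # hypothetical rate
P = mat_exp(Q, t=0.5)        # hypothetical branch length

# Rows of Q sum to zero; rows of P(t) are probability distributions.
for row in Q:
    assert math.isclose(sum(row), 0.0, abs_tol=1e-12)
for row in P:
    assert math.isclose(sum(row), 1.0, abs_tol=1e-12)
```

A truncated Taylor series is adequate for a 4×4 matrix and a short branch, but it is only a teaching device. For real analyses a library routine such as `scipy.linalg.expm`, or an eigendecomposition of $Q$, would be the more usual choice.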
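For Jukes–Cantor with a single rate $\mu$ (every off-diagonal entry of the
rate matrix equal to $\mu/4$), after matrix exponentiation, the diagonal
likelihoods (the probabilities that the state at the end of a branch will be
the same as at the beginning, given the branch length $t$) are:

$$P_{xx}(t) = \frac{1}{4} + \frac{3}{4} e^{-\mu t}$$

The off-diagonal likelihoods (the probabilities that the state at the end of a
branch will be different from the beginning, given the branch length $t$) are:

$$P_{xy}(t) = \frac{1}{4} - \frac{1}{4} e^{-\mu t}$$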
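To see how these two probabilities behave, here is a small sketch that evaluates the JC closed forms, again assuming the parameterization in which every off-diagonal rate is `mu / 4`. The function names and the chosen branch lengths are hypothetical.

```python
import math

def jc_same(mu, t):
    """Probability that a site is in the same state after time t (assumed mu/4 convention)."""
    return 0.25 + 0.75 * math.exp(-mu * t)

def jc_diff(mu, t):
    """Probability that a site is in one particular different state after time t."""
    return 0.25 - 0.25 * math.exp(-mu * t)

for t in (0.0, 0.1, 1.0, 10.0, 100.0):
    same, diff = jc_same(1.0, t), jc_diff(1.0, t)
    # Each row of P(t) holds one "same" entry and three "different" entries.
    assert math.isclose(same + 3 * diff, 1.0)
    print(f"t={t:6.1f}  P_same={same:.4f}  P_diff={diff:.4f}")
```

At `t = 0` no change is possible, and for very long branches both probabilities approach 1/4, the equal base frequency of the JC model.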
These formulae are derived from equation 6.8 of The Phylogenetic Handbook.
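The branch length likelihood for a pair of aligned sequences of many sites is
proportional to the product of the above across all sites. We can also solve
this to get the maximum likelihood branch length. For JC with every
off-diagonal rate equal to $\mu/4$, and with $p$ the proportion of sites at
which the two sequences differ, it is:

$$\hat{t} = -\frac{1}{\mu} \ln\left(1 - \frac{4}{3} p\right)$$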
Consider what happens as
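The pairwise likelihood and its maximum can be sketched as follows. This assumes the JC parameterization with off-diagonal rates of `mu / 4`, equal-length aligned sequences with no gaps, and a starting-state frequency of 1/4 at every site. The sequences, function names and grid search are made up for illustration.

```python
import math

def jc_log_likelihood(seq_a, seq_b, t, mu=1.0):
    """Log-likelihood of branch length t for two aligned sequences under JC."""
    decay = math.exp(-mu * t)
    p_same = 0.25 + 0.75 * decay
    p_diff = 0.25 - 0.25 * decay
    log_lik = 0.0
    for x, y in zip(seq_a, seq_b):
        # 0.25 is the equilibrium frequency of the nucleotide at the start of the branch.
        log_lik += math.log(0.25) + math.log(p_same if x == y else p_diff)
    return log_lik

def jc_ml_branch_length(seq_a, seq_b, mu=1.0):
    """Closed-form maximum likelihood branch length under JC."""
    n = len(seq_a)
    k = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    p = k / n  # proportion of differing sites
    if p >= 0.75:
        # The logarithm is undefined here; the likelihood keeps rising with t.
        return math.inf
    return -math.log(1.0 - 4.0 * p / 3.0) / mu

# Hypothetical alignment with one difference in ten sites.
seq_a = "ACGTACGTAC"
seq_b = "ACGTACGAAC"

t_hat = jc_ml_branch_length(seq_a, seq_b)

# Cross-check with a brute-force grid search over t (t = 0 is skipped
# because the likelihood is zero there when any site differs).
grid = [i / 1000.0 for i in range(1, 2001)]
best_t = max(grid, key=lambda t: jc_log_likelihood(seq_a, seq_b, t))
print(t_hat, best_t)
```

The grid search is only there as a sanity check on the closed form; the two values should agree to within the grid spacing.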
One common GTR family model besides JC is HKY, where the base frequencies can
be unequal, and the rate of transitions can be different from the rate of
transversions. HKY is parameterized by the base frequency parameters and
the transition/transversion rate ratio, usually written $\kappa$.
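As an illustration, an unnormalized HKY rate matrix can be built from the base frequencies and a transition/transversion ratio. This is a sketch under the common convention that the rate from `x` to `y` is `kappa * pi[y]` for transitions and `pi[y]` for transversions; the function names, the frequencies and the value of `kappa` are hypothetical.

```python
import math

NUCLEOTIDES = "ACGT"
PURINES = {"A", "G"}

def is_transition(x, y):
    """True for purine<->purine or pyrimidine<->pyrimidine changes (x != y assumed)."""
    return (x in PURINES) == (y in PURINES)

def hky_rate_matrix(pi, kappa):
    """Unnormalized HKY rate matrix; pi is ordered A, C, G, T."""
    q = []
    for i, x in enumerate(NUCLEOTIDES):
        row = [0.0] * 4
        for j, y in enumerate(NUCLEOTIDES):
            if i == j:
                continue
            scale = kappa if is_transition(x, y) else 1.0
            row[j] = scale * pi[j]
        # The diagonal makes the row sum to zero.
        row[i] = -sum(row)
        q.append(row)
    return q

pi = [0.3, 0.2, 0.3, 0.2]   # hypothetical base frequencies
Q = hky_rate_matrix(pi, kappa=2.0)

for i in range(4):
    assert math.isclose(sum(Q[i]), 0.0, abs_tol=1e-12)
    for j in range(4):
        # Reversibility (detailed balance): pi_i * q_ij == pi_j * q_ji
        assert math.isclose(pi[i] * Q[i][j], pi[j] * Q[j][i])
```

Setting `kappa=1.0` with all four frequencies equal to 0.25 gives equal off-diagonal rates, which is the JC model up to a scaling of time.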
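The GTR model itself has unequal base frequencies and six substitution rate parameters. There are only six, rather than twelve, because each pair of nucleotides shares a single exchangeability parameter in both directions (e.g. A to T and T to A); the instantaneous rate in each direction is that parameter multiplied by the frequency of the target base. As mentioned before, this reversibility makes GTR and its nested models (like HKY and JC) relatively easy to compute with.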
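The nesting can be made concrete with a short sketch of a GTR rate matrix. It assumes the usual construction in which the rate from `x` to `y` is a symmetric exchangeability for the pair multiplied by the frequency of `y`; the dictionary layout, function name and parameter values are illustrative assumptions.

```python
import math
from itertools import combinations

NUCLEOTIDES = "ACGT"

def gtr_rate_matrix(pi, exchange):
    """Unnormalized GTR rate matrix.

    pi is ordered A, C, G, T; exchange maps the six unordered pairs
    ("AC", "AG", "AT", "CG", "CT", "GT") to exchangeability parameters.
    """
    q = [[0.0] * 4 for _ in range(4)]
    for i, j in combinations(range(4), 2):
        r = exchange[NUCLEOTIDES[i] + NUCLEOTIDES[j]]
        q[i][j] = r * pi[j]   # forward direction, scaled by target frequency
        q[j][i] = r * pi[i]   # reverse direction shares the same r
    for i in range(4):
        # Diagonal is still zero here, so this is minus the off-diagonal sum.
        q[i][i] = -sum(q[i])
    return q

pi = [0.1, 0.2, 0.3, 0.4]   # hypothetical base frequencies
exchange = {"AC": 1.0, "AG": 4.0, "AT": 0.5, "CG": 0.8, "CT": 5.0, "GT": 1.2}
Q = gtr_rate_matrix(pi, exchange)

for i in range(4):
    for j in range(4):
        # Reversibility: pi_i * q_ij == pi_j * q_ji
        assert math.isclose(pi[i] * Q[i][j], pi[j] * Q[j][i])
for j in range(4):
    # Stationarity: pi is unchanged by the process (pi Q = 0).
    assert math.isclose(sum(pi[i] * Q[i][j] for i in range(4)), 0.0, abs_tol=1e-12)

# Nested models: HKY ties the two transition pairs (AG, CT) to kappa and the
# four transversion pairs to 1; JC additionally sets kappa = 1 and equal pi.
hky_exchange = {"AC": 1.0, "AG": 2.0, "AT": 1.0, "CG": 1.0, "CT": 2.0, "GT": 1.0}
Q_hky = gtr_rate_matrix(pi, hky_exchange)
Q_jc = gtr_rate_matrix([0.25] * 4, {pair: 1.0 for pair in exchange})
```

The forward and reverse entries of `Q` differ whenever the two base frequencies differ, even though they share one exchangeability; it is the products `pi[i] * Q[i][j]` that are symmetric.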
Updated December 5, 2019 to use formulae from Schmidt and von Haeseler