The Viterbi algorithm identifies a single path of hidden Markov model (HMM) states. This is the path which maximizes the joint probability of the observed data (e.g. a nucleotide or amino acid sequence) and the hidden states, given the HMM (including transition and emission frequencies).
This path may be almost certainly correct, or it may be just one of many similarly plausible paths. Put quantitatively, the Viterbi result might have a 99.9% probability of being the true¹ path, or a 0.1% probability. It would be useful to know this probability in order to understand when results obtained from the Viterbi algorithm are reliable.
Recall that the joint probability is P(π,x|M) where (in the context of biological sequence analysis) x is the biological sequence, π is the state path, and M is the HMM. Following the chain rule, this is equivalent to P(x|π,M) × P(π|M). The maximum joint probability returned by the Viterbi algorithm is therefore the product of the likelihood of the sequence given the state path and the prior probability of the state path! Previously I have described the product of the likelihood and prior as the unnormalized posterior probability. The parameter values which maximize that product, and therefore the state path returned by the Viterbi algorithm, are often known as the “maximum a posteriori” (MAP) solution.
The posterior probability is obtained by dividing the unnormalized posterior probability (which can be obtained using the Viterbi algorithm) by the marginal likelihood. The marginal likelihood can be calculated using the forward algorithm.
The intermediate probabilities calculated using the Viterbi algorithm are the probabilities of a state path π and a biological sequence x up to some step i: P(π1..i,x1..i|M). The intermediate probabilities calculated using the forward algorithm are similar but marginalize over the state path up to step i: P(πi,x1..i|M). Put another way, the probability is for the state at position i integrated over the path followed to that state.
This marginalization is achieved by summing over the choices made at each step. When calculating probabilities, summing typically achieves the result of X or Y (e.g., a high OR low path), whereas a product typically achieves the result of X and Y (e.g. a high then low path).
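As a minimal numerical illustration (the probabilities below are made up, not taken from the HMM example), marginalizing over alternatives means summing, while following one multi-step path means multiplying:

```python
# Two hypothetical path probabilities, for illustration only
p_high = 0.02  # probability of a "high" path
p_low = 0.03   # probability of a "low" path

# "high OR low": marginalize over the two alternatives by summing
p_either = p_high + p_low   # 0.05

# "high THEN low": a single path making both choices, so multiply
p_both = p_high * p_low     # 0.0006
```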
Just as for the Viterbi algorithm, it is sensible to work in log space to avoid numerical underflows and loss of precision. As an alternative to working in log space while still avoiding those errors, the marginal probabilities of all states at any position along the sequence can be rescaled (see section 3.6 of Biological Sequence Analysis). However if those marginal probabilities are very different in magnitude, there can still be numerical errors even with rescaling, so from here on we will work in log space.
As an example we will use the forward algorithm to calculate the marginal likelihood of the sequence GGCACTGAA, using the HMM from the Viterbi example. Initialize an empty forward matrix with the start state probabilities filled in.
To calculate the marginal probability of the first state being CpG, we need to sum over the probabilities of the CpG state at the current position i (and the sequence up to and including i) given that the previous state was start, CpG, or non-CpG. The respective log probabilities are -1.7, -∞, and -∞. The log sum of probabilities is therefore log(exp(-1.7) + exp(-∞) + exp(-∞)) = log(exp(-1.7) + 0 + 0) = -1.7.
A similar calculation for the non-CpG state results in -2.7. Notice that the result for the first position is the same as for the Viterbi algorithm; this makes sense, as there is only one previous state path to marginalize over, consisting of the start state at the zeroth position. Because of the marginalization, there are no pointer arrows to add to the forward matrix:
The second position is more interesting, as we have to marginalize over multiple non-zero probabilities of the state at the previous position. For the CpG state at the second position, the log probabilities are:
- k1 = 0: -∞ + -0.7 + -1.0 = -∞
- k1 = 1: -1.7 + -0.5 + -1.0 = -3.2
- k1 = 2: -2.7 + -1.0 + -1.0 = -4.7
The log sum of probabilities (to one decimal place) is therefore log(0 + exp(-3.2) + exp(-4.7)) = -3.0. This calculation can be performed in one step using the Python function scipy.special.logsumexp; to use it, scipy must be installed and the scipy.special subpackage imported. In fact, the calculation should be performed in one step to avoid the aforementioned numerical errors.
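That one-step computation can be sketched with a hand-rolled log-sum-exp; scipy.special.logsumexp uses the same max-shifting trick, so this stands in for it without the scipy dependency:

```python
import math

def logsumexp(log_probs):
    """log(sum(exp(lp) for lp)) computed stably by shifting by the maximum."""
    m = max(log_probs)
    if m == -math.inf:  # all probabilities are zero
        return -math.inf
    return m + math.log(sum(math.exp(lp - m) for lp in log_probs))

# CpG state at the second position: marginalize over the three
# possible states at the previous position (start, CpG, non-CpG)
f_cpg_2 = logsumexp([-math.inf, -3.2, -4.7])
print(round(f_cpg_2, 1))  # -3.0
```

Shifting by the maximum before exponentiating keeps the largest term at exp(0) = 1, which is what avoids underflow when the log probabilities are large and negative.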
The log probabilities for the non-CpG state at the second position are:
- k1 = 0: -∞ + -0.7 + -2.0 = -∞
- k1 = 1: -1.7 + -1.0 + -2.0 = -4.7
- k1 = 2: -2.7 + -0.5 + -2.0 = -5.2
The log sum of probabilities (to one decimal place) is therefore -4.2. The marginal probabilities for each state at each position can be calculated the same way, and the completed forward matrix will be:
The forward matrix values fk,i were all calculated using a single decimal place of precision. A real implementation should of course use full floating point precision for a more accurate result.
At the last position we have two marginal probabilities: the joint probability of the sequence and all possible state paths ending in the CpG state, and the joint probability of those ending in the non-CpG state. The log sum of those two marginal probabilities is -12.6; this sum over all joint probabilities is the log marginal likelihood of the DNA sequence given our HMM.
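The full recursion can be sketched as follows. The two-state HMM here is a hypothetical toy (its transition and emission probabilities are made up for illustration, not the ones from the Viterbi example); the brute-force enumeration at the end confirms that the forward recursion returns the same marginal likelihood as summing the joint probability over every state path:

```python
import itertools
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))

def forward(seq, log_start, log_trans, log_emit):
    """Log marginal likelihood log P(x|M) via the forward algorithm."""
    n = len(log_start)
    # f[k] = log P(x[0..i], state at position i is k), marginalized over paths
    f = [log_start[k] + log_emit[k][seq[0]] for k in range(n)]
    for x in seq[1:]:
        f = [log_emit[k][x] + logsumexp([f[j] + log_trans[j][k] for j in range(n)])
             for k in range(n)]
    return logsumexp(f)  # sum over the possible states at the last position

# Hypothetical two-state HMM over DNA (made-up parameters)
lg = math.log
log_start = [lg(0.5), lg(0.5)]
log_trans = [[lg(0.9), lg(0.1)],
             [lg(0.2), lg(0.8)]]
log_emit = [{'A': lg(0.1), 'C': lg(0.4), 'G': lg(0.4), 'T': lg(0.1)},
            {'A': lg(0.3), 'C': lg(0.2), 'G': lg(0.2), 'T': lg(0.3)}]

seq = "GGCA"
fwd = forward(seq, log_start, log_trans, log_emit)

# Check: enumerate every state path and sum the joint probabilities
joint = []
for p in itertools.product(range(2), repeat=len(seq)):
    lp = log_start[p[0]] + log_emit[p[0]][seq[0]]
    for i in range(1, len(seq)):
        lp += log_trans[p[i - 1]][p[i]] + log_emit[p[i]][seq[i]]
    joint.append(lp)
assert abs(fwd - logsumexp(joint)) < 1e-9
```

The brute-force check costs O(S^L) for S states and L positions, while the forward recursion costs O(S²L), which is the whole point of the dynamic programming.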
We previously calculated the log probability of the maximum a posteriori path as -16.2. The posterior probability is therefore exp(-16.2 - (-12.6)) = exp(-3.6) ≈ 2.7%. The Viterbi result is vaguely plausible (events with 2.7% probability occur all the time) but not reliable (they usually don’t).
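In log space the final division becomes a subtraction, using the two numbers computed above:

```python
import math

log_map = -16.2       # log joint probability of the Viterbi (MAP) path
log_marginal = -12.6  # log marginal likelihood from the forward algorithm

# Posterior probability of the MAP path: divide by the marginal likelihood,
# i.e. subtract in log space before exponentiating
posterior = math.exp(log_map - log_marginal)
print(f"{posterior:.1%}")  # 2.7%
```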
For more information see section 3.2 of Biological Sequence Analysis by Durbin, Eddy, Krogh and Mitchison.
Update 13 October 2019: for another perspective on the forward algorithm consult Chapter 10 of Bioinformatics Algorithms (2nd or 3rd Edition) by Compeau and Pevzner.
¹ Such probabilities are conditioned on the model being correct. So more accurately, they are the probability of a parameter value being true if the model (e.g. an HMM) used for inference is also the model which generated the data. If the model is wrong, the probabilities can be quite spurious; see Yang and Zhu (2018).