Like the forward algorithm, the backward algorithm can be used to calculate the marginal likelihood of a hidden Markov model (HMM). Also like the forward algorithm, the backward algorithm is an instance of dynamic programming where the intermediate values are probabilities.
Recall the forward matrix values can be specified as:
fi,k = P(x1..i,πi=k|M)
That is, the forward matrix contains joint probabilities for the sequence up to the ith position, and the state at that position being k. These joint probabilities are not conditional on the previous states; instead, they marginalize over the hidden state paths leading up to i,k.
In contrast, the backward matrix contains probabilities for the sequence after the ith position, and these probabilities are conditional on the state being k at i:
bi,k = P(xi+1..n|πi=k,M)
To demonstrate the backward algorithm, we will use the same example sequence and HMM as for the Viterbi and forward algorithm demonstrations:
The backward matrix probabilities are marginalized over the hidden state path after i. To calculate them, initialize a backward matrix b of the same dimensions as the corresponding forward matrix. We will work in log space, so use negative infinity for the start state other than at the start position i = 0, and for the non-start states at the start position. The probability of an empty sequence after the last position is 100% regardless of the state at the last position, so fill in zeros for the non-start states at the last column i = n:
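As a sketch, the initialization can be written as follows. The state names match the example, but the sequence length n = 6 is a stand-in, since the exact length is not restated here:

```python
import numpy as np

# Stand-in setup mirroring the example: a start state plus two
# emitting states, and a sequence of length n (n = 6 is assumed here).
states = ["start", "CG rich", "CG poor"]
n = 6

# Backward matrix in log space: rows are states, columns are positions 0..n.
# Initialize every cell to negative infinity (log of zero probability).
b = np.full((len(states), n + 1), -np.inf)

# The probability of the empty sequence after the last position is 1
# (log probability 0) for every non-start state, so fill zeros at i = n.
b[1:, n] = 0.0
```

The remaining cells stay at negative infinity until the recursion fills them in, working from right to left.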
To calculate the backward probabilities for a given non-start hidden state k at the second-to-last position i = n - 1 through to the position of the first character i = 1, gather the following log probabilities for each non-start hidden state k’ at position i + 1:
- the emission probability ei+1,k’ of the observed state (character) at i + 1 given k’
- the hidden state transition probability tk,k’ from state k at i to state k’ at i + 1
- the probability bi+1,k’ of the sequence after i + 1 given state k’ at i + 1
The sum of the above log probabilities gives the log probability of the sequence after i and of hidden state k’ at i + 1, given a particular state k at i. Beginning at the second-to-last position i = n - 1, moving from the CG rich state to the CG rich state, the log probabilities are ei+1,k’ = -2, tk,k’ = -0.5 and bi+1,k’ = 0 respectively, and their sum is -2.5. The first two come from the HMM, and the last from the backward matrix. When moving from the CG rich state to the CG poor state, they are -1, -1 and 0 respectively, and the sum is -2.
Finally, we can calculate bi,k by marginalizing over both possible transitions, i.e. bi,k = log(Σk’ exp(ei+1,k’ + tk,k’ + bi+1,k’)). For the CG rich hidden state at the second-to-last position:
bn-1,CG rich = log(P(xn|πn-1=CG rich,M)) = log(e^-2.5 + e^-2) = -1.5
Likewise for the CG poor state at position n - 1, the marginal log probability is:
bn-1,CG poor = log(P(xn|πn-1=CG poor,M)) = log(e^-3 + e^-1.5) = -1.3
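This step can be sketched in Python, using the emission and transition log probabilities quoted in the worked example above:

```python
import math

# Log probabilities quoted above for position i = n - 1.
# e[k2]: emission log probability of the last character given state k2.
e = {"CG rich": -2.0, "CG poor": -1.0}
# t[k1][k2]: transition log probability from k1 at i to k2 at i + 1.
t = {"CG rich": {"CG rich": -0.5, "CG poor": -1.0},
     "CG poor": {"CG rich": -1.0, "CG poor": -0.5}}
# b_next[k2]: backward values at i + 1 (the last column, so all zeros).
b_next = {"CG rich": 0.0, "CG poor": 0.0}

def backward_value(k1):
    # Marginalize over the hidden state at i + 1 using log-sum-exp.
    terms = [e[k2] + t[k1][k2] + b_next[k2] for k2 in b_next]
    return math.log(sum(math.exp(v) for v in terms))

print(round(backward_value("CG rich"), 1))  # -1.5
print(round(backward_value("CG poor"), 1))  # -1.3
```

The same function applies at every earlier position; only b_next changes as the recursion moves leftward through the matrix.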
We can now update the matrix:
At the start position i = 0, the only valid hidden state is the start state. Therefore at that position we only need to calculate the probability of going from the start state to the CG rich or CG poor states. For moving to the CG rich state, the log probabilities are ei+1,k’ = -1.0, tk,k’ = -0.7 and bi+1,k’ = -11.2. For moving to the CG poor state, they are -2.0, -0.7 and -11.5 respectively. The sums are -12.9 and -14.2 respectively, and the log sum of exponentials is -12.7. We will use this to complete the backward matrix:
Because the only valid hidden state for the start position is the start state, the probability P(xi+1..n|πi=k,M) can be simplified to P(xi+1..n|M). Because the sequence after the start position is the entire sequence, it can be further simplified to P(x|M). In other words, this probability is our marginal likelihood! While this is slightly different from the marginal likelihood of -12.6 derived using the forward algorithm, that is a rounding error caused by our limited precision of one decimal place.
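The start-position calculation above can be reproduced directly from the quoted values:

```python
import math

# Log probabilities quoted above for the start position i = 0:
# emission + transition + backward value for each destination state.
to_rich = -1.0 + -0.7 + -11.2   # = -12.9
to_poor = -2.0 + -0.7 + -11.5   # = -14.2

# b[0, start], the log marginal likelihood, is the log-sum-exp of both moves.
log_marginal = math.log(math.exp(to_rich) + math.exp(to_poor))
print(round(log_marginal, 1))  # -12.7
```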
Why do we need two dynamic programming algorithms to compute the marginal likelihood? We don’t! But by combining probabilities from the two matrices, we can derive the posterior probability of each hidden state k at each position i, marginalized over all paths through k at i. How does this work? Let’s use Bayes’ rule to demonstrate:
P(πi=k|x,M) = P(x|πi=k,M) × P(πi=k|M) / P(x|M)
If two variables a and b are independent, their joint probability P(a,b) is simply the product of their probabilities P(a) × P(b). Normally the two segments of the sequence x1..i and xi+1..n are not independent because we are using a hidden Markov model. Under our model, the distribution of characters at a given site is dependent on the hidden state at that site, which in turn is dependent on the hidden state at the previous site.
But by conditioning on the hidden state at a given site i, the sequence after that site xi+1..n is independent of the sequence up to and including i. This is because the hidden state at i is fixed rather than depending on the previous hidden state or the observed character at i. In other words, while x1..i and xi+1..n are not independent, they are conditionally independent given πi = k! Therefore:
P(πi=k|x,M) = P(x1..i|πi=k,M) × P(xi+1..n|πi=k,M) × P(πi=k|M) / P(x|M)
By applying the chain rule, we can take the third term of the expression on the right side of our equation, and fold it into the first term of that expression. This changes the conditional probability to a joint probability:
P(πi=k|x,M) = P(x1..i,πi=k|M) × P(xi+1..n|πi=k,M) / P(x|M)
On the right side of the equation, the first term now corresponds to fi,k, the second term to bi,k, and the denominator P(x|M) to b0,start. This makes it possible to replace every term in the right-side expression with matrix values:
P(πi=k|x,M) = fi,k × bi,k / b0,start
And now we can “decode” our posterior distribution of hidden states. We need to refer back to the previously calculated forward matrix, shown below.
As an example, let’s solve the posterior probability that the hidden state of the fourth character is CG rich:
P(π4=CG rich|x,M) = f4,CG rich × b4,CG rich / b0,start = e^-7.0 × e^-6.6 / e^-12.7 = 41%
Since we only have two states, given a 41% posterior probability of the CG rich state, the probability of the CG poor state should be 59%, but the rounding errors caused by our lack of precision are causing serious problems:
P(π4=CG poor|x,M) = f4,CG poor × b4,CG poor / b0,start = e^-7.2 × e^-5.9 / e^-12.7 = 67%
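In log space, this posterior calculation is a single exponentiation of a sum, as the following sketch shows using the matrix values quoted above:

```python
import math

b0_start = -12.7  # log marginal likelihood from the backward matrix

def posterior(f_ik, b_ik):
    # P(pi_i = k | x, M) = exp(f[i,k] + b[i,k] - b[0,start])
    return math.exp(f_ik + b_ik - b0_start)

print(round(posterior(-7.0, -6.6), 2))  # 0.41 (CG rich at position 4)
print(round(posterior(-7.2, -5.9), 2))  # 0.67 (CG poor at position 4)
```

Note that the two values sum to more than 100%, which is exactly the rounding artifact described above: the inputs are only precise to one decimal place.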
Applying the same method at full precision, the posterior probabilities come out to 37% and 63% respectively. The posterior probabilities can be shown as a graph in order to clearly communicate your results:
This gives us a result that reflects the uncertainty of our inference given the limited data at hand. In my opinion, this presentation is more honest than the black-and-white maximum a posteriori result derived using Viterbi’s algorithm.
For another perspective on HMMs, including the Viterbi and Forward-Backward algorithms, consult Chapter 10 of Bioinformatics Algorithms (2nd or 3rd Edition) by Compeau and Pevzner.