# Backward algorithm

Like the forward algorithm, the backward algorithm can be used to calculate the marginal likelihood of a hidden Markov model (HMM). Also like the forward algorithm, the backward algorithm is an instance of dynamic programming in which the intermediate values are probabilities.

Recall the forward matrix values can be specified as:

f_{i,k} = P(x_{1..i},π_{i}=k|M)

That is, the forward matrix contains joint probabilities for the sequence
up to the *i*^{th} position, and the state at that position being *k*.
These joint probabilities are not conditional on the previous states; instead,
they marginalize over the hidden state paths leading up to *i*,*k*.

In contrast, the backward matrix contains probabilities for the sequence
*after* the *i*^{th} position, and these probabilities are conditional
on the state being *k* at *i*:

b_{i,k} = P(x_{i+1..n}|π_{i}=k,M)

To demonstrate the backward algorithm, we will use the same example sequence and HMM as for the Viterbi and forward algorithm demonstrations: the sequence GGCACTGAA, with its CG rich and CG poor hidden states.

The backward matrix probabilities are marginalized over the hidden state path
after *i*. To calculate them, initialize a backward matrix *b* of the same
dimensions as the corresponding forward matrix. We will work in log space, so
use negative infinity for the start state other than at the start position *i*
= 0, and for the non-start states at the start position. The probability of an
empty sequence after the last position is 100% regardless of the state at the
last position, so fill in zeros for the non-start states at the last column
*i* = *n*:

{} |  | G | G | C | A | C | T | G | A | A
---|---|---|---|---|---|---|---|---|---|---
start |  | -∞ | -∞ | -∞ | -∞ | -∞ | -∞ | -∞ | -∞ | -∞
CG rich | -∞ |  |  |  |  |  |  |  |  | 0.0
CG poor | -∞ |  |  |  |  |  |  |  |  | 0.0
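The initialization above can be sketched in Python. This is a minimal version; the dictionary-of-lists layout and variable names are illustrative choices, not from the original:

```python
NEG_INF = float("-inf")

seq = "GGCACTGAA"                        # the example sequence
states = ["start", "CG rich", "CG poor"]
n = len(seq)

# b[k][i] holds log P(x_{i+1..n} | pi_i = k); columns run from i = 0 to i = n
b = {k: [NEG_INF] * (n + 1) for k in states}

# the empty sequence after the last position has probability 1, i.e. log 1 = 0
for k in ("CG rich", "CG poor"):
    b[k][n] = 0.0
# b["start"][0] is left unfilled for now; it is computed last
```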

To calculate the backward probabilities for a given non-start hidden state *k* at the
second-to-last position *i* = *n* - 1 through to the position of the first character *i* = 1,
gather the following log probabilities for each non-start hidden state *k’* at position *i* + 1:

- the emission probability e_{i+1,k’} of the observed state (character) at *i* + 1 given *k’*
- the hidden state transition probability t_{k,k’} from state *k* at *i* to state *k’* at *i* + 1
- the probability *b*_{i+1,k’} of the sequence after *i* + 1 given state *k’* at *i* + 1

The sum of the above log probabilities gives us the joint log probability of the
character and hidden state at *i* + 1, given a particular state at *i*.
Beginning at the second-to-last position *i* = *n* - 1, moving from the CG rich state to the CG rich
state, the log probabilities are e_{i+1,k’} = -2, t_{k,k’} = -0.5 and *b*_{i+1,k’} = 0 respectively, and their sum is
-2.5. The first two are from the HMM, and the last is from the
backward matrix. When moving from the CG rich state to the CG poor state, they
are -1, -1 and 0 respectively and the sum is -2.

Finally, we can calculate *b*_{i,k} by marginalizing over both possible
transitions. For the CG rich hidden state at the second-to-last position:

*b*_{n-1,CG rich} = log(P(x_{n}|π_{n-1}=CG rich,M)) = log(e^{-2.5} + e^{-2}) = -1.5

Likewise for the CG poor state at position *n* - 1, the marginal log probability is:

*b*_{n-1,CG poor} = log(P(x_{n}|π_{n-1}=CG poor,M)) = log(e^{-3} + e^{-1.5}) = -1.3
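In code, each cell sums the three log probabilities per transition and then marginalizes with a log-sum-exp. A small sketch using the numbers from this worked example (the helper function is an illustrative addition, not from the original):

```python
import math

def log_sum_exp(xs):
    """Stable log(sum(exp(x) for x in xs)), for marginalizing in log space."""
    m = max(xs)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(x - m) for x in xs))

# from the CG rich state at i = n - 1: emission + transition + b for each k'
to_rich = -2.0 + -0.5 + 0.0                # -2.5
to_poor = -1.0 + -1.0 + 0.0                # -2.0
b_rich = log_sum_exp([to_rich, to_poor])   # ≈ -1.5

# likewise from the CG poor state
b_poor = log_sum_exp([-3.0, -1.5])         # ≈ -1.3
```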

We can now update the matrix:

{} |  | G | G | C | A | C | T | G | A | A
---|---|---|---|---|---|---|---|---|---|---
start |  | -∞ | -∞ | -∞ | -∞ | -∞ | -∞ | -∞ | -∞ | -∞
CG rich | -∞ | -11.2 | -9.9 | -8.6 | -7.0 | -5.7 | -4.1 | -2.9 | -1.5 | 0.0
CG poor | -∞ | -11.5 | -10.1 | -8.5 | -7.2 | -5.6 | -4.3 | -2.6 | -1.3 | 0.0

At the start position *i* = 0, the only valid hidden state is the start state.
Therefore at that position we only need to calculate the probability of going
from the start state to the CG rich or CG poor states. For moving to the CG rich state,
the log probabilities are e_{i+1,k’} = -1.0, t_{k,k’} = -0.7 and *b*_{i+1,k’} = -11.2.
For moving to the CG poor state, they are -2.0, -0.7 and -11.5 respectively.
The sums are -12.9 and -14.2 respectively, and the log sum of exponentials
is -12.7. We will use this to complete the backward matrix:

{} |  | G | G | C | A | C | T | G | A | A
---|---|---|---|---|---|---|---|---|---|---
start | -12.7 | -∞ | -∞ | -∞ | -∞ | -∞ | -∞ | -∞ | -∞ | -∞
CG rich | -∞ | -11.2 | -9.9 | -8.6 | -7.0 | -5.7 | -4.1 | -2.9 | -1.5 | 0.0
CG poor | -∞ | -11.5 | -10.1 | -8.5 | -7.2 | -5.6 | -4.3 | -2.6 | -1.3 | 0.0

Because the only valid hidden state for the start position is the start state,
the probability P(x_{i+1..n}|π_{i}=k,M) can be simplified to
P(x_{i+1..n}|M). Because the sequence after the start position is the
entire sequence, it can be further simplified to P(x|M). In other words, this
probability is our marginal likelihood! While this is slightly different from
the marginal likelihood of -12.6 derived using the forward algorithm, that is
a rounding error caused by our limited precision of one decimal place.
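The whole backward recursion can be sketched as follows. The log emission and transition parameters below are reconstructed from the rounded values quoted in this walkthrough, so they are assumptions rather than the HMM's exact figures, and because the hand-filled table above propagated one-decimal rounded values, some trailing decimals differ slightly from this full-precision version:

```python
import math

NEG_INF = float("-inf")

def log_sum_exp(xs):
    m = max(xs)
    return m if m == NEG_INF else m + math.log(sum(math.exp(x - m) for x in xs))

# assumed natural-log parameters, reconstructed from the walkthrough's rounded values
LOG_E = {"CG rich": {"G": -1.0, "C": -1.0, "A": -2.0, "T": -2.0},
         "CG poor": {"G": -2.0, "C": -2.0, "A": -1.0, "T": -1.0}}
LOG_T = {"CG rich": {"CG rich": -0.5, "CG poor": -1.0},
         "CG poor": {"CG rich": -1.0, "CG poor": -0.5},
         "start":   {"CG rich": -0.7, "CG poor": -0.7}}
STATES = ["CG rich", "CG poor"]

def backward(seq):
    n = len(seq)
    b = {k: [NEG_INF] * (n + 1) for k in STATES + ["start"]}
    for k in STATES:
        b[k][n] = 0.0                    # empty suffix: log(1)
    for i in range(n - 1, 0, -1):        # fill columns n-1 down to 1
        for k in STATES:
            b[k][i] = log_sum_exp([LOG_E[k2][seq[i]] + LOG_T[k][k2] + b[k2][i + 1]
                                   for k2 in STATES])
    # start column: this is the marginal log likelihood log P(x | M)
    b["start"][0] = log_sum_exp([LOG_E[k2][seq[0]] + LOG_T["start"][k2] + b[k2][1]
                                 for k2 in STATES])
    return b

b = backward("GGCACTGAA")
```

Running this reproduces the table's endpoints, including a marginal log likelihood of about -12.7 in `b["start"][0]`.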

Why do we need two dynamic programming algorithms to compute the marginal
likelihood? We don’t! But by combining probabilities from the two matrices, we
can derive the posterior probability of each hidden state *k* at each position
*i*, marginalized over all paths through *k* at *i*. How does this work? Let’s
use Bayes’ rule to demonstrate:

P(π_{i}=*k*|x,M) = P(x|π_{i}=*k*,M) × P(π_{i}=*k*|M) / P(x|M)

If two variables *a* and *b* are independent, their joint probability
P(*a*,*b*) is simply the product of their probabilities P(*a*) × P(*b*).
Normally the two segments of the sequence x_{1..i} and
x_{i+1..n} are not independent because we are using a hidden Markov
model. Under our model, the distribution of characters at a given site is
dependent on the hidden state at that site, which in turn is dependent on the
hidden state at the previous site.

But by conditioning on the hidden state at a given site *i*, the sequence
after that site x_{i+1..n} is independent of the sequence up to and
including *i*. This is because the hidden state at *i* is fixed rather than depending
on the previous hidden state, or the observed character at *i*. In other
words, while P(x_{1..i}|M) and P(x_{i+1..n}|M) are not
independent, P(x_{1..i}|π_{i}=*k*,M) and
P(x_{i+1..n}|π_{i}=*k*,M) are! Therefore:

P(π_{i}=*k*|x,M) = P(x_{1..i}|π_{i}=*k*,M) × P(x_{i+1..n}|π_{i}=*k*,M) × P(π_{i}=*k*|M) / P(x|M)

By applying the chain rule, we can take the third term of the expression on the right side of our equation, and fold it into the first term of that expression. This changes the conditional probability to a joint probability:

P(π_{i}=*k*|x,M) = P(x_{1..i},π_{i}=*k*|M) × P(x_{i+1..n}|π_{i}=*k*,M) / P(x|M)

On the right side of the equation, the first term now corresponds to *f*_{i,k},
the second term to *b*_{i,k}, and the third to *b*_{0,start}. Because the
matrices store log probabilities, each entry must be exponentiated. This
makes it possible to replace every term on the right side of the expression with matrix values:

P(π_{i}=*k*|x,M) = e^{*f*_{i,k}} × e^{*b*_{i,k}} / e^{*b*_{0,start}}
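Decoding a single cell is then one line of arithmetic. A minimal sketch with a hypothetical helper, fed the rounded matrix entries for the fourth position in the CG rich state:

```python
import math

def posterior(log_f, log_b, log_px):
    """P(pi_i = k | x, M) from log-space forward, backward and marginal values."""
    return math.exp(log_f + log_b - log_px)

# fourth position, CG rich state: f = -6.6, b = -7.0, log P(x|M) = -12.7
p = posterior(-6.6, -7.0, -12.7)    # ≈ 0.41
```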

And now we can “decode” our posterior distribution of hidden states. We need to refer back to the previously calculated forward matrix, shown below.

{} |  | G | G | C | A | C | T | G | A | A
---|---|---|---|---|---|---|---|---|---|---
start | 0.0 | -∞ | -∞ | -∞ | -∞ | -∞ | -∞ | -∞ | -∞ | -∞
CG rich | -∞ | -1.7 | -3.0 | -4.3 | -6.6 | -7.3 | -9.6 | -10.2 | -12.5 | -14.1
CG poor | -∞ | -2.7 | -4.2 | -5.6 | -5.9 | -8.1 | -8.7 | -11.0 | -11.6 | -12.9

As an example, let’s solve the posterior probability that the hidden state of the fourth character is CG rich:

P(π_{4}=*CG rich*|x,M) = e^{*f*_{4,CG rich}} × e^{*b*_{4,CG rich}} / e^{*b*_{0,start}} = e^{-6.6} × e^{-7.0} / e^{-12.7} = 41%

Since we only have two states, given a 41% posterior probability of the CG rich state, the probability of the CG poor state should be 59%, but the rounding errors caused by our lack of precision are causing serious problems:

P(π_{4}=*CG poor*|x,M) = e^{*f*_{4,CG poor}} × e^{*b*_{4,CG poor}} / e^{*b*_{0,start}} = e^{-5.9} × e^{-7.2} / e^{-12.7} = 67%

Using the above method at full precision, the posterior probabilities are 37% and 63% respectively. The posterior probabilities can be plotted position by position as a graph to clearly communicate your results.
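The full-precision decoding can be sketched end to end. As before, the log parameters are reconstructed from the walkthrough's rounded values and are therefore assumptions; with them, the fourth position comes out at roughly 37% CG rich and 63% CG poor:

```python
import math

NEG_INF = float("-inf")

def log_sum_exp(xs):
    m = max(xs)
    return m if m == NEG_INF else m + math.log(sum(math.exp(x - m) for x in xs))

# assumed natural-log parameters, reconstructed from the walkthrough's rounded values
LOG_E = {"CG rich": {"G": -1.0, "C": -1.0, "A": -2.0, "T": -2.0},
         "CG poor": {"G": -2.0, "C": -2.0, "A": -1.0, "T": -1.0}}
LOG_T = {"CG rich": {"CG rich": -0.5, "CG poor": -1.0},
         "CG poor": {"CG rich": -1.0, "CG poor": -0.5},
         "start":   {"CG rich": -0.7, "CG poor": -0.7}}
STATES = ["CG rich", "CG poor"]

def forward(seq):
    n = len(seq)
    f = {k: [NEG_INF] * (n + 1) for k in STATES}
    for k in STATES:                     # the first character follows the start state
        f[k][1] = LOG_T["start"][k] + LOG_E[k][seq[0]]
    for i in range(2, n + 1):
        for k in STATES:
            f[k][i] = LOG_E[k][seq[i - 1]] + log_sum_exp(
                [f[k0][i - 1] + LOG_T[k0][k] for k0 in STATES])
    return f

def backward(seq):
    n = len(seq)
    b = {k: [NEG_INF] * (n + 1) for k in STATES}
    for k in STATES:
        b[k][n] = 0.0
    for i in range(n - 1, 0, -1):
        for k in STATES:
            b[k][i] = log_sum_exp([LOG_E[k2][seq[i]] + LOG_T[k][k2] + b[k2][i + 1]
                                   for k2 in STATES])
    return b

def posteriors(seq):
    f, b = forward(seq), backward(seq)
    # marginal log likelihood from the backward values at the first position
    log_px = log_sum_exp([LOG_E[k][seq[0]] + LOG_T["start"][k] + b[k][1]
                          for k in STATES])
    return {k: [math.exp(f[k][i] + b[k][i] - log_px) for i in range(1, len(seq) + 1)]
            for k in STATES}

post = posteriors("GGCACTGAA")
```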

This gives us a result that reflects the uncertainty of our inference given
the limited data at hand. In my opinion, this presentation is more honest than
the black-and-white maximum *a posteriori* result derived using Viterbi’s
algorithm.

For another perspective on HMMs, including the Viterbi and Forward-Backward algorithms, consult Chapter 10 of *Bioinformatics Algorithms* (2nd or 3rd Edition) by Compeau and Pevzner.