Viterbi algorithm
The Viterbi algorithm is used to efficiently infer the most probable “path” of the unobserved random variable in an HMM. In the CpG islands case, this is the most probable combination of CGrich and CGpoor states over the length of the sequence. In the splicing case, this the most probable structure of the gene in terms of exons and introns.
Conceptually easier than Viterbi would be the brute force solution of calculating the probability for all possible paths. However the number of possible paths for two states, as in the CpG island model, is 2^{n} where n is the number of sites. For even a short sequence of 1000 nucleotides, this equates to 2^{1000} paths, or (very approximately^{1}) 10^{300}. This number is about 10^{220} times larger than the number of atoms in the observable universe.
I will first demonstrate how the algorithm works by applying the previously
described CpG island HMM to the sequence GGCACTGAA
. Because this
algorithm requires multiplying many probabilities together, we need to be
summing log probabilities instead, otherwise we will quickly exhaust the range
of floating point numbers. When we convert the linear probabilities in the HMM
to log probabilities, the HMM looks like this (to one decimal point):
Note: the above probabilities are no longer nice round numbers past one decimal point, but for this example we will be imprecise to make it easier to learn the algorithm. Draw up a table and fill in the probabilities of the states when the sequence is empty: 0 log probability (or 100%) for being in the start state at the start of the sequence, and ∞ log probability (or 0%) for not being in the start state at the start of the sequence:
{}  G  G  C  A  C  T  G  A  A  

start  0.0  
CpG  ∞  
notCpG  ∞ 
We will refer to every element of the matrix as v_{k,i} where k is the index of the state, and i is the index within the sequence (starting from the empty sequence at k = 0). v_{k,i} is the maximum joint probability of the sequence and path up to i, conditional on the state of the sequence at i being k:
v_{k,i} = max(P(seq_{0..i}, path_{0..i1}  path_{i} = k)).
This joint probability is equal to the maximum value of v_{ki1,i1}, plus the emission log probability e_{k,i} of the nucleotide (or amino acid for proteins) at i given k, plus the transition log probability t_{ki1,k} of transitioning from the state k_{i1} to k. We find the maximum value by checking each state at the previous position in the sequence, or in other words every value of k_{i1}.
The transition log probability from any state to the start state is ∞, so for any value of i from 1 onwards, v_{0,i>0} = ∞. Go ahead and fill those in to save time:
{}  G  G  C  A  C  T  G  A  A  

start  0.0  ∞  ∞  ∞  ∞  ∞  ∞  ∞  ∞  ∞ 
CpG  ∞  
notCpG  ∞ 
For the next element v_{1,1} things are more interesting. The maximum joint probability at i = 1 conditioned on k = 1 and k_{0} will be the maximum value of v_{k0,0} + e_{1,1} + t_{k0,1}:
 k_{0} = 0: 0.0 + 0.7 + 1.0 = 1.7
 k_{0} = 1: ∞ + 0.5 + 1.0 = ∞
 k_{0} = 2: ∞ + 1.0 + 1.0 = ∞
The maximum value is 1.7. The value of k_{0} which maximizes v_{1,1} is 0, or in computer science terms argmax_{k0} = 0. Fill in the maximum value, and also add a pointer to (k_{0} = 0) = v_{0,0}:
{}  G  G  C  A  C  T  G  A  A  

start  0.0  ∞  ∞  ∞  ∞  ∞  ∞  ∞  ∞  ∞ 
CpG  ∞  ↖1.7  
notCpG  ∞ 
Now calculate the max and argmax for v_{2,1}:
 k_{0} = 0: 0.0 + 0.7 + 2.0 = 2.7
 k_{0} = 1: ∞ + 1.0 + 2.0 = ∞
 k_{0} = 2: ∞ + 0.5 + 2.0 = ∞
Fill in the max value of 2.7, and add a pointer to the argmax element v_{0,0} (we will use two arrows to indicate “up two elements”, if drawing by hand it would be easier to draw a single direct arrow):
{}  G  G  C  A  C  T  G  A  A  

start  0.0  ∞  ∞  ∞  ∞  ∞  ∞  ∞  ∞  ∞ 
CpG  ∞  ↖1.7  
notCpG  ∞  ↖↖2.7 
For the next empty element v_{1,2}, we cannot transition from the start because it is the second site in the sequence, and k_{0} = 0 will have no probability. However we have to now consider two possible paths:
 k_{1} = 0: ∞ + 0.7 + 1.0 = ∞
 k_{1} = 1: 1.7 + 0.5 + 1.0 = 3.2
 k_{1} = 2: 2.7 + 1.0 + 1.0 = 4.7
Fill in the max value and a pointer to the argmax element:
{}  G  G  C  A  C  T  G  A  A  

start  0.0  ∞  ∞  ∞  ∞  ∞  ∞  ∞  ∞  ∞ 
CpG  ∞  ↖1.7  ←3.2  
notCpG  ∞  ↖↖2.7 
The next empty element v_{2,2} is similar:
 k_{1} = 0: ∞ + 0.7 + 2.0 = ∞
 k_{1} = 1: 1.7 + 1.0 + 2.0 = 4.7
 k_{1} = 2: 2.7 + 0.5 + 2.0 = 5.2
Fill in the max value, a pointer to the argmax element, and the rest of the matrix:
{}  G  G  C  A  C  T  G  A  A  

start  0.0  ∞  ∞  ∞  ∞  ∞  ∞  ∞  ∞  ∞ 
CpG  ∞  ↖1.7  ←3.2  ←4.7  ←7.2  ←8.7  ←11.2  ←12.7  ←15.2  ←17.7 
notCpG  ∞  ↖↖2.7  ↖4.7  ↖6.2  ↖6.7  ←9.2  ↖10.7  ←13.2  ↖14.7  ←16.2 
The maximum joint probability of the sequence and path is the maximum out of v_{k,L}, where L is the length of the sequence. In other words, if we calculate the joint probability
v_{k,L} = max(P(seq_{0..L}, path_{0..L1}  path_{L} = k)).
for every value of k_{L}, we can identify maximum joint probability unconditional on the value of k_{L}. The path is then reconstructed by following the pointers backwards from the maximum joint probability. In this example, the path is:
{}  G  G  C  A  C  T  G  A  A  

start  0.0  ∞  ∞  ∞  ∞  ∞  ∞  ∞  ∞  ∞ 
CpG  ∞  ↖1.7  ←3.2  ←4.7  ←7.2  ←8.7  ←11.2  ←12.7  ←15.2  ←17.7 
notCpG  ∞  ↖↖2.7  ↖4.7  ↖6.2  ↖6.7  ←9.2  ↖10.7  ←13.2  ↖14.7  ←16.2 
The bold elements indicate the state at each site which is part of the maximum joint probability solution. They form the argmax path to optimize the joint probability.
The basic Viterbi algorithm has a number of important properties:
 Its space and time complexity is O(Ln) and O(Ln^{2}) respectively, where n is the number of states and L is the length of the sequence
 It returns a point estimate rather than a probability distribution
 Like Needleman–Wunsch or Smith–Waterman it is exact, so it is guaranteed to find the optimal^{2} solution, unlike heuristic algorithms, and unlike an MCMC chain run for a finite number of steps^{3}
 The probability is the joint probability of the entire sequence (e.g. nucleotides or amino acids) and the entire path of unobserved states. It is not identifying the most probable state at each site, because it is not marginalizing over the remainder of the sequence and path.
If the joint probability is close to sum of all joint probabilities, in other words if there are no other plausible state paths, then the point estimate returned by the algorithm will be reliable. Let’s see how it performs for our example models.
We can use the following implementation of the Viterbi algorithm for CpG islands to find the most probable path through CpG and nonCpG states from data simulated under the same model. Keep in mind that making an estimate using the same model data was simulated under is the best case scenario for inference, and real inference will undoubtedly be less accurate.
import numpy
neginf = float("inf")
nucleotide_alphabet = "ACGT" # The alphabet of the nucleotide sites
state_alphabet = "SHL" # The alphabet of the states (start, CG High, CG Low)
# The emissions probability matrix. The rows correspond to the three states
# (start, CGrich, CGpoor). The columns and the probability of emitting
# [A, C, G, T] conditional on the state.
emission_probs = numpy.array([
[0.00, 0.00, 0.00, 0.00],
[0.13, 0.37, 0.37, 0.13],
[0.37, 0.13, 0.13, 0.37]
])
# The transition probabilities matrix. The rows correspond to the state at
# site i  1, and the columns correspond to the probability of each state at
# site i
transition_probs = numpy.array([
[0.00, 0.50, 0.50],
[0.00, 0.63, 0.37],
[0.00, 0.37, 0.63]
])
emission_log_probs = numpy.log(emission_probs)
transition_log_probs = numpy.log(transition_probs)
sequence = "GCGGCCGATATACTACAGTTGGCTTGAATGAACCAGGCCCGGCGCGGACAACCTGGGCCCCGGCGGTCGCCCATACCGCAATGGGAGCCGTTAACTCGCT"
length = len(sequence) + 1 # the length of the Viterbi matrix
v_matrix = numpy.zeros((length, 3)) # logprobabilities
p_matrix = numpy.zeros((length, 3), dtype = "int8") # pointers
v_matrix.fill(neginf)
p_matrix.fill(1)
v_matrix[0][0] = 0.0
for i in range(1, length):
nucleotide = nucleotide_alphabet.index(sequence[i  1])
for j in range(3):
for k in range(3):
vij = emission_log_probs[j][nucleotide] + v_matrix[i  1][k] + transition_log_probs[k][j]
# print(i, j, k, emission_log_probs[j][nucleotide], v_matrix[i  1][k], transition_log_probs[k][j], vij)
if vij > v_matrix[i][j]:
v_matrix[i][j] = vij
p_matrix[i][j] = k
next_map_k = numpy.argmax(v_matrix[length  1])
map_states = ""
for i in reversed(range(1, length)):
map_states = state_alphabet[next_map_k] + map_states
next_map_k = p_matrix[i][next_map_k]
print(map_states)
The first sequence is the true simulated path of CpG island “H” and nonCpG island “L” states, and the second sequence is the path estimated using the Viterbi algorithm as implemented above:
HHHLHHHLLHLLHHHHLLLLHHHHLHLLLHLLHHHHHHHHHHLLHHHHHLHHHLHHHHLHLHHHHLHHLHHHLHHHHHHHLHHHHLHHHHLLLLHLHHLL
HHHHHHHLLLLLLLLLLLLLHHHLLLLLLLLLHHHHHHHHHHHHHHHHHLLHHHHHHHHHHHHHHHHHHHHHLLLHHHHLLLHHHHHHHHLLLLHHHHHL
Although there is some resemblance between the true sequence and our Viterbi estimate, using the maximum joint probability of the entire sequence and path clearly leads to underestimating the number of transitions between states.
What about splice sites? The following code implements the Viterbi algorithm for the previously inferred gene structure HMM to analyze the Arabidopsis gene FOLB2 (which codes for an enzyme that is part of the folate biosynthesis pathway). This gene has a single intron with canonical splice sites AG and GC (the slashes represent where the RNA is cleaved).
# coding=utf8
import numpy
neginf = float("inf")
nucleotide_alphabet = "ACGT" # The alphabet of the nucleotide sites
# The alphabet of the states:
# 0. start
# 1. exon interior
# 2. exon 3'
# 3. intron 5'
# 4. intron interior
# 5. intron 3'
# 6. exon 5'
state_alphabet = "SEEiiiE"
# The emissions probability matrix. The rows correspond to the seven states
# (start state, exon and intron states). The columns and the probability of emitting
# [A, C, G, T] conditional on the state.
emission_log_probs = numpy.array([
[neginf, neginf, neginf, neginf],
[ 1.24, 1.60, 1.45, 1.30],
[ 2.35, 3.42, 0.26, 2.32],
[ 6.83, 10.58, 0.00, 9.60],
[ 1.28, 1.84, 1.84, 0.90],
[10.58, 6.90, 0.00, 9.38],
[ 1.37, 2.26, 0.65, 2.10],
])
# The transition probabilities matrix. The rows correspond to the state at
# site i  1, and the columns correspond to the probability of each state at
# site i
transition_log_probs = numpy.array([
[neginf, 0.00, neginf, neginf, neginf, neginf, neginf],
[neginf, 0.00, 5.56, neginf, neginf, neginf, neginf],
[neginf, neginf, neginf, 0.00, neginf, neginf, neginf],
[neginf, neginf, neginf, neginf, 0.00, neginf, neginf],
[neginf, neginf, neginf, neginf, 0.01, 5.03, neginf],
[neginf, neginf, neginf, neginf, neginf, neginf, 0.00],
[neginf, 0.00, neginf, neginf, neginf, neginf, neginf],
])
# FOLB2 sequence
sequence = """
ATGGAGAAAGACATGGCAATGATGGGAGACAAACTGATACTGAGAGGCTTGAAATTTTATGGTTTCCATGGAGCTATTCC
TGAAGAGAAGACGCTTGGCCAGATGTTTATGCTTGACATCGATGCTTGGATGTGTCTCAAAAAGGCTGGTCTATCAGACA
ACTTAGCTGATTCTGTCAGCTATGTCGACATTTACAAGTTAGTTTTAATTACTAATATGAGAGGATTTGCTAGAGATAGT
TAACTAAATTCTCCCCTTTACTCTTGACCAATCCATTTTTATTGTGACCTCATCCAAAAATGACAAGCTTTGCTTATATA
ACAATTTGTCATCACTATCTGTGTCACTGAGTGATGCATTGATTATAGGATATGAAATGATTCTTTGAGATTGAAGATTT
GAAAAGGTTGTGTGTAGGTTATGTAGTAGTGACTACACTTTTCATATGCTGTGTTTGAAACTGTATCATAATTTGTTTTG
GAATGGAATGAATAATCTTAGCGTGGCAAAGGAAGTTGTAGAAGGGTCATCAAGAAACCTTCTGGAGAGAGTTGCAGGAC
TTATAGCTTCCAAAACTCTGGAAATATCCCCTCGGATAACAGCTGTTCGAGTGAAGCTATGGAAGCCAAATGTTGCGCTT
ATTCAAAGCACTATCGATTATTTAGGTGTCGAGATTTTCAGAGATCGCGCAACTGAATAA""".replace("\n", "")
length = len(sequence) + 1 # the length of the viterbi matrix
v_matrix = numpy.zeros((length, 3)) # logprobabilities
p_matrix = numpy.zeros((length, 3), dtype = "int8") # pointers
v_matrix.fill(neginf)
p_matrix.fill(1)
v_matrix[0][0] = 0.0
for i in range(1, length):
nucleotide = nucleotide_alphabet.index(sequence[i  1])
for j in range(3):
for k in range(3):
vij = emission_log_probs[j][nucleotide] + v_matrix[i  1][k] + transition_log_probs[k][j]
if vij > v_matrix[i][j]:
v_matrix[i][j] = vij
p_matrix[i][j] = k
current_map_k = numpy.argmax(v_matrix[length  1])
current_status = state_alphabet[current_map_k]
print("maximum joint log probability = %f" % (v_matrix[length  1][current_map_k]))
map_states = ""
for i in reversed(range(1, length)):
next_map_k = p_matrix[i][current_map_k]
next_status = state_alphabet[next_map_k]
if next_map_k != current_map_k:
print("Transition from %s to %s at position %d" % (next_status, current_status, i))
current_map_k = next_map_k
current_status = next_status
Try running this code on your computer. It prints out any inferred transitions from exon to intron states or vice versa, but if there are no inferred transitions, it will only print out the transition from the start state to an exon state at the very first position. In fact the Viterbi algorithm fails to detect the intron at all in this case. Again it is underestimating the number of transitions between states.
Update 13 October 2019: for another perspective on the Viterbi algorithm, consult Chapter 10 of Bioinformatics Algorithms (2nd or 3rd Edition) by Compeau and Pevzner.

For base 2 numbers, shifting the decimal place to the right by 10 is roughly equivalent to shifting the decimal place to the right by 3 for base 10 numbers. 2^{1000} shifts the decimal place to the right by 10 × 100 places, so we know it is roughly equal to 10^{3×100}. ↩

Optimal in the sense of finding the true maximum a posteriori (MAP) solution, not in the sense of finding the true path. ↩

MCMC run for an infinite number of steps should also be exact (conditional on the Markov chain being ergodic). In practice, because life, and possibly the Universe, or at least all the matter in the universe, is finite, MCMC is not guaranteed to find the solution that optimizes some criterion. ↩