The Viterbi algorithm is used to efficiently infer the most probable “path” of the unobserved random variable in an HMM. In the CpG islands case, this is the most probable combination of CG-rich and CG-poor states over the length of the sequence. In the splicing case, this the most probable structure of the gene in terms of exons and introns.

Conceptually easier than Viterbi would be the brute force solution of calculating the probability for all possible paths. However the number of possible paths for two states, as in the CpG island model, is 2n where n is the number of sites. For even a short sequence of 1000 nucleotides, this equates to 21000 paths, or (very approximately1) 10300. This number is about 10220 times larger than the number of atoms in the observable universe.

I will first demonstrate how the algorithm works by applying the previously described CpG island HMM to the sequence GGCACTGAA. Because this algorithm requires multiplying many probabilities together, we need to be summing log probabilities instead, otherwise we will quickly exhaust the range of floating point numbers. When we convert the linear probabilities in the HMM to log probabilities, the HMM looks like this (to one decimal point):

Appropriate local alignment

Note: the above probabilities are no longer nice round numbers past one decimal point, but for this example we will be imprecise to make it easier to learn the algorithm. Draw up a table and fill in the probabilities of the states when the sequence is empty: 0 log probability (or 100%) for being in the start state at the start of the sequence, and -∞ log probability (or 0%) for not being in the start state at the start of the sequence:

  {} G G C A C T G A A
start 0.0                  
CpG -∞                  
notCpG -∞                  

We will refer to every element of the matrix as vk,i where k is the index of the state, and i is the index within the sequence (starting from the empty sequence at k = 0). vk,i is the maximum joint probability of the sequence and path up to i, conditional on the state of the sequence at i being k:

vk,i = max(P(seq0..i, path0..i-1 | pathi = k)).

This joint probability is equal to the maximum value of vki-1,i-1, plus the emission log probability ek,i of the nucleotide (or amino acid for proteins) at i given k, plus the transition log probability tki-1,k of transitioning from the state ki-1 to k. We find the maximum value by checking each state at the previous position in the sequence, or in other words every value of ki-1.

The transition log probability from any state to the start state is -∞, so for any value of i from 1 onwards, v0,i>0 = -∞. Go ahead and fill those in to save time:

  {} G G C A C T G A A
start 0.0 -∞ -∞ -∞ -∞ -∞ -∞ -∞ -∞ -∞
CpG -∞                  
notCpG -∞                  

For the next element v1,1 things are more interesting. The maximum joint probability at i = 1 conditioned on k = 1 and k0 will be the maximum value of vk0,0 + e1,1 + tk0,1:

  • k0 = 0: 0.0 + -0.7 + -1.0 = -1.7
  • k0 = 1: -∞ + -0.5 + -1.0 = -∞
  • k0 = 2: -∞ + -1.0 + -1.0 = -∞

The maximum value is -1.7. The value of k0 which maximizes v1,1 is 0, or in computer science terms argmaxk0 = 0. Fill in the maximum value, and also add a pointer to (k0 = 0) = v0,0:

  {} G G C A C T G A A
start 0.0 -∞ -∞ -∞ -∞ -∞ -∞ -∞ -∞ -∞
CpG -∞ ↖-1.7                
notCpG -∞                  

Now calculate the max and argmax for v2,1:

  • k0 = 0: 0.0 + -0.7 + -2.0 = -2.7
  • k0 = 1: -∞ + -1.0 + -2.0 = -∞
  • k0 = 2: -∞ + -0.5 + -2.0 = -∞

Fill in the max value of -2.7, and add a pointer to the argmax element v0,0 (we will use two arrows to indicate “up two elements”, if drawing by hand it would be easier to draw a single direct arrow):

  {} G G C A C T G A A
start 0.0 -∞ -∞ -∞ -∞ -∞ -∞ -∞ -∞ -∞
CpG -∞ ↖-1.7                
notCpG -∞ ↖↖-2.7                

For the next empty element v1,2, we cannot transition from the start because it is the second site in the sequence, and k0 = 0 will have no probability. However we have to now consider two possible paths:

  • k1 = 0: -∞ + -0.7 + -1.0 = -∞
  • k1 = 1: -1.7 + -0.5 + -1.0 = -3.2
  • k1 = 2: -2.7 + -1.0 + -1.0 = -4.7

Fill in the max value and a pointer to the argmax element:

  {} G G C A C T G A A
start 0.0 -∞ -∞ -∞ -∞ -∞ -∞ -∞ -∞ -∞
CpG -∞ ↖-1.7 ←-3.2              
notCpG -∞ ↖↖-2.7                

The next empty element v2,2 is similar:

  • k1 = 0: -∞ + -0.7 + -2.0 = -∞
  • k1 = 1: -1.7 + -1.0 + -2.0 = -4.7
  • k1 = 2: -2.7 + -0.5 + -2.0 = -5.2

Fill in the max value, a pointer to the argmax element, and the rest of the matrix:

  {} G G C A C T G A A
start 0.0 -∞ -∞ -∞ -∞ -∞ -∞ -∞ -∞ -∞
CpG -∞ ↖-1.7 ←-3.2 ←-4.7 ←-7.2 ←-8.7 ←-11.2 ←-12.7 ←-15.2 ←-17.7
notCpG -∞ ↖↖-2.7 ↖-4.7 ↖-6.2 ↖-6.7 ←-9.2 ↖-10.7 ←-13.2 ↖-14.7 ←-16.2

The maximum joint probability of the sequence and path is the maximum out of vk,L, where L is the length of the sequence. In other words, if we calculate the joint probability

vk,L = max(P(seq0..L, path0..L-1 | pathL = k)).

for every value of kL, we can identify maximum joint probability unconditional on the value of kL. The path is then reconstructed by following the pointers backwards from the maximum joint probability. In this example, the path is:

  {} G G C A C T G A A
start 0.0 -∞ -∞ -∞ -∞ -∞ -∞ -∞ -∞ -∞
CpG -∞ ↖-1.7 ←-3.2 ←-4.7 ←-7.2 ←-8.7 ←-11.2 ←-12.7 ←-15.2 ←-17.7
notCpG -∞ ↖↖-2.7 ↖-4.7 ↖-6.2 ↖-6.7 ←-9.2 ↖-10.7 ←-13.2 ↖-14.7 ←-16.2

The bold elements indicate the state at each site which is part of the maximum joint probability solution. They form the argmax path to optimize the joint probability.

The basic Viterbi algorithm has a number of important properties:

  • Its space and time complexity is O(Ln) and O(Ln2) respectively, where n is the number of states and L is the length of the sequence
  • It returns a point estimate rather than a probability distribution
  • Like Needleman–Wunsch or Smith–Waterman it is exact, so it is guaranteed to find the optimal2 solution, unlike heuristic algorithms, and unlike an MCMC chain run for a finite number of steps3
  • The probability is the joint probability of the entire sequence (e.g. nucleotides or amino acids) and the entire path of unobserved states. It is not identifying the most probable state at each site, because it is not marginalizing over the remainder of the sequence and path.

If the joint probability is close to sum of all joint probabilities, in other words if there are no other plausible state paths, then the point estimate returned by the algorithm will be reliable. Let’s see how it performs for our example models.

We can use the following implementation of the Viterbi algorithm for CpG islands to find the most probable path through CpG and non-CpG states from data simulated under the same model. Keep in mind that making an estimate using the same model data was simulated under is the best case scenario for inference, and real inference will undoubtedly be less accurate.

import numpy

neginf = float("-inf")

nucleotide_alphabet = "ACGT" # The alphabet of the nucleotide sites
state_alphabet = "SHL" # The alphabet of the states (start, CG High, CG Low)

# The emissions probability matrix. The rows correspond to the three states
# (start, CG-rich, CG-poor). The columns and the probability of emitting
# [A, C, G, T] conditional on the state.
emission_probs = numpy.array([
	[0.00, 0.00, 0.00, 0.00],
	[0.13, 0.37, 0.37, 0.13],
	[0.37, 0.13, 0.13, 0.37]
])

# The transition probabilities matrix. The rows correspond to the state at
# site i - 1, and the columns correspond to the probability of each state at
# site i
transition_probs = numpy.array([
	[0.00, 0.50, 0.50],
	[0.00, 0.63, 0.37],
	[0.00, 0.37, 0.63]
])

emission_log_probs = numpy.log(emission_probs)
transition_log_probs = numpy.log(transition_probs)

sequence = "GCGGCCGATATACTACAGTTGGCTTGAATGAACCAGGCCCGGCGCGGACAACCTGGGCCCCGGCGGTCGCCCATACCGCAATGGGAGCCGTTAACTCGCT"
length = len(sequence) + 1 # the length of the Viterbi matrix

v_matrix = numpy.zeros((length, 3)) # log-probabilities
p_matrix = numpy.zeros((length, 3), dtype = "int8") # pointers

v_matrix.fill(neginf)
p_matrix.fill(-1)

v_matrix[0][0] = 0.0

for i in range(1, length):
	nucleotide = nucleotide_alphabet.index(sequence[i - 1])
	for j in range(3):
		for k in range(3):
			vij = emission_log_probs[j][nucleotide] + v_matrix[i - 1][k] + transition_log_probs[k][j]
			# print(i, j, k, emission_log_probs[j][nucleotide], v_matrix[i - 1][k], transition_log_probs[k][j], vij)
			if vij > v_matrix[i][j]:
				v_matrix[i][j] = vij
				p_matrix[i][j] = k

next_map_k = numpy.argmax(v_matrix[length - 1])
map_states = ""
for i in reversed(range(1, length)):
	map_states = state_alphabet[next_map_k] + map_states
	next_map_k = p_matrix[i][next_map_k]

print(map_states)

The first sequence is the true simulated path of CpG island “H” and non-CpG island “L” states, and the second sequence is the path estimated using the Viterbi algorithm as implemented above:

HHHLHHHLLHLLHHHHLLLLHHHHLHLLLHLLHHHHHHHHHHLLHHHHHLHHHLHHHHLHLHHHHLHHLHHHLHHHHHHHLHHHHLHHHHLLLLHLHHLL
HHHHHHHLLLLLLLLLLLLLHHHLLLLLLLLLHHHHHHHHHHHHHHHHHLLHHHHHHHHHHHHHHHHHHHHHLLLHHHHLLLHHHHHHHHLLLLHHHHHL

Although there is some resemblance between the true sequence and our Viterbi estimate, using the maximum joint probability of the entire sequence and path clearly leads to underestimating the number of transitions between states.

What about splice sites? The following code implements the Viterbi algorithm for the previously inferred gene structure HMM to analyze the Arabidopsis gene FOLB2 (which codes for an enzyme that is part of the folate biosynthesis pathway). This gene has a single intron with canonical splice sites A|G and G|C (the slashes represent where the RNA is cleaved).

# coding=utf8

import numpy

neginf = float("-inf")

nucleotide_alphabet = "ACGT" # The alphabet of the nucleotide sites

# The alphabet of the states:
# 0. start
# 1. exon interior
# 2. exon 3'
# 3. intron 5'
# 4. intron interior
# 5. intron 3'
# 6. exon 5'
state_alphabet = "SEEiiiE"

# The emissions probability matrix. The rows correspond to the seven states
# (start state, exon and intron states). The columns and the probability of emitting
# [A, C, G, T] conditional on the state.
emission_log_probs = numpy.array([
	[neginf, neginf, neginf, neginf],
	[ -1.24,  -1.60,  -1.45,  -1.30],
	[ -2.35,  -3.42,  -0.26,  -2.32],
	[ -6.83, -10.58,   0.00,  -9.60],
	[ -1.28,  -1.84,  -1.84,  -0.90],
	[-10.58,  -6.90,   0.00,  -9.38],
	[ -1.37,  -2.26,  -0.65,  -2.10],
])

# The transition probabilities matrix. The rows correspond to the state at
# site i - 1, and the columns correspond to the probability of each state at
# site i
transition_log_probs = numpy.array([
	[neginf,   0.00, neginf, neginf, neginf, neginf, neginf],
	[neginf,   0.00,  -5.56, neginf, neginf, neginf, neginf],
	[neginf, neginf, neginf,   0.00, neginf, neginf, neginf],
	[neginf, neginf, neginf, neginf,   0.00, neginf, neginf],
	[neginf, neginf, neginf, neginf,  -0.01,  -5.03, neginf],
	[neginf, neginf, neginf, neginf, neginf, neginf,   0.00],
	[neginf,   0.00, neginf, neginf, neginf, neginf, neginf],
])

# FOLB2 sequence
sequence = """
ATGGAGAAAGACATGGCAATGATGGGAGACAAACTGATACTGAGAGGCTTGAAATTTTATGGTTTCCATGGAGCTATTCC
TGAAGAGAAGACGCTTGGCCAGATGTTTATGCTTGACATCGATGCTTGGATGTGTCTCAAAAAGGCTGGTCTATCAGACA
ACTTAGCTGATTCTGTCAGCTATGTCGACATTTACAAGTTAGTTTTAATTACTAATATGAGAGGATTTGCTAGAGATAGT
TAACTAAATTCTCCCCTTTACTCTTGACCAATCCATTTTTATTGTGACCTCATCCAAAAATGACAAGCTTTGCTTATATA
ACAATTTGTCATCACTATCTGTGTCACTGAGTGATGCATTGATTATAGGATATGAAATGATTCTTTGAGATTGAAGATTT
GAAAAGGTTGTGTGTAGGTTATGTAGTAGTGACTACACTTTTCATATGCTGTGTTTGAAACTGTATCATAATTTGTTTTG
GAATGGAATGAATAATCTTAGCGTGGCAAAGGAAGTTGTAGAAGGGTCATCAAGAAACCTTCTGGAGAGAGTTGCAGGAC
TTATAGCTTCCAAAACTCTGGAAATATCCCCTCGGATAACAGCTGTTCGAGTGAAGCTATGGAAGCCAAATGTTGCGCTT
ATTCAAAGCACTATCGATTATTTAGGTGTCGAGATTTTCAGAGATCGCGCAACTGAATAA""".replace("\n", "")
length = len(sequence) + 1 # the length of the viterbi matrix

v_matrix = numpy.zeros((length, 3)) # log-probabilities
p_matrix = numpy.zeros((length, 3), dtype = "int8") # pointers

v_matrix.fill(neginf)
p_matrix.fill(-1)

v_matrix[0][0] = 0.0

for i in range(1, length):
	nucleotide = nucleotide_alphabet.index(sequence[i - 1])
	for j in range(3):
		for k in range(3):
			vij = emission_log_probs[j][nucleotide] + v_matrix[i - 1][k] + transition_log_probs[k][j]
			if vij > v_matrix[i][j]:
				v_matrix[i][j] = vij
				p_matrix[i][j] = k

current_map_k = numpy.argmax(v_matrix[length - 1])
current_status = state_alphabet[current_map_k]

print("maximum joint log probability = %f" % (v_matrix[length - 1][current_map_k]))

map_states = ""
for i in reversed(range(1, length)):
	next_map_k = p_matrix[i][current_map_k]
	next_status = state_alphabet[next_map_k]

	if next_map_k != current_map_k:
		print("Transition from %s to %s at position %d" % (next_status, current_status, i))

	current_map_k = next_map_k
	current_status = next_status

Try running this code on your computer. It prints out any inferred transitions from exon to intron states or vice versa, but if there are no inferred transitions, it will only print out the transition from the start state to an exon state at the very first position. In fact the Viterbi algorithm fails to detect the intron at all in this case. Again it is underestimating the number of transitions between states.

Update 13 October 2019: for another perspective on the Viterbi algorithm, consult Chapter 10 of Bioinformatics Algorithms (2nd or 3rd Edition) by Compeau and Pevzner.

  1. For base 2 numbers, shifting the decimal place to the right by 10 is roughly equivalent to shifting the decimal place to the right by 3 for base 10 numbers. 21000 shifts the decimal place to the right by 10 × 100 places, so we know it is roughly equal to 103×100

  2. Optimal in the sense of finding the true maximum a posteriori (MAP) solution, not in the sense of finding the true path. 

  3. MCMC run for an infinite number of steps should also be exact (conditional on the Markov chain being ergodic). In practice, because life, and possibly the Universe, or at least all the matter in the universe, is finite, MCMC is not guaranteed to find the solution that optimizes some criterion.