# Likelihood of a tree

The likelihood of a tree is the probability of a multiple sequence alignment or matrix of trait states (commonly known as a character matrix) given a tree topology, branch lengths and substitution model. An efficient dynamic programming algorithm to compute this probability was first developed by Felsenstein in 1973, and is quite similar to the algorithm used to infer unequal-cost parsimony scores developed by Sankoff in 1975.

As with the Sankoff algorithm, a vector is associated with each node of the tree. Each element of the vector stores the probability of observing the tip states, given the tree below the associated node and the state corresponding to the element (the first, second, third and fourth elements usually correspond to A, C, G and T for DNA).

Those probabilities marginalize over all possible states at every internal node below the root of the subtree. These are known as partial likelihoods, and are in contrast with the vector elements of the Sankoff algorithm, which are calculated only from the states which minimize the total cost. We might write the partial likelihood for state \(k\) at node \(n\) as:

\[P_{n,k} = P(D_i|k, T, l, M)\]where \(D_i\) is the tip states at position \(i\) of the multiple sequence alignment or character matrix, \(T\) is the topology of the subtree under the node, \(l\) is the branch lengths of the subtree, and \(M\) is the substitution model. I will go over the five key differences between the two algorithms.

**One.** For the Sankoff algorithm the elements in the vectors at the tips are
initialized to either zero for the observed states or infinity otherwise,
because the only the observed state can be the state at the tips. However
because partial likelihoods are probabilities not costs, for likelihood they
are initialized to 1 for 100% probability (or 0 if working in log space) for
the observed states, and 0 for 0% probability (or negative infinity if
working in log space).

**Two.** Because Felsensteinâ€™s likelihood depends on branch lengths and not
just topology, the transition probabilities must be recomputed for each
branch. For the Jukes-Cantor model just two probabilties are needed because
it assumes equal base frequencies and transition rates. The first is the
probability of state \(k\) at the parent node and state \(k'\) at the child
node being the same **conditioned on** the \(k\):

where $\mu t$ is the product of the substitution rate and length of the branch in time, which is the length of the branch in substitutions per site. And the second is the probability of the state at the child node being different, again conditioned on the state at the parent node:

\[P(k' \ne k|k) = P_{xy} = \frac{1}{4}(1 - e^{-\frac{4}{3}\mu t})\]**Three.** Because the partial likelihoods marginalize over the internal node
states, for each child branch the probabilities for all child node states must
be summed over rather than finding the minimum cost. Using Jukes-Cantor, when
calculating the partial likelihood for state \(k\) at node \(n\), for the one
case where the state \(k'\) at the child node \(c\) equals \(k\), the
probability is \(P_{xx} P_{c,k'}\). For the three cases where it does not,
the probabilities are \(P_{xy}P_{c,k'}\). By summing all four probabilities,
we marginalize over the possible states at that child node.

**Four.** Cost accumulates, but the joint probability of independent
variables multiplies. So for parsimony the cost of the left and right subtrees
under a node (stored in the vectors associated with the left and right
children) and the cost of the mutations along the left and right child
branches (if any) are all added together. But for likelihood the left and
right marginal probabilities are multiplied. Why are left and right marginal
probabilities independent? Because sequences evolve independently along left
and right subtrees, conditioned on the state at the root.

This also applies when calculating the cost or likelihood of a sequence alignment or character matrix. For maximum parsimony the cost accumulates for each additional site, so the parsimony score of an alignment is the sum of minimum costs for each site. But for maximum likelihood the likelihood of each site is a probability and we treat each site as evolving independently, so the likelihood for the alignment is the product of site likelihoods.

**Five.** For maximum parsimony, the smallest element of the root node vector
gives the parsimony score of the tree. But for Felsensteinâ€™s likelihood, we want to
marginalize over root states, i.e. we want \(P(D_i|T,l,M)\) which does not
depend on state \(k\) at the root. Given the RNA alphabet
\(\{A,C,G,U\}\), we can perform this marginalization by summing over the joint
probabilities:

But the partial likelihoods at the root give us \(P(D_i|k, T, l, M)\), where state \(k\) is on the right side of the conditional. We can use the chain rule to convert them to joint probabilities:

\[P(D_i,k|T,l,M) = P(D_i|k,T,l,M) \cdot P(k)\]but what is \(P(k)\)? It is the stationary frequency of the state, which for Jukes-Cantor is always \(\frac{1}{4}\), so for that substitution model we just have to sum the partial likelihoods at the root and divide by four to get the likelihood of the tree.

The following code will calculate the likelihood of a tree (in Newick format) for a multiple sequence alignment (MSA in FASTA format), with the paths to the tree and MSA files given as the first and second arguments to the program.

```
import ete3
import numpy
import os.path
import sys
neginf = float("-inf")
# used by read_fasta to turn a sequence string into a vector of integers based
# on the supplied alphabet
def vectorize_sequence(sequence, alphabet):
sequence_length = len(sequence)
sequence_vector = numpy.zeros(sequence_length, dtype = numpy.uint8)
for i, char in enumerate(sequence):
sequence_vector[i] = alphabet.index(char)
return sequence_vector
# this is a function that reads in a multiple sequence alignment stored in
# FASTA format, and turns it into a matrix
def read_fasta(fasta_path, alphabet):
label_order = []
sequence_matrix = numpy.zeros(0, dtype = numpy.uint8)
fasta_file = open(fasta_path)
l = fasta_file.readline()
while l != "":
l_strip = l.rstrip() # strip out newline characters
if l[0] == ">":
label = l_strip[1:]
label_order.append(label)
else:
sequence_vector = vectorize_sequence(l_strip, alphabet)
sequence_matrix = numpy.concatenate((sequence_matrix, sequence_vector))
l = fasta_file.readline()
fasta_file.close()
n_sequences = len(label_order)
sequence_length = len(sequence_matrix) // n_sequences
sequence_matrix = sequence_matrix.reshape(n_sequences, sequence_length)
return label_order, sequence_matrix
# this is a function that reads in a phylogenetic tree stored in newick
# format, and turns it into an ete3 tree object
def read_newick(newick_path):
newick_file = open(newick_path)
newick = newick_file.read().strip()
newick_file.close()
tree = ete3.Tree(newick)
return tree
def recurse_likelihood(node, site_i, n_states):
if node.is_leaf():
node.partial_likelihoods.fill(0) # reset the leaf likelihoods
leaf_state = node.sequence[site_i]
node.partial_likelihoods[leaf_state] = 1
else:
left_child, right_child = node.get_children()
recurse_likelihood(left_child, site_i, n_states)
recurse_likelihood(right_child, site_i, n_states)
for node_state in range(n_states):
left_partial_likelihood = 0.0
right_partial_likelihood = 0.0
for child_state in range(n_states):
if node_state == child_state:
left_partial_likelihood += left_child.pxx * left_child.partial_likelihoods[child_state]
right_partial_likelihood += right_child.pxx * right_child.partial_likelihoods[child_state]
else:
left_partial_likelihood += left_child.pxy * left_child.partial_likelihoods[child_state]
right_partial_likelihood += right_child.pxy * right_child.partial_likelihoods[child_state]
node.partial_likelihoods[node_state] = left_partial_likelihood * right_partial_likelihood
# nucleotides, obviously
alphabet = "ACGT" # A = 0, C = 1, G = 2, T = 3
n_states = len(alphabet)
# this script requires a newick tree file and fasta sequence file, and
# the paths to those two files are given as arguments to this script
tree_path = sys.argv[1]
root_node = read_newick(tree_path)
msa_path = sys.argv[2]
taxa, alignment = read_fasta(msa_path, alphabet)
site_count = len(alignment[0])
# the number of taxa, and the number of nodes in a rooted phylogeny with that
# number of taxa
n_taxa = len(taxa)
n_nodes = n_taxa + n_taxa - 1
for node in root_node.traverse():
# initialize a vector of partial likelihoods that we can reuse for each site
node.partial_likelihoods = numpy.zeros(n_states)
# we can precalculate the pxx and pxy values for the branch associated with
# this node
node.pxx = (1 / 4) * (1 + 3 * numpy.exp(-(4 / 3) * node.dist))
node.pxy = (1 / 4) * (1 - numpy.exp(-(4 / 3) * node.dist))
# add sequences to leaves
if node.is_leaf():
taxon = node.name
taxon_i = taxa.index(taxon)
node.sequence = alignment[taxon_i]
# this will be the total likelihood of all sites
log_likelihood = 0.0
for site_i in range(site_count):
recurse_likelihood(root_node, site_i, n_states)
# need to multiply the partial likelihoods by the stationary frequencies
# which for Jukes-Cantor is 1/4 for all states
log_likelihood += numpy.log(numpy.sum(root_node.partial_likelihoods * (1 / 4)))
tree_filename = os.path.split(tree_path)[1]
msa_filename = os.path.split(msa_path)[1]
tree_name = os.path.splitext(tree_filename)[0]
msa_name = os.path.splitext(msa_filename)[0]
print("The log likelihood P(%s|%s) = %f" % (msa_name, tree_name, log_likelihood))
```