Likelihood of a tree

7 minute read

The likelihood of a tree is the probability of a multiple sequence alignment or matrix of trait states (commonly known as a character matrix) given a tree topology, branch lengths and substitution model. An efficient dynamic programming algorithm to compute this probability was first developed by Felsenstein in 1973, and is quite similar to the algorithm used to infer unequal-cost parsimony scores developed by Sankoff in 1975.

As with the Sankoff algorithm, a vector is associated with each node of the tree. Each element of the vector stores the probability of observing the tip states, given the tree below the associated node and the state corresponding to the element (the first, second, third and fourth elements usually correspond to A, C, G and T for DNA).

Those probabilities marginalize over all possible states at every internal node below the root of the subtree. These are known as partial likelihoods, and are in contrast with the vector elements of the Sankoff algorithm, which are calculated only from the states which minimize the total cost. We might write the partial likelihood for state \(k\) at node \(n\) as:

\[P_{n,k} = P(D_i|k, T, l, M)\]

where \(D_i\) is the tip states at position \(i\) of the multiple sequence alignment or character matrix, \(T\) is the topology of the subtree under the node, \(l\) is the branch lengths of the subtree, and \(M\) is the substitution model. I will go over the five key differences between the two algorithms.

One. For the Sankoff algorithm the elements in the vectors at the tips are initialized to either zero for the observed states or infinity otherwise, because the only the observed state can be the state at the tips. However because partial likelihoods are probabilities not costs, for likelihood they are initialized to 1 for 100% probability (or 0 if working in log space) for the observed states, and 0 for 0% probability (or negative infinity if working in log space).

Two. Because Felsenstein’s likelihood depends on branch lengths and not just topology, the transition probabilities must be recomputed for each branch. For the Jukes-Cantor model just two probabilties are needed because it assumes equal base frequencies and transition rates. The first is the probability of state \(k\) at the parent node and state \(k'\) at the child node being the same conditioned on the \(k\):

\[P(k' = k|k) = P_{xx} = \frac{1}{4}(1 + 3 e^{-\frac{4}{3}\mu t})\]

where $\mu t$ is the product of the substitution rate and length of the branch in time, which is the length of the branch in substitutions per site. And the second is the probability of the state at the child node being different, again conditioned on the state at the parent node:

\[P(k' \ne k|k) = P_{xy} = \frac{1}{4}(1 - e^{-\frac{4}{3}\mu t})\]

Three. Because the partial likelihoods marginalize over the internal node states, for each child branch the probabilities for all child node states must be summed over rather than finding the minimum cost. Using Jukes-Cantor, when calculating the partial likelihood for state \(k\) at node \(n\), for the one case where the state \(k'\) at the child node \(c\) equals \(k\), the probability is \(P_{xx} P_{c,k'}\). For the three cases where it does not, the probabilities are \(P_{xy}P_{c,k'}\). By summing all four probabilities, we marginalize over the possible states at that child node.

Four. Cost accumulates, but the joint probability of independent variables multiplies. So for parsimony the cost of the left and right subtrees under a node (stored in the vectors associated with the left and right children) and the cost of the mutations along the left and right child branches (if any) are all added together. But for likelihood the left and right marginal probabilities are multiplied. Why are left and right marginal probabilities independent? Because sequences evolve independently along left and right subtrees, conditioned on the state at the root.

This also applies when calculating the cost or likelihood of a sequence alignment or character matrix. For maximum parsimony the cost accumulates for each additional site, so the parsimony score of an alignment is the sum of minimum costs for each site. But for maximum likelihood the likelihood of each site is a probability and we treat each site as evolving independently, so the likelihood for the alignment is the product of site likelihoods.

Five. For maximum parsimony, the smallest element of the root node vector gives the parsimony score of the tree. But for Felsenstein’s likelihood, we want to marginalize over root states, i.e. we want \(P(D_i|T,l,M)\) which does not depend on state \(k\) at the root. Given the RNA alphabet \(\{A,C,G,U\}\), we can perform this marginalization by summing over the joint probabilities:

\[P(D_i|T,l,M) = P(D_i,k=A|T,l,M) + P(D_i,k=C|T,l,M) + P(D_i,k=G|T,l,M) + P(D_i,k=U|T,l,M)\]

But the partial likelihoods at the root give us \(P(D_i|k, T, l, M)\), where state \(k\) is on the right side of the conditional. We can use the chain rule to convert them to joint probabilities:

\[P(D_i,k|T,l,M) = P(D_i|k,T,l,M) \cdot P(k)\]

but what is \(P(k)\)? It is the stationary frequency of the state, which for Jukes-Cantor is always \(\frac{1}{4}\), so for that substitution model we just have to sum the partial likelihoods at the root and divide by four to get the likelihood of the tree.

The following code will calculate the likelihood of a tree (in Newick format) for a multiple sequence alignment (MSA in FASTA format), with the paths to the tree and MSA files given as the first and second arguments to the program.

import ete3
import numpy
import os.path
import sys

neginf = float("-inf")

# used by read_fasta to turn a sequence string into a vector of integers based
# on the supplied alphabet
def vectorize_sequence(sequence, alphabet):
	sequence_length = len(sequence)

	sequence_vector = numpy.zeros(sequence_length, dtype = numpy.uint8)

	for i, char in enumerate(sequence):
		sequence_vector[i] = alphabet.index(char)

	return sequence_vector

# this is a function that reads in a multiple sequence alignment stored in
# FASTA format, and turns it into a matrix
def read_fasta(fasta_path, alphabet):
	label_order = []
	sequence_matrix = numpy.zeros(0, dtype = numpy.uint8)

	fasta_file = open(fasta_path)

	l = fasta_file.readline()
	while l != "":
		l_strip = l.rstrip() # strip out newline characters

		if l[0] == ">":
			label = l_strip[1:]
			label_order.append(label)
		else:
			sequence_vector = vectorize_sequence(l_strip, alphabet)
			sequence_matrix = numpy.concatenate((sequence_matrix, sequence_vector))

		l = fasta_file.readline()

	fasta_file.close()

	n_sequences = len(label_order)
	sequence_length = len(sequence_matrix) // n_sequences
	sequence_matrix = sequence_matrix.reshape(n_sequences, sequence_length)

	return label_order, sequence_matrix

# this is a function that reads in a phylogenetic tree stored in newick
# format, and turns it into an ete3 tree object
def read_newick(newick_path):
	newick_file = open(newick_path)
	newick = newick_file.read().strip()
	newick_file.close()

	tree = ete3.Tree(newick)

	return tree

def recurse_likelihood(node, site_i, n_states):
	if node.is_leaf():
		node.partial_likelihoods.fill(0) # reset the leaf likelihoods
		leaf_state = node.sequence[site_i]
		node.partial_likelihoods[leaf_state] = 1
	else:
		left_child, right_child = node.get_children()

		recurse_likelihood(left_child, site_i, n_states)
		recurse_likelihood(right_child, site_i, n_states)

		for node_state in range(n_states):
			left_partial_likelihood = 0.0
			right_partial_likelihood = 0.0
			for child_state in range(n_states):
				if node_state == child_state:
					left_partial_likelihood += left_child.pxx * left_child.partial_likelihoods[child_state]
					right_partial_likelihood += right_child.pxx * right_child.partial_likelihoods[child_state]
				else:
					left_partial_likelihood += left_child.pxy * left_child.partial_likelihoods[child_state]
					right_partial_likelihood += right_child.pxy * right_child.partial_likelihoods[child_state]

			node.partial_likelihoods[node_state] = left_partial_likelihood * right_partial_likelihood

# nucleotides, obviously
alphabet = "ACGT" # A = 0, C = 1, G = 2, T = 3
n_states = len(alphabet)

# this script requires a newick tree file and fasta sequence file, and
# the paths to those two files are given as arguments to this script

tree_path = sys.argv[1]
root_node = read_newick(tree_path)

msa_path = sys.argv[2]
taxa, alignment = read_fasta(msa_path, alphabet)
site_count = len(alignment[0])

# the number of taxa, and the number of nodes in a rooted phylogeny with that
# number of taxa
n_taxa = len(taxa)
n_nodes = n_taxa + n_taxa - 1

for node in root_node.traverse():
	# initialize a vector of partial likelihoods that we can reuse for each site
	node.partial_likelihoods = numpy.zeros(n_states)

	# we can precalculate the pxx and pxy values for the branch associated with
	# this node
	node.pxx = (1 / 4) * (1 + 3 * numpy.exp(-(4 / 3) * node.dist))
	node.pxy = (1 / 4) * (1 - numpy.exp(-(4 / 3) * node.dist))

	# add sequences to leaves
	if node.is_leaf():
		taxon = node.name
		taxon_i = taxa.index(taxon)
		node.sequence = alignment[taxon_i]

# this will be the total likelihood of all sites
log_likelihood = 0.0

for site_i in range(site_count):
	recurse_likelihood(root_node, site_i, n_states)

	# need to multiply the partial likelihoods by the stationary frequencies
	# which for Jukes-Cantor is 1/4 for all states
	log_likelihood += numpy.log(numpy.sum(root_node.partial_likelihoods * (1 / 4)))

tree_filename = os.path.split(tree_path)[1]
msa_filename = os.path.split(msa_path)[1]

tree_name = os.path.splitext(tree_filename)[0]
msa_name = os.path.splitext(msa_filename)[0]

print("The log likelihood P(%s|%s) = %f" % (msa_name, tree_name, log_likelihood))

Categories:

Updated: