Species and Gene Evolution (Huw A. Ogilvie, Rice University)

Calculating the likelihood for an ultrametric tree (example)
2019-12-04

<p>In this example we will calculate the likelihood <script type="math/tex">P(D|T,h)</script> where <script type="math/tex">D</script> is a
single site, <script type="math/tex">T</script> is a rooted tree topology, and <script type="math/tex">h</script> is the node heights
for the tree topology. Since we are using node heights instead of branch
lengths, and if we make the node heights at the tips all zeros, the tree is
necessarily ultrametric. The site pattern, topology and branch lengths
correspond to the following tree:</p>
<p><img src="http://www.cs.rice.edu/~ogilvie/assets/ultrametric-likelihood-0.png" alt="Example ultrametric tree" /></p>
<p>The node heights <script type="math/tex">\tau</script> are given in some unit of time <script type="math/tex">t</script> before present.
As long as the substitution rate <script type="math/tex">\mu</script> is constant across the tree (i.e. we
are assuming a strict molecular clock), there are three unique branch lengths
<script type="math/tex">l = \mu t</script> in expected substitutions per site. In this example we assume
a constant rate <script type="math/tex">\mu = 0.1</script>.</p>
<p>The branch lengths of humans and chimps in substitutions per site are
both 0.1, the branch length of the ancestor of humans and chimps (HC) is 0.2,
and the branch length of gorillas is 0.3. We will calculate the likelihood
under the Jukes–Cantor model, so we only have to calculate the probability of
the state being the same by the end of a branch (e.g. A to A), and the
probability of the state being something else (e.g. A to C), given the state
at the beginning and the branch length.</p>
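These two Jukes–Cantor probabilities are simple functions of the branch length, and can be checked with a few lines of Python (a quick sketch; the function names are mine):

```python
import math

def jc_pxx(l):
    # probability that the state at the end of a branch of length l
    # (in substitutions per site) equals the state at the start
    return 0.25 * (1.0 + 3.0 * math.exp(-4.0 * l / 3.0))

def jc_pxy(l):
    # probability of one particular different state at the end
    return 0.25 * (1.0 - math.exp(-4.0 * l / 3.0))

# the human and chimp branches from the example
print(round(jc_pxx(0.1), 4))  # → 0.9064
print(round(jc_pxy(0.1), 4))  # → 0.0312
```

Note that the probability of staying the same plus three times the probability of any one different state sums to one.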
<p>For the human and chimp branches, these will be (to four decimal places):</p>
<script type="math/tex; mode=display">P_{xx}(0.1) = \frac{1}{4}\left(1 + 3e^{-\frac{4}{3}0.1}\right) = 0.9064</script>
<script type="math/tex; mode=display">P_{xy}(0.1) = \frac{1}{4}\left(1 - e^{-\frac{4}{3}0.1}\right) = 0.0312</script>
<p>For the HC branch, these will be:</p>
<script type="math/tex; mode=display">P_{xx}(0.2) = \frac{1}{4}\left(1 + 3e^{-\frac{4}{3}0.2}\right) = 0.8245</script>
<script type="math/tex; mode=display">P_{xy}(0.2) = \frac{1}{4}\left(1 - e^{-\frac{4}{3}0.2}\right) = 0.0585</script>
<p>For the gorilla branch, these will be:</p>
<script type="math/tex; mode=display">P_{xx}(0.3) = \frac{1}{4}\left(1 + 3e^{-\frac{4}{3}0.3}\right) = 0.7528</script>
<script type="math/tex; mode=display">P_{xy}(0.3) = \frac{1}{4}\left(1 - e^{-\frac{4}{3}0.3}\right) = 0.0824</script>
<p>For the tip nodes, the partial likelihoods are 1 for the observed states, and
0 otherwise:</p>
<p><img src="http://www.cs.rice.edu/~ogilvie/assets/ultrametric-likelihood-1.png" alt="Tip partial likelihoods" /></p>
<p>For each internal node we have to consider the left and right children
separately. We will start off by calculating the partial likelihood of state A
of the HC internal node. Beginning with the left child (humans), the
probability of the child state being A is the probability of the end state
being the same (as calculated above), multiplied by the partial likelihood of the
child state. This is <script type="math/tex">0.9064 \times 1 = 0.9064</script>. For child states C, G and
T, the probabilities will be <script type="math/tex">0.0312 \times 0 = 0</script>, so the probability for
the left child branch integrating over all child states is
<script type="math/tex">0.9064 + 0 + 0 + 0 = 0.9064</script>.</p>
<p>The right child (chimpanzees) has the same branch length and partial
likelihoods, so its probability will also be <script type="math/tex">0.9064</script>, and the partial
likelihood of state A for the HC node will be <script type="math/tex">0.9064 \times 0.9064 =
0.8215</script>. We use the product because we want to calculate the probability of
the left <strong>and</strong> right subtree states.</p>
<p>For state C in the HC node, the probability along the left branch for child
state A will be <script type="math/tex">0.0312 \times 1 = 0.0312</script>. The probability for state C will
be <script type="math/tex">0.9064 \times 0 = 0</script>, and for states G and T will be <script type="math/tex">0.0312 \times 0 =
0</script>. So the probability for the left branch integrating over child states will
be <script type="math/tex">0.0312</script>. Again the right branch will be the same, so the partial likelihood
of state C will be <script type="math/tex">0.0312 \times 0.0312 = 0.00097</script>.</p>
<p>Because of the equal base frequencies and equal rates assumption in
Jukes–Cantor, the partial likelihoods of G and T will be the same as for C.</p>
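The partial likelihoods at the HC node can be written out directly (a sketch with my own helper names; tip vectors are in A, C, G, T order):

```python
import math

def jc(l, same):
    # Jukes-Cantor transition probability for a branch of length l
    e = math.exp(-4.0 * l / 3.0)
    return 0.25 * (1.0 + 3.0 * e) if same else 0.25 * (1.0 - e)

human = [1, 0, 0, 0]  # observed A
chimp = [1, 0, 0, 0]  # observed A

hc_partials = []
for k in range(4):  # state at the HC node
    # sum over child states along each branch, then multiply left and right
    left = sum(jc(0.1, k == kc) * human[kc] for kc in range(4))
    right = sum(jc(0.1, k == kc) * chimp[kc] for kc in range(4))
    hc_partials.append(left * right)

print([round(p, 5) for p in hc_partials])  # → [0.82152, 0.00097, 0.00097, 0.00097]
```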
<p><img src="http://www.cs.rice.edu/~ogilvie/assets/ultrametric-likelihood-2.png" alt="Human--chimp partial likelihoods" /></p>
<p>Now for state A at the root, the probability along the left branch for child
state A will be the probability of the state remaining the same given a branch
length of 0.2, multiplied by the partial likelihood of state A for the HC
node, or <script type="math/tex">0.8245 \times 0.8215 = 0.6773</script>. For child states C, G and T it
will be <script type="math/tex">0.0585 \times 0.00097 = 0.000057</script>, which is the probability of the state
being different at the end given a branch length of 0.2 multiplied by the
partial likelihoods. So the probability along the left branch for state A at
the root integrating over the left child states will be <script type="math/tex">0.6773 + 3 \times
0.000057 = 0.6775</script>.</p>
<p>For the right child (gorillas) only state C has a non-zero partial likelihood,
so we should multiply the above by the probability of a different state given
the branch length 0.3 to get the partial likelihood of state A at the root,
which will be <script type="math/tex">0.6775 \times 0.0824 = 0.0558</script>.</p>
<p><img src="http://www.cs.rice.edu/~ogilvie/assets/ultrametric-likelihood-3.png" alt="Root state A partial likelihood" /></p>
<p>For state C at the root, the probability of child state A along the left
(HC) branch will be <script type="math/tex">0.0585 \times 0.8215 = 0.0481</script>, the probability of child
state C will be <script type="math/tex">0.8245 \times 0.00097 = 0.0008</script>, and the probabilities of
child states G or T will be <script type="math/tex">0.0585 \times 0.00097 = 0.000057</script>. So
integrating over the child states for the left branch, the probability will be
<script type="math/tex">0.0481 + 0.0008 + 2 \times 0.000057 = 0.0490</script>. Again because of the
symmetry in Jukes–Cantor, the probability along the left branch will be the
same for root states G and T.</p>
<p>However for state C at the root, the probability along the right
(gorilla) branch will be the probability of the <em>same</em> state at the end given a
branch length of 0.3, but for states G and T the probabilities will be
for a <em>different</em> state. So for state C at the root the partial likelihood
will be <script type="math/tex">0.0490 \times 0.7528 = 0.0369</script>, but for states G and T their partial
likelihoods will be <script type="math/tex">0.0490 \times 0.0824 = 0.0040</script>.</p>
<p><img src="http://www.cs.rice.edu/~ogilvie/assets/ultrametric-likelihood-4.png" alt="Root state partial likelihoods" /></p>
<p>Each partial likelihood for a node <script type="math/tex">n</script> is conditioned on the state <script type="math/tex">k</script> at
that node <script type="math/tex">P(D|n=k,T,h)</script>, but to calculate the likelihood at a node
<script type="math/tex">P(D|T,h)</script> we need to integrate over the probabilities <script type="math/tex">P(D,n=k|T,h)</script> for
each state at that node. Following the chain rule, we can convert the
conditional likelihoods to joint probabilities by multiplying the partials by
the base (stationary) frequencies. For Jukes–Cantor the base frequencies are
all equal and hence <script type="math/tex">\frac{1}{4}</script> given there are 4 nucleotide states.</p>
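Putting the whole example together, the pruning calculation for this single site can be sketched end to end (my own helper functions; tip vectors in A, C, G, T order):

```python
import math

def jc(l, same):
    # Jukes-Cantor transition probability for a branch of length l
    e = math.exp(-4.0 * l / 3.0)
    return 0.25 * (1.0 + 3.0 * e) if same else 0.25 * (1.0 - e)

def branch(l, child):
    # probability along one branch for each parent state, summing over the
    # four possible child states
    return [sum(jc(l, k == kc) * child[kc] for kc in range(4)) for k in range(4)]

human   = [1, 0, 0, 0]  # A
chimp   = [1, 0, 0, 0]  # A
gorilla = [0, 1, 0, 0]  # C

hc   = [l * r for l, r in zip(branch(0.1, human), branch(0.1, chimp))]
root = [l * r for l, r in zip(branch(0.2, hc), branch(0.3, gorilla))]

# marginalize over root states using the Jukes-Cantor base frequencies of 1/4
likelihood = sum(root) / 4
print(round(likelihood, 4))  # → 0.0252
```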
<p>So we can calculate the likelihood for the entire tree by summing the root
partial likelihoods and dividing by 4. For this tree and site, the
likelihood <script type="math/tex">P(D|T,h) = \frac{0.0558 + 0.0369 + 2 \times 0.0040}{4} = 0.0252</script>.</p>

Hill climbing and NNI
2019-12-02

<p>The Sankoff algorithm can efficiently calculate the parsimony score of a tree
topology. Felsenstein’s pruning algorithm can efficiently calculate the
probability of a multiple sequence alignment given a tree with branch lengths
and a substitution model. But how can the tree with the lowest parsimony
score, or highest likelihood, or highest posterior probability be identified?</p>
<p>Possibly the simplest algorithm that can do this for most kinds of inference
is hill-climbing. This algorithm basically works like this for <strong>maximum
likelihood</strong> inference:</p>
<ol>
<li>Initialize the parameters <script type="math/tex">\theta</script></li>
<li>Calculate the likelihood <script type="math/tex">L = P(D\vert\theta)</script></li>
<li>Propose a small modification to <script type="math/tex">\theta</script> and call it <script type="math/tex">\theta'</script></li>
<li>Calculate the likelihood <script type="math/tex">L' = P(D\vert\theta')</script></li>
<li>If <script type="math/tex">L' > L</script>, accept <script type="math/tex">\theta \leftarrow \theta'</script> and <script type="math/tex">L \leftarrow L'</script></li>
<li>If stopping criteria are not met, go to 3</li>
</ol>
<p>You may notice that without <strong>stopping criteria</strong>, the algorithm is an
infinite loop. How do we know when to give up? Three obvious criteria that can
be used are:</p>
<ol>
<li>Stop after a certain number of proposals are rejected in a row (without being interrupted by any successful proposals)</li>
<li>Stop after running the algorithm for a certain length of time</li>
<li>Stop after running the algorithm for a certain number of iterations through the loop</li>
</ol>
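The maximum likelihood loop above, with the first stopping criterion, can be sketched like this (a toy sketch: the score function and proposal distribution are hypothetical one-dimensional stand-ins, not a phylogenetic likelihood):

```python
import random

def hill_climb(theta, score, propose, max_rejects=100):
    # generic hill climbing: propose small changes and accept only
    # improvements, stopping after max_rejects consecutive rejections
    best = score(theta)
    rejects = 0
    while rejects < max_rejects:
        theta_prime = propose(theta)
        score_prime = score(theta_prime)
        if score_prime > best:
            theta, best = theta_prime, score_prime
            rejects = 0
        else:
            rejects += 1
    return theta, best

# toy example: climb a one-dimensional concave "likelihood" surface
random.seed(1)
theta_hat, best = hill_climb(
    0.0,
    score=lambda x: -(x - 3.0) ** 2,
    propose=lambda x: x + random.uniform(-0.5, 0.5),
)
print(theta_hat)
```

For maximum parsimony the acceptance test flips to `score_prime < best`, since lower scores are better.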
<p>For <strong>maximum <em>a posteriori</em></strong> inference, we also need to calculate the prior
probability <script type="math/tex">P(\theta)</script>. Because the marginal likelihood <script type="math/tex">P(D)</script> does not
change, following Bayes’ rule the posterior probability <script type="math/tex">P(\theta\vert D)</script> is
proportional to <script type="math/tex">P(D\vert\theta)P(\theta)</script>, which we might call the unnormalized
posterior probability. So instead of maximizing the likelihood, we instead
maximize the product of the likelihood and prior, which we have to recalculate
for each proposal. The algorithm becomes:</p>
<ol>
<li>Initialize the parameters <script type="math/tex">\theta</script></li>
<li>Calculate the unnormalized posterior probability <script type="math/tex">P = P(D\vert\theta)P(\theta)</script></li>
<li>Propose a small modification to <script type="math/tex">\theta</script> and call it <script type="math/tex">\theta'</script></li>
<li>Calculate the unnormalized posterior probability <script type="math/tex">P' = P(D\vert\theta')P(\theta')</script></li>
<li>If <script type="math/tex">P' > P</script>, accept <script type="math/tex">\theta \leftarrow \theta'</script> and <script type="math/tex">P \leftarrow P'</script></li>
<li>If stopping criteria are not met, go to 3</li>
</ol>
<p>For <strong>maximum parsimony</strong> inference, we simply need to calculate the parsimony
score of our parameters, so I will describe this as a function <script type="math/tex">f(D,\theta)</script>
which returns the parsimony score. The algorithm becomes:</p>
<ol>
<li>Initialize the parameters <script type="math/tex">\theta</script></li>
<li>Calculate the parsimony score <script type="math/tex">S = f(D,\theta)</script></li>
<li>Propose a small modification to <script type="math/tex">\theta</script> and call it <script type="math/tex">\theta'</script></li>
<li>Calculate the parsimony score <script type="math/tex">S' = f(D,\theta')</script></li>
<li>If <script type="math/tex">S' < S</script>, accept <script type="math/tex">\theta \leftarrow \theta'</script> and <script type="math/tex">S \leftarrow S'</script></li>
<li>If stopping criteria are not met, go to 3</li>
</ol>
<p>Note that the inequality is reversed in step 5 for maximum parsimony. These
are all described for general cases, but for phylogenetic inference <script type="math/tex">\theta</script>
will correspond to a tree topology, and possibly branch lengths (for
non-ultrametric trees) or node heights (for ultrametric trees). Maximum
parsimony is unaffected by branch lengths, so <script type="math/tex">\theta</script> is only the tree
topology. Proposing changes to branch lengths or node heights is relatively
simple because we can use some kind of uniform, Gaussian or other proposal
distribution. But how do we propose a small change to the tree topology?</p>
<p>A huge amount of research has gone into tree changing “operators,” but the
simplest and most straightforward is nearest-neighbor interchange, or NNI.
This works by isolating an internal branch of a tree, which for an unrooted
tree always has four connected branches. The four nodes at the end of the
connected branches may be tips or other internal nodes, because NNI can work
on trees of any size.</p>
<p>One of the nodes is fixed in place (in this example, humans), and its sister
node is exchanged with one of the two other nodes.</p>
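This exchange can be sketched with a minimal quartet representation, where the four subtrees around the internal branch are grouped into two pairs (a toy representation of my own, not a general tree data structure):

```python
def nni_neighbors(tree):
    # the four subtrees around one internal branch, as (left pair, right pair)
    (a, b), (c, d) = tree
    # keep a fixed in place, and exchange its sister b with c or with d
    return [((a, c), (b, d)), ((a, d), (c, b))]

quartet = (("human", "chimp"), ("gorilla", "mouse"))
for neighbor in nni_neighbors(quartet):
    print(neighbor)
```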
<p><img src="http://www.cs.rice.edu/~ogilvie/assets/unrooted-nni.png" alt="Unrooted NNI" /></p>
<p>For example the NNI move from the tree at the top to the tree in the
bottom-right exchanges mouse (M) with chimpanzee (C), causing the sister of
humans to change from chimps to mice. For four taxon trees there are only
three topologies, and they are all connected by a single NNI move. For five
taxon unrooted trees there are fifteen topologies and not all are connected:</p>
<p><img src="http://www.cs.rice.edu/~ogilvie/assets/five-taxon-nni-space.png" alt="Five-taxon trees" /></p>
<p>In the above example, each gray line represents an NNI move between
topologies, and there are (made-up) parsimony scores above each topology.
There are two peaks in parsimony score, one for the tree (((A,E),D),(B,C))
where the parsimony score is 1434, and one for the tree (((B,E),D),(A,C))
where the parsimony score is 1435. Since the second peak has a higher (i.e.
worse) parsimony score, it is a local optimum and not the globally optimal
solution.</p>
<p>This illustrates the biggest problem with hill climbing. Because we only
accept changes that improve the score, once we reach a peak where all
connected points in parameter space (unrooted topologies in this case) are
worse, then we can never climb down. Imagine we initialized our hill climbing
using the topology indicated by the black arrow. By chance we could have
followed the red path to the globally optimal solution… or the blue path to
a local optimum.</p>
<p>One straightforward way to address this weak point is to run hill climbing
<strong>multiple times</strong>. The likelihood, unnormalized posterior probability or
parsimony scores of the final accepted states for each hill climb can be
compared, and the best solution out of all runs accepted, in the hope that it
corresponds to the global optimum.</p>
<p>What about NNI for <strong>rooted trees</strong>? It works in a very similar way, but we
have to pretend that there is an “origin” tip <em>above</em> the root node, and
perform the operation on the unrooted equivalent of the rooted tree. Here
I use the example of three taxon rooted trees, and in this example I fix
the origin.</p>
<p><img src="http://www.cs.rice.edu/~ogilvie/assets/rooted-nni.png" alt="Unrooted NNI" /></p>
<p>For three taxon rooted trees, there is one internal branch. In this example,
the “sister” to the origin for the tree on the left is humans, so the NNI
operations exchange humans with either chimps (becoming the tree on the
right), or with mice (becoming the tree on the bottom).</p>
<p>And how do we <strong>initialize</strong> hill climbing in phylogenetics? There are a
few ways.</p>
<ol>
<li>Randomly generate a tree using simulation</li>
<li>Permute the taxon labels on a predefined tree</li>
<li>Use neighbor-joining if the tree is unrooted</li>
<li>Use UPGMA if the tree is rooted</li>
</ol>
<p>The latter two methods have the advantage of starting closer to the optimal
solutions, reducing the time required for a single hill climb. However when
running hill climbing multiple times, the first two methods have the advantage
of making the different runs more independent of each other, and therefore
more likely for one to find the global optimum.</p>

Long branch attraction (in the Felsenstein zone)
2019-12-01

<p>Long branch attraction is the phenomenon where two branches which are in truth
not sisters are inferred to be sister branches when using maximum parsimony
inference. This occurs because, unlike likelihood, parsimony does not take
into account branch lengths when computing the parsimony score.</p>
<p>Maximum likelihood inference considers all sites when calculating the
likelihood, but only so-called “parsimony informative sites” will end up
determining the tree inferred using maximum parsimony. These are sites where
at least two tips share a state, and at least two other tips share a state
which is different from the first state.</p>
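A site pattern can be tested for parsimony informativeness with a small helper (a sketch; the taxon ordering in the example strings is my own):

```python
from collections import Counter

def is_parsimony_informative(site):
    # informative when at least two different states each appear
    # in at least two taxa
    state_counts = Counter(site)
    return sum(1 for count in state_counts.values() if count >= 2) >= 2

# taxon order: human, chimp, rat, mouse
print(is_parsimony_informative("AACC"))  # → True
print(is_parsimony_informative("AAAC"))  # → False
```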
<p>Consider the case of humans, chimps, rats and mice. In truth, humans and
chimps should be sisters, as should rats and mice. The parsimony informative
sites that support the true tree topology will therefore be those where humans
and chimps share a state, and rats and mice share a state which is different
from the human/chimp state (site patterns on the left in the below figure).</p>
<p><img src="http://www.cs.rice.edu/~ogilvie/assets/informative-sites.png" alt="Informative site patterns" /></p>
<p>The score of those sites given the true topology (top-left in the above
figure) will be 1 for equal-cost parsimony. Given one of the two incorrect
unrooted topologies (middle-left and bottom-left), the score of those sites
will be 2, because at least two mutations along the tree are required to
explain the site pattern.</p>
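The scores of 1 and 2 can be reproduced with a tiny implementation of Fitch's small-parsimony algorithm, which I am adding here since it gives the same score as Sankoff under equal costs (the score is also unaffected by where we root the quartet):

```python
def fitch(node):
    # node is a set of states (tip) or a (left, right) pair; returns the
    # Fitch state set and the number of required mutations
    if isinstance(node, set):
        return node, 0
    left, left_score = fitch(node[0])
    right, right_score = fitch(node[1])
    if left & right:
        return left & right, left_score + right_score
    return left | right, left_score + right_score + 1

# site pattern: human = A, chimp = A, rat = C, mouse = C
true_tree  = (({"A"}, {"A"}), ({"C"}, {"C"}))  # ((human,chimp),(rat,mouse))
wrong_tree = (({"A"}, {"C"}), ({"A"}, {"C"}))  # ((human,rat),(chimp,mouse))
print(fitch(true_tree)[1], fitch(wrong_tree)[1])  # → 1 2
```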
<p>For the uninformative sites, e.g. if we give mice a different state from every
other species (site patterns on the right), at least two mutations will be
required for all topologies and the score will always be 2 (see trees on the
right). The contribution of these sites is therefore a constant and does not
affect the inference.</p>
<p>So if the number of parsimony informative site patterns supporting one of the
incorrect topologies is greater than the number of informative site patterns
supporting the true topology, the best scoring topology will be incorrect
and our inferred topology will be wrong.</p>
<p><img src="http://www.cs.rice.edu/~ogilvie/assets/felsenstein-zone.png" alt="Felsenstein zone" /></p>
<p><em>Felsenstein zone trees with branch lengths in substitutions per site</em></p>
<p>How can this be possible? Consider the above-right tree. Because the internal
branch is short, and the chimp and mouse branches are also short, the
probability of mutation along those three branches is minimal. Chimps and mice
are therefore likely to share a state. But because the human and rat branches
are long, the probability of mutation is high.</p>
<p>Given a lack of mutation elsewhere, if a mutation or mutations in the human
and rat branches cause the human and rat states to differ, the site will be
uninformative. But if convergent mutations occur, the resulting site will be
parsimony informative and support the incorrect topology where humans and rats
are sister species (for example, the above site patterns).</p>
<p>These sites will contribute a score of 2 to the true topology and a score of 1
to the human-rat topology when using equal-cost parsimony, the inverse of the
contribution from parsimony informative sites that support the true
human-chimp topology. So if more of the human-rat supporting sites are in a
data set than human-chimp supporting sites, the wrong topology will be
inferred using maximum parsimony.</p>
<p>How likely is this to occur? I simulated sequence alignments for a range of
branch lengths, beginning with the above-left branch lengths, gradually
increasing the human and rat lengths (l1) while decreasing the chimp and mouse
lengths (l2), ending with the above-right branch lengths. The internal branch
length was always 0.1 substitutions per site. Jukes-Cantor was used as the
substitution model, and 1 million sites were simulated per alignment. For each set
of branch lengths I counted the percentage of parsimony informative sites
supporting the correct topology and the percentage supporting the human-rat or
human-mouse topologies.</p>
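A scaled-down version of this simulation can be sketched as follows (my own toy version: fewer sites, and a single arbitrary pair of branch lengths chosen to sit deep in the Felsenstein zone, rather than the full range):

```python
import math
import random

def evolve(state, l):
    # under Jukes-Cantor the total probability of ending in a different
    # state after a branch of length l is 3 * pxy
    if random.random() < 0.75 * (1.0 - math.exp(-4.0 * l / 3.0)):
        return random.choice([s for s in "ACGT" if s != state])
    return state

random.seed(1)
l1, l2, internal = 1.0, 0.05, 0.1  # long, short and internal branch lengths
support = {"human-chimp": 0, "human-rat": 0}
for _ in range(100000):
    a = random.choice("ACGT")  # internal node above human and chimp
    b = evolve(a, internal)    # internal node above rat and mouse
    human, chimp = evolve(a, l1), evolve(a, l2)
    rat, mouse = evolve(b, l1), evolve(b, l2)
    if human == chimp and rat == mouse and human != rat:
        support["human-chimp"] += 1
    elif human == rat and chimp == mouse and human != chimp:
        support["human-rat"] += 1

print(support["human-rat"] > support["human-chimp"])  # → True
```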
<p><img src="http://www.cs.rice.edu/~ogilvie/assets/pi-site-support.png" alt="Parsimony informative site support" /></p>
<p>You can see that when l1 is greater than somewhere between 0.75 and 0.8 or
less than somewhere between 0.3 and 0.35, the number of parsimony informative
sites supporting the human-rat topology becomes greater than the number
supporting the human-rat topology. These crossovers mark the border of the
Felsenstein zone.</p>
<p>For both Dollo and equal rates models of evolution, whether a four-taxon tree
is in the Felsenstein zone can be tested analytically rather than by
simulation. For details, see Felsenstein’s paper, “Cases in which parsimony or
compatibility methods will be positively misleading,” published in Systematic
Zoology (now known as Systematic Biology) in 1978.</p>

Likelihood of a tree
2019-11-27

<p>The likelihood of a tree is the probability of a multiple sequence alignment
or matrix of trait states (commonly known as a character matrix) given a tree
topology, branch lengths and substitution model. An efficient dynamic
programming algorithm to compute this probability was first developed by
<a href="https://doi.org/10.1093/sysbio/22.3.240">Felsenstein in 1973</a>, and is quite similar to the algorithm used to
infer unequal-cost parsimony scores developed by <a href="https://www.jstor.org/stable/2100459">Sankoff in 1975</a>.</p>
<p>As with the Sankoff algorithm, a vector is associated with each node of the
tree. Each element of the vector stores the probability of observing the tip
states, given the tree below the associated node and the state corresponding
to the element (the first, second, third and fourth elements usually
correspond to A, C, G and T for DNA).</p>
<p>Those probabilities marginalize over all possible states at every internal
node below the root of the subtree. These are known as partial likelihoods,
and are in contrast with the vector elements of the Sankoff algorithm, which
are calculated only from the states which minimize the total cost. We might
write the partial likelihood for state <script type="math/tex">k</script> at node <script type="math/tex">n</script> as:</p>
<script type="math/tex; mode=display">P_{n,k} = P(D_i|k, T, l, M)</script>
<p>where <script type="math/tex">D_i</script> is the tip states at position <script type="math/tex">i</script> of the multiple sequence
alignment or character matrix, <script type="math/tex">T</script> is the topology of the subtree under the
node, <script type="math/tex">l</script> is the branch lengths of the subtree, and <script type="math/tex">M</script> is the
substitution model. I will go over the five key differences between the two
algorithms.</p>
<p><strong>One.</strong> For the Sankoff algorithm the elements in the vectors at the tips are
initialized to either zero for the observed states or infinity otherwise,
because only the observed state can be the state at the tips. However
because partial likelihoods are probabilities not costs, for likelihood they
are initialized to 1 for 100% probability (or 0 if working in log space) for
the observed states, and 0 for 0% probability (or negative infinity if
working in log space).</p>
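For example, the tip vector for an observed C would be initialized like this in linear and in log space (a short sketch, using the same A, C, G, T ordering as the rest of the post):

```python
import numpy

alphabet = "ACGT"

# tip partial likelihoods for an observed C, in linear space
linear = numpy.zeros(4)
linear[alphabet.index("C")] = 1.0

# the same vector in log space: log(1) = 0 and log(0) = -infinity
log_space = numpy.full(4, float("-inf"))
log_space[alphabet.index("C")] = 0.0

print(linear.tolist())  # → [0.0, 1.0, 0.0, 0.0]
```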
<p><strong>Two.</strong> Because Felsenstein’s likelihood depends on branch lengths and not
just topology, the transition probabilities must be recomputed for each
branch. For the Jukes-Cantor model just two probabilities are needed because
it assumes equal base frequencies and equal substitution rates. The first is the
probability of state <script type="math/tex">k</script> at the parent node and state <script type="math/tex">k'</script> at the child
node being the same, <strong>conditioned on</strong> <script type="math/tex">k</script>:</p>
<script type="math/tex; mode=display">P(k' = k|k) = P_{xx} = \frac{1}{4}(1 + 3 e^{-\frac{4}{3}\mu t})</script>
<p>where <script type="math/tex">\mu t</script> is the product of the substitution rate and length of the branch
in time, which is the length of the branch in substitutions per site. And the
second is the probability of the state at the child node being different,
again conditioned on the state at the parent node:</p>
<script type="math/tex; mode=display">P(k' \ne k|k) = P_{xy} = \frac{1}{4}(1 - e^{-\frac{4}{3}\mu t})</script>
<p><strong>Three.</strong> Because the partial likelihoods marginalize over the internal node
states, for each child branch the probabilities for all child node states must
be summed over rather than finding the minimum cost. Using Jukes-Cantor, when
calculating the partial likelihood for state <script type="math/tex">k</script> at node <script type="math/tex">n</script>, for the one
case where the state <script type="math/tex">k'</script> at the child node <script type="math/tex">c</script> equals <script type="math/tex">k</script>, the
probability is <script type="math/tex">P_{xx} P_{c,k'}</script>. For the three cases where it does not,
the probabilities are <script type="math/tex">P_{xy}P_{c,k'}</script>. By summing all four probabilities,
we marginalize over the possible states at that child node.</p>
<p><strong>Four.</strong> Cost accumulates, but the joint probability of independent
variables multiplies. So for parsimony the cost of the left and right subtrees
under a node (stored in the vectors associated with the left and right
children) and the cost of the mutations along the left and right child
branches (if any) are all added together. But for likelihood the left and
right marginal probabilities are multiplied. Why are left and right marginal
probabilities independent? Because sequences evolve independently along left
and right subtrees, conditioned on the state at the root.</p>
<p>This also applies when calculating the cost or likelihood of a sequence
alignment or character matrix. For maximum parsimony the cost accumulates for
each additional site, so the parsimony score of an alignment is the sum of
minimum costs for each site. But for maximum likelihood the likelihood of each
site is a probability and we treat each site as evolving independently, so the
likelihood for the alignment is the product of site likelihoods.</p>
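In practice the per-site product is accumulated in log space to avoid numerical underflow (a brief sketch; the site likelihood values here are made up for illustration):

```python
import math

# hypothetical per-site likelihoods for a three-site alignment
site_likelihoods = [0.0252, 0.1007, 0.0031]

# the alignment likelihood is the product of site likelihoods, so the
# log-likelihood is the sum of log site likelihoods
log_likelihood = sum(math.log(p) for p in site_likelihoods)
print(round(log_likelihood, 4))
```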
<p><strong>Five.</strong> For maximum parsimony, the smallest element of the root node vector
gives the parsimony score of the tree. But for Felsenstein’s likelihood, we want to
marginalize over root states, i.e. we want <script type="math/tex">P(D_i|T,l,M)</script> which does not
depend on state <script type="math/tex">k</script> at the root. Given the RNA alphabet
<script type="math/tex">\{A,C,G,U\}</script>, we can perform this marginalization by summing over the joint
probabilities:</p>
<script type="math/tex; mode=display">P(D_i|T,l,M) = P(D_i,k=A|T,l,M) + P(D_i,k=C|T,l,M) + P(D_i,k=G|T,l,M) + P(D_i,k=U|T,l,M)</script>
<p>But the partial likelihoods at the root give us <script type="math/tex">P(D_i|k, T, l, M)</script>, where
state <script type="math/tex">k</script> is on the right side of the conditional. We can use the chain
rule to convert them to joint probabilities:</p>
<script type="math/tex; mode=display">P(D_i,k|T,l,M) = P(D_i|k,T,l,M) \cdot P(k)</script>
<p>but what is <script type="math/tex">P(k)</script>? It is the stationary frequency of the state, which for
Jukes-Cantor is always <script type="math/tex">\frac{1}{4}</script>, so for that substitution model we just
have to sum the partial likelihoods at the root and divide by four to get the
likelihood of the tree.</p>
<p>The following code will calculate the likelihood of a tree (in Newick format)
for a multiple sequence alignment (MSA in FASTA format), with the paths to the
tree and MSA files given as the first and second arguments to the program.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import ete3
import numpy
import os.path
import sys

# used by read_fasta to turn a sequence string into a vector of integers based
# on the supplied alphabet
def vectorize_sequence(sequence, alphabet):
    sequence_length = len(sequence)
    sequence_vector = numpy.zeros(sequence_length, dtype = numpy.uint8)
    for i, char in enumerate(sequence):
        sequence_vector[i] = alphabet.index(char)
    return sequence_vector

# this is a function that reads in a multiple sequence alignment stored in
# FASTA format, and turns it into a matrix
def read_fasta(fasta_path, alphabet):
    label_order = []
    sequence_matrix = numpy.zeros(0, dtype = numpy.uint8)
    fasta_file = open(fasta_path)
    l = fasta_file.readline()
    while l != "":
        l_strip = l.rstrip() # strip out newline characters
        if l[0] == ">":
            label = l_strip[1:]
            label_order.append(label)
        else:
            sequence_vector = vectorize_sequence(l_strip, alphabet)
            sequence_matrix = numpy.concatenate((sequence_matrix, sequence_vector))
        l = fasta_file.readline()
    fasta_file.close()
    n_sequences = len(label_order)
    sequence_length = len(sequence_matrix) // n_sequences
    sequence_matrix = sequence_matrix.reshape(n_sequences, sequence_length)
    return label_order, sequence_matrix

# this is a function that reads in a phylogenetic tree stored in newick
# format, and turns it into an ete3 tree object
def read_newick(newick_path):
    newick_file = open(newick_path)
    newick = newick_file.read().strip()
    newick_file.close()
    tree = ete3.Tree(newick)
    return tree

# compute the partial likelihoods at every node for one site, using
# Felsenstein's pruning algorithm
def recurse_likelihood(node, site_i, n_states):
    if node.is_leaf():
        node.partial_likelihoods.fill(0) # reset the leaf likelihoods
        leaf_state = node.sequence[site_i]
        node.partial_likelihoods[leaf_state] = 1
    else:
        left_child, right_child = node.get_children()
        recurse_likelihood(left_child, site_i, n_states)
        recurse_likelihood(right_child, site_i, n_states)
        for node_state in range(n_states):
            left_partial_likelihood = 0.0
            right_partial_likelihood = 0.0
            for child_state in range(n_states):
                if node_state == child_state:
                    left_partial_likelihood += left_child.pxx * left_child.partial_likelihoods[child_state]
                    right_partial_likelihood += right_child.pxx * right_child.partial_likelihoods[child_state]
                else:
                    left_partial_likelihood += left_child.pxy * left_child.partial_likelihoods[child_state]
                    right_partial_likelihood += right_child.pxy * right_child.partial_likelihoods[child_state]
            node.partial_likelihoods[node_state] = left_partial_likelihood * right_partial_likelihood

# nucleotides, obviously
alphabet = "ACGT" # A = 0, C = 1, G = 2, T = 3
n_states = len(alphabet)

# this script requires a newick tree file and a fasta sequence file, and
# the paths to those two files are given as arguments to this script
tree_path = sys.argv[1]
root_node = read_newick(tree_path)
msa_path = sys.argv[2]
taxa, alignment = read_fasta(msa_path, alphabet)
site_count = len(alignment[0])

# the number of taxa, and the number of nodes in a rooted phylogeny with that
# number of taxa
n_taxa = len(taxa)
n_nodes = n_taxa + n_taxa - 1

for node in root_node.traverse():
    # initialize a vector of partial likelihoods that we can reuse for each site
    node.partial_likelihoods = numpy.zeros(n_states)
    # we can precalculate the pxx and pxy values for the branch associated with
    # this node
    node.pxx = (1 / 4) * (1 + 3 * numpy.exp(-(4 / 3) * node.dist))
    node.pxy = (1 / 4) * (1 - numpy.exp(-(4 / 3) * node.dist))
    # add sequences to leaves
    if node.is_leaf():
        taxon = node.name
        taxon_i = taxa.index(taxon)
        node.sequence = alignment[taxon_i]

# this will be the total log likelihood of all sites
log_likelihood = 0.0
for site_i in range(site_count):
    recurse_likelihood(root_node, site_i, n_states)
    # need to multiply the partial likelihoods by the stationary frequencies,
    # which for Jukes-Cantor are 1/4 for all states
    log_likelihood += numpy.log(numpy.sum(root_node.partial_likelihoods * (1 / 4)))

tree_filename = os.path.split(tree_path)[1]
msa_filename = os.path.split(msa_path)[1]
tree_name = os.path.splitext(tree_filename)[0]
msa_name = os.path.splitext(msa_filename)[0]
print("The log likelihood P(%s|%s) = %f" % (msa_name, tree_name, log_likelihood))
</code></pre></div></div>Huw A. OgilvieThe likelihood of a tree is the probability of a multiple sequence alignment or matrix of trait states (commonly known as a character matrix) given a tree topology, branch lengths and substitution model. An efficient dynamic programming algorithm to compute this probability was first developed by Felsenstein in 1973, and is quite similar to the algorithm used to infer unequal-cost parsimony scores developed by Sankoff in 1975.Equal-cost parsimony2019-11-26T16:00:00+11:002019-11-26T16:00:00+11:00http://www.cs.rice.edu/~ogilvie/comp571/2019/11/26/equal-cost-parsimony<p>The principle behind maximum parsimony based inference is to explain the data
using the smallest cost. In its most basic form, all events are given equal
cost, so a nucleotide changing from A to C (a transversion) is given the same
cost as a change from C to T (a transition). Likewise the gain of a trait,
e.g. flight, is given the same cost as the loss of that trait. In this case
finding the explanation with the smallest cost is the same as finding the
explanation with the smallest number of events. In a phylogenetic context, the
explanation is the tree topology, and the events are mutations of molecular
sequences or organismal traits.</p>
<p>Equal cost parsimony can be solved using a simple procedure called the Fitch
algorithm (<a href="https://doi.org/10.1093/sysbio/20.4.406">Fitch, 1971</a>). The output of this algorithm is the smallest
number of events required to explain the pattern of one site or trait for a
given tree topology.</p>
<p>As an example, let’s consider a genomic position homologous between apes and
rodents. At this position the nucleotide observed for humans and chimps is
adenine (A), for gorillas and mice it is cytosine (C), and for rats it is
guanine (G). We will compute the parsimony score for a given tree topology, in
this case one that treats humans and chimps as sisters, and also mice and rats
as sisters.</p>
<p><img src="http://www.cs.rice.edu/~ogilvie/assets/fitch-0.png" alt="Topology and site pattern" /></p>
<p>Like other dynamic programming algorithms for phylogenetic inference, we need
initialize the values at each tip. For the Fitch algorithm, there are two
different kinds of values at each node;</p>
<ol>
<li>a set of most parsimonious states given the site pattern and topology <strong>under that node</strong></li>
<li>the minimum number of changes required to explain the site pattern given the topology <strong>under that node</strong></li>
</ol>
<p>For the tip nodes, each set has a single element corresponding to the observed state,
and the minimum number of changes is always zero.</p>
<p><img src="http://www.cs.rice.edu/~ogilvie/assets/fitch-1.png" alt="Initial states" /></p>
<p>Then we need to recurse through the internal nodes of the tree, always
visiting children before parents. The most straightforward way to accomplish
this is <a href="https://opendsa-server.cs.vt.edu/ODSA/Books/Everything/html/BinaryTreeTraversal.html">postorder traversal</a>. However, for this example we will use
levelorder traversal, visiting the lowest level of nodes first, then the next
lowest, until we get to the root.</p>
<p>For each node we first calculate the intersection of the sets of most
parsimonious states from the node’s children. For humans and chimps the
intersection contains a single state “A”, but for rodents the intersection is
empty.</p>
<p>When the intersection is non-empty, we add all elements of the intersection to
the set of most parsimonious states for a given node. A non-empty intersection
also means that no changes are required along either branch leading to the
children, as at least one most parsimonious state is present in all three sets
(parent and two children).</p>
<p>Since no changes are required, we calculate the parsimony score for that node
(the minimum number of required changes) by simply adding the parsimony scores
of the two children. In the case of humans and chimps, the intersection
is {“A”} and the sum of parsimony scores is 0.</p>
<p>When the intersection is empty, we add all elements of the <em>union</em> to the set
of most parsimonious states. For each state in the union, it will either be
present in the parent and left child sets, or the parent and right child sets.
In both cases we need at least one mutation to explain the pattern, but the
mutation will be on the left or right branch respectively. So the parsimony
score will be the sum of scores of the children, <em>plus one</em>. In the case of
rodents, the union is {C, G} and the parsimony score will be 0 + 0 + 1 = 1.</p>
<p><img src="http://www.cs.rice.edu/~ogilvie/assets/fitch-2.png" alt="Level 1" /></p>
<p>For the ancestor of humans, chimps and gorillas (Homininae), the intersection
of the human and chimp set on the left {A} and the gorilla set {C} is empty,
so we use the union {A, C}. Since the intersection was empty, the parsimony
score will be the sum of child scores plus one.</p>
<p><img src="http://www.cs.rice.edu/~ogilvie/assets/fitch-3.png" alt="Level 2" /></p>
<p>Finally at the root, the intersection of the ape set {A, C} and the rodent set
{C, G} is nonempty, as C is present in both. So the most parsimonious state at
the root will be C, and since this state is present in all three sets, we do
not need to invoke changes and only need to sum the child scores. For this
example this sum is 1 + 1 = 2.</p>
<p><img src="http://www.cs.rice.edu/~ogilvie/assets/fitch-4.png" alt="Root" /></p>
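<p>The whole procedure can be sketched in a few lines of Python. This is a
minimal sketch, not production code, and it assumes the tree is encoded as
nested (left, right) tuples with the observed states at the tips (the function
name <code>fitch</code> and the tuple encoding are ours):</p>

```python
# Minimal sketch of the Fitch algorithm for a single site, assuming the tree
# is encoded as nested (left, right) tuples with observed states at the tips.
def fitch(node):
    """Return (set of most parsimonious states, parsimony score) for a subtree."""
    if isinstance(node, str):
        return {node}, 0  # tip: the observed state, and zero changes
    left_states, left_score = fitch(node[0])
    right_states, right_score = fitch(node[1])
    common = left_states & right_states
    if common:  # non-empty intersection: no change needed on either child branch
        return common, left_score + right_score
    # empty intersection: take the union and count one additional change
    return left_states | right_states, left_score + right_score + 1

# (((human, chimp), gorilla), (mouse, rat)) with the site pattern from the text
tree = ((("A", "A"), "C"), ("C", "G"))
print(fitch(tree))  # ({'C'}, 2)
```

<p>Running this on the example tree recovers the root set {C} and the
parsimony score of 2 derived above.</p>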
<p>Equal cost parsimony will derive the same score for any rooted tree with the
same unrooted topology. In other words, neither the rooting nor the branch
lengths affect the score in any way (at least in terms of inference). Given
five taxa as in the above example, there are fifteen possible unrooted topologies:</p>
<p><img src="http://www.cs.rice.edu/~ogilvie/assets/topologies.png" alt="Root" /></p>
<p>I have given the parsimony score for each topology given the site pattern. In
this case there are five maximum parsimony solutions, and we cannot
distinguish between them. Luckily one of these is the “true in real life” tree
topology for these organisms (left middle).</p>
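<p>The count of fifteen above is an instance of the double factorial formula
for unrooted binary topologies, (2<em>n</em> − 5)!! for <em>n</em> taxa, which
a quick sketch can confirm (the function name is ours):</p>

```python
from math import prod

# Number of unrooted binary tree topologies for n >= 3 labeled taxa:
# (2n - 5)!! = 1 * 3 * 5 * ... * (2n - 5)
def n_unrooted_topologies(n):
    return prod(range(1, 2 * n - 4, 2))

print(n_unrooted_topologies(5))  # 15
```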
<p>The parsimony score of a multiple sequence alignment, or the character matrix
of a set of traits, is the sum of parsimony scores for all sites in the
alignment or all traits. By sampling enough sites and/or traits we should be
able to identify a single optimal tree from its parsimony score.</p>Huw A. OgilvieThe principle behind maximum parsimony based inference is to explain the data using the smallest cost. In its most basic form, all events are given equal cost, so a nucleotide changing from A to C (a transversion) is given the same cost as a change from C to T (a transition). Likewise the gain of a trait, e.g. flight, is given the same cost as the loss of that trait. In this case finding the explanation with the smallest cost is the same as finding the explanation with the smallest number of events. In a phylogenetic context, the explanation is the tree topology, and the events are mutations of molecular sequences or organismal traits.Dollo’s law and unequal-cost parsimony2019-11-26T16:00:00+11:002019-11-26T16:00:00+11:00http://www.cs.rice.edu/~ogilvie/comp571/2019/11/26/unequal-cost-parsimony<p>Certain mutations are more surprising than others. DNA is composed of a string
of nucleotides, which are either pyrimidines (cytosine or thymine) or purines
(adenine or guanine). A single point mutation to DNA is either a <em>transition</em>
from one pyrimidine to another or one purine to another, or a <em>transversion</em>
from a purine to a pyrimidine or <em>vice versa</em>. Transitions are biochemically
easier than transversions, and hence much more commonly occurring in the
evolution of genomes.</p>
<p><img src="http://www.cs.rice.edu/~ogilvie/assets/purines-pyrimadines.png" alt="Purines and pyrimidines" /></p>
<p>Image from Wikipedia user Zephyris</p>
<p>This principle also applies to traits. Dollo’s law states that complex
characters, once lost from a lineage, are unlikely to be regained
(<a href="https://doi.org/10.1002/jez.b.22642">Wright <em>et al</em>. 2015</a>, <a href="https://paleoglot.org/files/Dollo_93.pdf">Dollo 1893</a>). For example, the evolution of
flight in bats required the evolution of multiple components like wing
membranes, a novel complex of muscles and low-mass bones
(<a href="https://doi.org/10.1002/wdev.50">Cooper <em>et al</em>. 2010</a>). Once any one of those components is lost, the
others are likely to be lost too. Because regaining the trait would require so
many components to be regained, it is unlikely. Therefore we should be more
surprised by a transition from flightlessness to flightedness than the
reverse.</p>
<p><img src="http://www.cs.rice.edu/~ogilvie/assets/bat-wing.jpg" alt="Bat wing skeleton" /></p>
<p>Figure 1 from <a href="https://doi.org/10.1002/wdev.50">Cooper et al. (2010)</a> showing the thin elongated metacarpals
and phalanges of Seba’s short‐tailed bat.</p>
<p>Equal-cost parsimony, for example when using the Fitch algorithm, does not
account for this kind of difference in expectations. However unequal-cost
parsimony uses a cost matrix to assign different costs to different
transitions. For the DNA evolution example, it might look something like
this:</p>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: right">A</th>
<th style="text-align: right">C</th>
<th style="text-align: right">G</th>
<th style="text-align: right">T</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">A</td>
<td style="text-align: right">0</td>
<td style="text-align: right">5</td>
<td style="text-align: right">1</td>
<td style="text-align: right">5</td>
</tr>
<tr>
<td style="text-align: right">C</td>
<td style="text-align: right">5</td>
<td style="text-align: right">0</td>
<td style="text-align: right">5</td>
<td style="text-align: right">1</td>
</tr>
<tr>
<td style="text-align: right">G</td>
<td style="text-align: right">1</td>
<td style="text-align: right">5</td>
<td style="text-align: right">0</td>
<td style="text-align: right">5</td>
</tr>
<tr>
<td style="text-align: right">T</td>
<td style="text-align: right">5</td>
<td style="text-align: right">1</td>
<td style="text-align: right">5</td>
<td style="text-align: right">0</td>
</tr>
</tbody>
</table>
<p>This cost matrix penalizes a transversion five times more than it penalizes a
transition. For the trait evolution example, it might look something like this:</p>
<table>
<thead>
<tr>
<th style="text-align: right">.</th>
<th style="text-align: right">+</th>
<th style="text-align: right">-</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">+</td>
<td style="text-align: right">0</td>
<td style="text-align: right">1</td>
</tr>
<tr>
<td style="text-align: right">-</td>
<td style="text-align: right">Infinity</td>
<td style="text-align: right">0</td>
</tr>
</tbody>
</table>
<p>In the above matrix, plus is used to indicate the presence of a trait (e.g.
flight) and a minus indicates the absence. This kind of matrix is known as a
Dollo model, where only forward transitions (from + to -, i.e. losing the
trait) are allowed, and reverse transitions are prohibited. Using this model
implies that the trait <em>must</em> have been present in the most recent common
ancestor (MRCA) of all species in the tree, so it will be inappropriate to use
when the trait was absent from the MRCA.</p>
<p>The <a href="https://www.jstor.org/stable/2100459">Sankoff algorithm</a> uses dynamic programming to efficiently calculate
the parsimony score for a given tree topology and cost matrix. Let’s use
the DNA cost matrix above to demonstrate it.</p>
<p>A vector is associated with every node of the tree. The size of the vector is
the size of the alphabet for a character, so 2 for a binary trait like flight,
4 for DNA or 20 for proteins. Each element of the vector corresponds to one of
the possible states for that character. Each element of the vector stores the
parsimony score for the tree topology under a node, given the state at that
node corresponding to the element, and the known tip states.</p>
<p>To initialize the tip node vectors, set the cost for the elements
corresponding to known tip states to zero. The other states are known not to
be true, so they should never be considered. This can be achieved by
setting their cost to infinity, represented here by dots.</p>
<p><img src="http://www.cs.rice.edu/~ogilvie/assets/sankoff-1.png" alt="Sankoff" /></p>
<p>For each element of each internal node, we have to consider the cost of each
possible transition for each child branch. The parsimony score for the element
is the minimum possible cost for the left branch, plus the minimum possible
cost for the right branch. The cost for each possible transition is the
corresponding value from the cost matrix, plus the score in the corresponding
child element.</p>
<p>Consider the MRCA of humans and chimps. For state A, the cost of transitioning
to A in humans will be 0 + 0 = 0, to C will be 5 + ∞ = ∞, to G will be 1 + ∞ =
∞, and to T will be 5 + ∞ = ∞. The minimum for the left branch is therefore 0.
Since chimps have the same state as humans in this
example, the cost will be the same, and the sum of minimum costs will be 0.</p>
<p>Repeat for C, G and T.</p>
<p><img src="http://www.cs.rice.edu/~ogilvie/assets/sankoff-2.png" alt="Sankoff" /></p>
<p>Now consider the MRCA of humans, chimps and gorillas. For state A, the cost of
transitioning to A in the human/chimp MRCA will be 0 + 0 = 0, to C will be 5 +
10 = 15, to G will be 1 + 2 = 3, and to T will be 5 + 10 = 15. So the minimum
along the left branch is 0. The cost of transitioning from A to C in gorillas
will be 5 + 0 = 5, and from A to other gorilla states will be ∞. Therefore the
minimum cost along the right branch is 5, and the parsimony score for state A
is 0 + 5 = 5.</p>
<p><img src="http://www.cs.rice.edu/~ogilvie/assets/sankoff-3.png" alt="Sankoff" /></p>
<p>Repeat the above for the remaining nodes. Here we are walking the tree in
postorder, but as with the Fitch algorithm, levelorder would work too.</p>
<p><img src="http://www.cs.rice.edu/~ogilvie/assets/sankoff-4.png" alt="Sankoff" /></p>
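<p>The recursion can be sketched in Python using the DNA cost matrix above.
This is a minimal sketch that assumes the tree is encoded as nested
(left, right) tuples with observed states at the tips (the function name
<code>sankoff</code> is ours):</p>

```python
import numpy as np

INF = float("inf")
ALPHABET = "ACGT"
# the transition/transversion cost matrix from the text
COST = np.array([[0, 5, 1, 5],
                 [5, 0, 5, 1],
                 [1, 5, 0, 5],
                 [5, 1, 5, 0]], dtype=float)

# Minimal sketch of the Sankoff algorithm for a single site, assuming the
# tree is encoded as nested (left, right) tuples with states at the tips.
def sankoff(node):
    """Return the vector of parsimony scores for each possible state at node."""
    if isinstance(node, str):  # tip: zero for the observed state, infinity otherwise
        costs = np.full(len(ALPHABET), INF)
        costs[ALPHABET.index(node)] = 0.0
        return costs
    left, right = sankoff(node[0]), sankoff(node[1])
    # for each parent state, add the cheapest transition into each child subtree
    return np.array([(COST[i] + left).min() + (COST[i] + right).min()
                     for i in range(len(ALPHABET))])

# (((human, chimp), gorilla), (mouse, rat)) with the site pattern from the text
tree = ((("A", "A"), "C"), ("C", "G"))
root_costs = sankoff(tree)
print(root_costs, root_costs.min())  # minimum root score is 10
```

<p>The minimum over the root vector reproduces the parsimony score of 10
derived for this tree and site pattern.</p>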
<p>Finally, the parsimony score for the entire tree is the minimum score out of
the root states - for this tree and site pattern, 10. As with equal-cost
parsimony, the score for an entire multiple sequence alignment or character
matrix is the sum of parsimony scores for each position or for each character
respectively.</p>Huw A. OgilvieCertain mutations are more surprising than others. DNA is composed of a string of nucleotides, which are either pyrimidines (cytosine or thymine) or purines (adenine or guanine). A single point mutation to DNA is either a transition from one pyrimidine to another or one purine to another, or a transversion from a purine to a pyrimidine or vice versa. Transitions are biochemically easier than transversions, and hence much more commonly occurring in the evolution of genomes.Backward algorithm2019-10-13T16:00:00+11:002019-10-13T16:00:00+11:00http://www.cs.rice.edu/~ogilvie/comp571/2019/10/13/backward-algorithm<p>Like the forward algorithm, we can use the backward algorithm to calculate
the marginal likelihood of a hidden Markov model (HMM). Also like the forward
algorithm, the backward algorithm is an instance of dynamic programming where
the intermediate values are probabilities.</p>
<p>Recall the forward matrix values can be specified as:</p>
<p>f<sub><em>i</em>,<em>k</em></sub> = P(x<sub>1..<em>i</em></sub>,π<sub><em>i</em></sub>=k|M)</p>
<p>That is, the forward matrix contains joint probabilities for the sequence
up to the <em>i</em><sup>th</sup> position, and the state at that position being <em>k</em>.
These joint probabilities are not conditional on the previous states, instead
they are marginalizing over the hidden state path leading up to <em>i</em>,<em>k</em>.</p>
<p>In contrast, the backward matrix contains probabilities for the sequence
<em>after</em> the <em>i</em><sup>th</sup> position, and these probabilities are conditional
on the state being <em>k</em> at <em>i</em>:</p>
<p>b<sub><em>i</em>,<em>k</em></sub> = P(x<sub><em>i</em>+1..<em>n</em></sub>|π<sub><em>i</em></sub>=k,M)</p>
<p>To demonstrate the backward algorithm, we will use the same example sequence
and HMM as for the Viterbi and forward algorithm demonstrations:</p>
<p><img src="http://www.cs.rice.edu/~ogilvie/assets/cpg-island-hmm-log.png" alt="CG rich island HMM" /></p>
<p>The backward matrix probabilities are marginalized over the hidden state path
after <em>i</em>. To calculate them, initialize a backward matrix <em>b</em> of the same
dimensions as the corresponding forward matrix. We will work in log space, so
use negative infinity for the start state other than at the start position <em>i</em>
= 0, and for the non-start states at the start position. The probability of an
empty sequence after the last position is 100% regardless of the state at the
last position, so fill in zeros for the non-start states at the last column
<em>i</em> = <em>n</em>:</p>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: right">{}</th>
<th style="text-align: right">G</th>
<th style="text-align: right">G</th>
<th style="text-align: right">C</th>
<th style="text-align: right">A</th>
<th style="text-align: right">C</th>
<th style="text-align: right">T</th>
<th style="text-align: right">G</th>
<th style="text-align: right">A</th>
<th style="text-align: right">A</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">start</td>
<td style="text-align: right"> </td>
<td style="text-align: right">-∞</td>
<td style="text-align: right">-∞</td>
<td style="text-align: right">-∞</td>
<td style="text-align: right">-∞</td>
<td style="text-align: right">-∞</td>
<td style="text-align: right">-∞</td>
<td style="text-align: right">-∞</td>
<td style="text-align: right">-∞</td>
<td style="text-align: right">-∞</td>
</tr>
<tr>
<td style="text-align: right">CG rich</td>
<td style="text-align: right">-∞</td>
<td style="text-align: right"> </td>
<td style="text-align: right"> </td>
<td style="text-align: right"> </td>
<td style="text-align: right"> </td>
<td style="text-align: right"> </td>
<td style="text-align: right"> </td>
<td style="text-align: right"> </td>
<td style="text-align: right"> </td>
<td style="text-align: right">0.0</td>
</tr>
<tr>
<td style="text-align: right">CG poor</td>
<td style="text-align: right">-∞</td>
<td style="text-align: right"> </td>
<td style="text-align: right"> </td>
<td style="text-align: right"> </td>
<td style="text-align: right"> </td>
<td style="text-align: right"> </td>
<td style="text-align: right"> </td>
<td style="text-align: right"> </td>
<td style="text-align: right"> </td>
<td style="text-align: right">0.0</td>
</tr>
</tbody>
</table>
<p>To calculate the backward probabilities for a given non-start hidden state <em>k</em> at the
second-to-last position <em>i</em> = <em>n</em> - 1 through to the position of the first character <em>i</em> = 1,
gather the following log probabilities for each non-start hidden state <em>k’</em> at position <em>i</em> + 1:</p>
<ol>
<li>the emission probability e<sub><em>i</em>+1,<em>k’</em></sub> of the observed state (character) at <em>i</em> + 1 given <em>k’</em></li>
<li>the hidden state transition probability t<sub><em>k</em>,<em>k’</em></sub> from state <em>k</em> at <em>i</em> to state <em>k’</em> at <em>i</em> + 1</li>
<li>the probability <em>b</em><sub><em>i</em>+1,<em>k’</em></sub> of the sequence after <em>i</em> + 1 given state <em>k’</em> at <em>i</em> + 1</li>
</ol>
<p>The sum of the above log probabilities gives us the probability for the
character and hidden state at <em>i</em> + 1, given a particular state at <em>i</em>.
Beginning at the second last position <em>i</em> = <em>n</em> - 1, moving from the CG rich state to the CG rich
state, the log probabilities are e<sub><em>i</em>+1,<em>k’</em></sub> = -2, t<sub><em>k</em>,<em>k’</em></sub> = -0.5 and <em>b</em><sub><em>i</em>+1,<em>k’</em></sub> = 0 respectively, and their sum is
-2.5. The first two are from the HMM, and the last is from the
backward matrix. When moving from the CG rich state to the CG poor state, they
are -1, -1 and 0 respectively and the sum is -2.</p>
<p>Finally, we can calculate <em>b</em><sub><em>i</em>,<em>k</em></sub> by marginalizing over both possible
transitions. For the CG rich hidden state at the second-to-last position:</p>
<p><em>b</em><sub><em>n</em>-1,<em>CG rich</em></sub> = log(P(x<sub><em>n</em></sub>|π<sub><em>n</em>-1</sub>=CG rich,M)) = log(e<sup>-2.5</sup> + e<sup>-2</sup>) = -1.5</p>
<p>Likewise for the CG poor state at position <em>n</em> - 1, the marginal log probability is:</p>
<p><em>b</em><sub><em>n</em>-1,<em>CG poor</em></sub> = log(P(x<sub><em>n</em></sub>|π<sub><em>n</em>-1</sub>=CG poor,M)) = log(e<sup>-3</sup> + e<sup>-1.5</sup>) = -1.3</p>
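<p>Each of these marginalizations is a log-sum-exp operation, so the two
values above can be checked directly (a quick sketch using NumPy):</p>

```python
import numpy as np

# marginalizing over the two transitions out of each state at position n - 1
b_rich = np.logaddexp(-2.5, -2.0)  # log(e^-2.5 + e^-2)
b_poor = np.logaddexp(-3.0, -1.5)  # log(e^-3 + e^-1.5)
print(round(float(b_rich), 1), round(float(b_poor), 1))  # -1.5 -1.3
```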
<p>We can now update the matrix:</p>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: right">{}</th>
<th style="text-align: right">G</th>
<th style="text-align: right">G</th>
<th style="text-align: right">C</th>
<th style="text-align: right">A</th>
<th style="text-align: right">C</th>
<th style="text-align: right">T</th>
<th style="text-align: right">G</th>
<th style="text-align: right">A</th>
<th style="text-align: right">A</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">start</td>
<td style="text-align: right"> </td>
<td style="text-align: right">-∞</td>
<td style="text-align: right">-∞</td>
<td style="text-align: right">-∞</td>
<td style="text-align: right">-∞</td>
<td style="text-align: right">-∞</td>
<td style="text-align: right">-∞</td>
<td style="text-align: right">-∞</td>
<td style="text-align: right">-∞</td>
<td style="text-align: right">-∞</td>
</tr>
<tr>
<td style="text-align: right">CG rich</td>
<td style="text-align: right">-∞</td>
<td style="text-align: right">-11.2</td>
<td style="text-align: right">-9.9</td>
<td style="text-align: right">-8.6</td>
<td style="text-align: right">-7.0</td>
<td style="text-align: right">-5.7</td>
<td style="text-align: right">-4.1</td>
<td style="text-align: right">-2.9</td>
<td style="text-align: right">-1.5</td>
<td style="text-align: right">0.0</td>
</tr>
<tr>
<td style="text-align: right">CG poor</td>
<td style="text-align: right">-∞</td>
<td style="text-align: right">-11.5</td>
<td style="text-align: right">-10.1</td>
<td style="text-align: right">-8.5</td>
<td style="text-align: right">-7.2</td>
<td style="text-align: right">-5.6</td>
<td style="text-align: right">-4.3</td>
<td style="text-align: right">-2.6</td>
<td style="text-align: right">-1.3</td>
<td style="text-align: right">0.0</td>
</tr>
</tbody>
</table>
<p>At the start position <em>i</em> = 0, the only valid hidden state is the start state.
Therefore at that position we only need to calculate the probability of going
from the start state to the CG rich or CG poor states. For moving to the CG rich state,
the log probabilities are e<sub><em>i</em>+1,<em>k’</em></sub> = -1.0, t<sub><em>k</em>,<em>k’</em></sub> = -0.7 and <em>b</em><sub><em>i</em>+1,<em>k’</em></sub> = -11.2.
For moving to the CG poor state, they are -2.0, -0.7 and -11.5 respectively.
The sums are -12.9 and -14.2 respectively, and the log sum of exponentials
is -12.7. We will use this to complete the backward matrix:</p>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: right">{}</th>
<th style="text-align: right">G</th>
<th style="text-align: right">G</th>
<th style="text-align: right">C</th>
<th style="text-align: right">A</th>
<th style="text-align: right">C</th>
<th style="text-align: right">T</th>
<th style="text-align: right">G</th>
<th style="text-align: right">A</th>
<th style="text-align: right">A</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">start</td>
<td style="text-align: right">-12.7</td>
<td style="text-align: right">-∞</td>
<td style="text-align: right">-∞</td>
<td style="text-align: right">-∞</td>
<td style="text-align: right">-∞</td>
<td style="text-align: right">-∞</td>
<td style="text-align: right">-∞</td>
<td style="text-align: right">-∞</td>
<td style="text-align: right">-∞</td>
<td style="text-align: right">-∞</td>
</tr>
<tr>
<td style="text-align: right">CG rich</td>
<td style="text-align: right">-∞</td>
<td style="text-align: right">-11.2</td>
<td style="text-align: right">-9.9</td>
<td style="text-align: right">-8.6</td>
<td style="text-align: right">-7.0</td>
<td style="text-align: right">-5.7</td>
<td style="text-align: right">-4.1</td>
<td style="text-align: right">-2.9</td>
<td style="text-align: right">-1.5</td>
<td style="text-align: right">0.0</td>
</tr>
<tr>
<td style="text-align: right">CG poor</td>
<td style="text-align: right">-∞</td>
<td style="text-align: right">-11.5</td>
<td style="text-align: right">-10.1</td>
<td style="text-align: right">-8.5</td>
<td style="text-align: right">-7.2</td>
<td style="text-align: right">-5.6</td>
<td style="text-align: right">-4.3</td>
<td style="text-align: right">-2.6</td>
<td style="text-align: right">-1.3</td>
<td style="text-align: right">0.0</td>
</tr>
</tbody>
</table>
<p>Because the only valid hidden state for the start position is the start state,
the probability P(x<sub><em>i</em>+1..<em>n</em></sub>|π<sub><em>i</em></sub>=k,M) can be simplified to
P(x<sub><em>i</em>+1..<em>n</em></sub>|M). Because the sequence after the start position is the
entire sequence, it can be further simplified to P(x|M). In other words, this
probability is our marginal likelihood! While this is slightly different from
the marginal likelihood of -12.6 derived using the forward algorithm, that is
a rounding error caused by our limited precision of one decimal place.</p>
<p>Why do we need two dynamic programming algorithms to compute the marginal
likelihood? We don’t! But by combining probabilities from the two matrices, we
can derive the posterior probability of each hidden state <em>k</em> at each position
<em>i</em>, marginalized over all paths through <em>k</em> at <em>i</em>. How does this work? Let’s
use Bayes’ rule to demonstrate:</p>
<p>P(π<sub><em>i</em></sub>=<em>k</em>|x,M) = P(x|π<sub><em>i</em></sub>=<em>k</em>,M) × P(π<sub><em>i</em></sub>=<em>k</em>|M) / P(x|M)</p>
<p>If two variables <em>a</em> and <em>b</em> are independent, their joint probability
P(<em>a</em>,<em>b</em>) is simply the product of their probabilities P(<em>a</em>) × P(<em>b</em>).
Normally the two segments of the sequence x<sub>1..<em>i</em></sub> and
x<sub><em>i</em>+1..<em>n</em></sub> are not independent because we are using a hidden Markov
model. Under our model, the distribution of characters at a given site is
dependent on the hidden state at that site, which in turn is dependent on the
hidden state at the previous site.</p>
<p>But by conditioning on the hidden state at a given site <em>i</em>, the sequence
after that site x<sub><em>i</em>+1..<em>n</em></sub> is independent of the sequence up to and
including <em>i</em>. This is because the hidden state at <em>i</em> is fixed rather than depending
on the previous hidden state, or the observed character at <em>i</em>. In other
words, while P(x<sub>1..<em>i</em></sub>|M) and P(x<sub><em>i</em>+1..<em>n</em></sub>|M) are not
independent, P(x<sub>1..<em>i</em></sub>|π<sub><em>i</em></sub>=<em>k</em>,M) and
P(x<sub><em>i</em>+1..<em>n</em></sub>|π<sub><em>i</em></sub>=<em>k</em>,M) are! Therefore:</p>
<p>P(π<sub><em>i</em></sub>=<em>k</em>|x,M) = P(x<sub>1..<em>i</em></sub>|π<sub><em>i</em></sub>=<em>k</em>,M) × P(x<sub><em>i</em>+1..<em>n</em></sub>|π<sub><em>i</em></sub>=<em>k</em>,M) × P(π<sub><em>i</em></sub>=<em>k</em>|M) / P(x|M)</p>
<p>By applying the <a href="https://en.wikipedia.org/wiki/Chain_rule_(probability)">chain rule</a>, we can take the third term of the expression
on the right side of our equation, and fold it into the first term of that expression.
This changes the conditional probability to a joint probability:</p>
<p>P(π<sub><em>i</em></sub>=<em>k</em>|x,M) = P(x<sub>1..<em>i</em></sub>,π<sub><em>i</em></sub>=<em>k</em>|M) × P(x<sub><em>i</em>+1..<em>n</em></sub>|π<sub><em>i</em></sub>=<em>k</em>,M) / P(x|M)</p>
<p>On the right side of the equation, the first term now corresponds to <em>f</em><sub><em>i</em>,<em>k</em></sub>,
the second term to <em>b</em><sub><em>i</em>,<em>k</em></sub>, and the third to <em>b</em><sub><em>0</em>,<em>start</em></sub>. This
makes it possible to replace every term on the right side expression with matrix coordinates:</p>
<p>P(π<sub><em>i</em></sub>=<em>k</em>|x,M) = <em>f</em><sub><em>i</em>,<em>k</em></sub> × <em>b</em><sub><em>i</em>,<em>k</em></sub> / <em>b</em><sub><em>0</em>,<em>start</em></sub></p>
<p>And now we can now “decode” our posterior distribution of hidden states. We need to refer
back to the previously calculated forward matrix, shown below.</p>
<table>
<thead>
<tr>
<th style="text-align: right"> </th>
<th style="text-align: right">{}</th>
<th style="text-align: right">G</th>
<th style="text-align: right">G</th>
<th style="text-align: right">C</th>
<th style="text-align: right">A</th>
<th style="text-align: right">C</th>
<th style="text-align: right">T</th>
<th style="text-align: right">G</th>
<th style="text-align: right">A</th>
<th style="text-align: right">A</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">start</td>
<td style="text-align: right">0.0</td>
<td style="text-align: right">-∞</td>
<td style="text-align: right">-∞</td>
<td style="text-align: right">-∞</td>
<td style="text-align: right">-∞</td>
<td style="text-align: right">-∞</td>
<td style="text-align: right">-∞</td>
<td style="text-align: right">-∞</td>
<td style="text-align: right">-∞</td>
<td style="text-align: right">-∞</td>
</tr>
<tr>
<td style="text-align: right">CG rich</td>
<td style="text-align: right">-∞</td>
<td style="text-align: right">-1.7</td>
<td style="text-align: right">-3.0</td>
<td style="text-align: right">-4.3</td>
<td style="text-align: right">-6.6</td>
<td style="text-align: right">-7.3</td>
<td style="text-align: right">-9.6</td>
<td style="text-align: right">-10.2</td>
<td style="text-align: right">-12.5</td>
<td style="text-align: right">-14.1</td>
</tr>
<tr>
<td style="text-align: right">CG poor</td>
<td style="text-align: right">-∞</td>
<td style="text-align: right">-2.7</td>
<td style="text-align: right">-4.2</td>
<td style="text-align: right">-5.6</td>
<td style="text-align: right">-5.9</td>
<td style="text-align: right">-8.1</td>
<td style="text-align: right">-8.7</td>
<td style="text-align: right">-11.0</td>
<td style="text-align: right">-11.6</td>
<td style="text-align: right">-12.9</td>
</tr>
</tbody>
</table>
<p>As an example, let’s solve the posterior probability that the hidden state of the fourth character
is CG rich:</p>
<p>P(π<sub><em>4</em></sub>=<em>CG rich</em>|x,M) = <em>f</em><sub><em>4</em>,<em>CG rich</em></sub> × <em>b</em><sub><em>4</em>,<em>CG rich</em></sub> / <em>b</em><sub><em>0</em>,<em>start</em></sub> = e<sup>-6.6</sup> × e<sup>-7.0</sup> / e<sup>-12.7</sup> = 41%</p>
<p>Since we only have two states, given a 41% posterior probability of the CG rich
state, the probability of the CG poor state should be 59%, but the rounding
errors caused by our lack of precision are causing serious problems:</p>
<p>P(π<sub><em>4</em></sub>=<em>CG poor</em>|x,M) = <em>f</em><sub><em>4</em>,<em>CG poor</em></sub> × <em>b</em><sub><em>4</em>,<em>CG poor</em></sub> / <em>b</em><sub><em>0</em>,<em>start</em></sub> = e<sup>-5.9</sup> × e<sup>-7.2</sup> / e<sup>-12.7</sup> = 67%</p>
<p>Using the above method at full precision, the posterior
probabilities are 37% and 63% respectively.
The posterior probabilities can be shown as a graph in order to clearly
communicate your results:</p>
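<p>One way to sidestep those rounding problems is to normalize the forward–backward products directly, rather than dividing by a separately rounded marginal likelihood. The sketch below does this in log space using the rounded fourth-column values from the matrices in this post:</p>

```python
import numpy as np

# Rounded log forward and backward values for the fourth character,
# taken from this post's matrices: order is [CG rich, CG poor].
log_f = np.array([-6.6, -5.9])
log_b = np.array([-7.0, -7.2])

# Posterior is proportional to f * b; subtract the max before
# exponentiating so very small probabilities do not underflow.
log_fb = log_f + log_b
posterior = np.exp(log_fb - log_fb.max())
posterior /= posterior.sum()

print(posterior.round(2))  # approximately [0.38, 0.62] with these rounded inputs
```

<p>Because the two probabilities are normalized against each other, they are guaranteed to sum to 100%, and with these one-decimal inputs they already land close to the full-precision answer of 37% and 63%.</p>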
<p><img src="http://www.cs.rice.edu/~ogilvie/assets/cpg-posterior.png" alt="CG rich island HMM" /></p>
<p>This gives us a result that reflects the uncertainty of our inference given
the limited data at hand. In my opinion, this presentation is more honest than
the black-and-white maximum <em>a posteriori</em> result derived using Viterbi’s
algorithm.</p>
<p>For another perspective on HMMs, including the Viterbi and Forward-Backward
algorithms, consult Chapter 10 of <a href="http://bioinformaticsalgorithms.com/">Bioinformatics Algorithms</a> (2nd or 3rd Edition)
by Compeau and Pevzner.</p>
<h1 id="comp571-bioc571-fall-2019">COMP571/BIOC571 (Fall 2019)</h1>
<p><strong>Important note:</strong> The information contained in the course syllabus, other than
the absence policies, may be subject to change with reasonable advance notice,
as deemed appropriate by the instructor.</p>
<h1 id="who">Who</h1>
<p>Instructor:</p>
<ul>
<li>Huw A. Ogilvie</li>
<li>Duncan Hall 3098</li>
<li><a href="mailto:hao3@rice.edu">hao3@rice.edu</a></li>
</ul>
<p>TA:</p>
<ul>
<li>Zhen Cao</li>
<li>Duncan Hall 3061</li>
<li><a href="mailto:zc36@rice.edu">zc36@rice.edu</a></li>
</ul>
<h1 id="where-and-when">Where and when</h1>
<p>Distribution of class materials and assignment submission will be conducted
via <a href="https://canvas.rice.edu/">Canvas</a>.</p>
<p>Seminars will be held in Duncan Hall <strong>1046</strong>, on Tuesdays and Thursdays, from
2:30 to 3:45 PM.</p>
<p>One scheduled office hour will be held each week, at 1:00 PM on Thursdays in Duncan Hall 3061. Individual appointments outside this time are welcome.</p>
<h1 id="intended-audience">Intended audience</h1>
<p>The students who should take COMP571/BIOC571 are generally studying computer
science, biology or genomics, and wish to learn how to apply algorithms and
statistical models to important problems in biology and genomics.</p>
<h1 id="course-objectives-and-learning-outcomes">Course objectives and learning outcomes</h1>
<p>The primary objective of the course is to teach the theory behind methods in
biological sequence analysis, including sequence alignment, sequence motifs,
and phylogenetic tree reconstruction. By the end of the course, students are
expected to understand and be able to write basic implementations of the
algorithms which power those methods.</p>
<h1 id="course-materials">Course materials</h1>
<p>The main material for this course will be the course blog. But, if you wish to
purchase a textbook, I highly recommend <strong>Bioinformatics Algorithms</strong> by
Compeau &amp; Pevzner. This is the recommended text for COMP416 “Genome-Scale
Algorithms”, and I will give references to relevant chapters in the third
edition.</p>
<h1 id="software-for-the-course">Software for the course</h1>
<p>Algorithms and statistics will be demonstrated using Python. Don’t worry if you are not fluent in Python, as no programs will have to be written from scratch.</p>
<p>The <a href="http://www.numpy.org/">NumPy</a> library for scientific computing will be used with Python. To install NumPy, first install the latest official distribution of Python 3. This can be downloaded for <a href="https://www.python.org/downloads/mac-osx/">macOS</a> or for <a href="https://www.python.org/downloads/windows/">Windows</a> from Python.org, and should already be included with your operating system if you are using Linux.</p>
<p>Then simply use the Python package manager pip to install NumPy from the command line, by running <code class="highlighter-rouge">pip3 install numpy</code>.</p>
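<p>To confirm that NumPy installed correctly, you can run a quick sanity check from Python:</p>

```python
# A quick check that NumPy is importable and working after installation.
import numpy as np

print(np.__version__)              # prints the installed NumPy version
print(np.array([1, 2, 3]).mean())  # → 2.0
```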
<h1 id="schedule">Schedule</h1>
<p>The course is organized around four themes:</p>
<ol>
<li>Models and algorithms used for sequence alignment</li>
<li>Hidden Markov Models in computational biology</li>
<li>Phylogenetic parsimony and likelihood</li>
<li>Tree search methods</li>
</ol>
<p>Each theme will have a corresponding homework assignment. Themes 1 and 2 will be covered in the first midterm, and themes 3 and 4 in the second midterm.</p>
<p><em>The below schedule may change subject to Rice University policy</em></p>
<table>
<thead>
<tr>
<th>Week</th>
<th>Tuesday class</th>
<th>Thursday class</th>
<th>Homework</th>
</tr>
</thead>
<tbody>
<tr>
<td>08/26/2019</td>
<td>No class</td>
<td>Introduction, genomes, central dogma and homology</td>
<td> </td>
</tr>
<tr>
<td>09/02/2019</td>
<td>Empirical substitution matrices<sup>1</sup></td>
<td>Global alignment<sup>1</sup></td>
<td> </td>
</tr>
<tr>
<td>09/09/2019</td>
<td>Local alignment<sup>1</sup></td>
<td>BLAST and BLAT<sup>1</sup></td>
<td> </td>
</tr>
<tr>
<td>09/16/2019</td>
<td>PSSMs<sup>1</sup></td>
<td>Pseudocounts<sup>1</sup></td>
<td>#1 issued</td>
</tr>
<tr>
<td>09/23/2019</td>
<td>Hidden Markov models<sup>2</sup></td>
<td>Viterbi algorithm<sup>2</sup></td>
<td> </td>
</tr>
<tr>
<td>09/30/2019</td>
<td>Forward algorithm<sup>2</sup></td>
<td>Backward algorithm<sup>2</sup></td>
<td>#1 due</td>
</tr>
<tr>
<td>10/07/2019</td>
<td>Applications of HMMs<sup>2</sup></td>
<td>Midterm review<sup>1,2</sup></td>
<td>#2 issued</td>
</tr>
<tr>
<td>10/14/2019</td>
<td>Midterm recess</td>
<td>Midterm exam<sup>1,2</sup></td>
<td> </td>
</tr>
<tr>
<td>10/21/2019</td>
<td>Phylogenetic trees<sup>3</sup></td>
<td>Post-midterm review<sup>3</sup></td>
<td>#2 due</td>
</tr>
<tr>
<td>10/28/2019</td>
<td>Equal-cost parsimony<sup>3</sup></td>
<td>Unequal cost parsimony<sup>3</sup></td>
<td> </td>
</tr>
<tr>
<td>11/04/2019</td>
<td>Likelihood of two sequences<sup>3</sup></td>
<td>Felsenstein’s pruning algorithm<sup>3</sup></td>
<td> </td>
</tr>
<tr>
<td>11/11/2019</td>
<td>The Felsenstein zone<sup>3</sup></td>
<td>Hill climbing and MCMC<sup>4</sup></td>
<td>#3 issued</td>
</tr>
<tr>
<td>11/18/2019</td>
<td>UPGMA and neighbor joining<sup>4</sup></td>
<td>Molecular clocks<sup>4</sup></td>
<td>#4 issued</td>
</tr>
<tr>
<td>11/25/2019</td>
<td>Course review<sup>4</sup></td>
<td>Thanksgiving recess</td>
<td>#3 due</td>
</tr>
<tr>
<td>12/02/2019</td>
<td>No class</td>
<td>Final exam<sup>3,4</sup></td>
<td>#4 due</td>
</tr>
</tbody>
</table>
<p>Superscript numbers refer to the theme(s) for that day’s class or midterm. Assignments will be both issued and due before midnight on Sundays.</p>
<h1 id="grade-policies">Grade policies</h1>
<ul>
<li>First in-class midterm: 25%</li>
<li>Second in-class midterm: 25%</li>
<li>Four homework assignments: 12.5% each</li>
</ul>
<p>Students with a strong and valid excuse for not attending a midterm will
be allowed to pick from one of the following options:</p>
<ul>
<li>Sit the midterm on a different day or time</li>
<li>Adjust their grading to increase the contribution of the corresponding homework assignments to match the midterm’s contribution</li>
<li>Adjust their grading to double the contribution of the alternate midterm</li>
</ul>
<p>Students with a strong and valid excuse for being unable to submit a homework
assignment will be allowed to pick from one of the following options:</p>
<ul>
<li>Submit the homework assignment on a later day and time</li>
<li>Adjust their grading to increase the contribution of the other homework assignments to match the assignment contribution</li>
<li>Adjust their grading to increase the contribution of the corresponding midterm to match the assignment contribution</li>
</ul>
<p>For both assignments and midterms the strength and validity of excuses, and
which of the above options are made available, will be solely the instructor’s
purview. Without a strong and valid excuse, a penalty of 10 percentage points
per day (which is equivalent to 1.25 points off the final course percent per
day) will be applied to any assignment submitted after the deadline.</p>
<h1 id="absence-policies">Absence policies</h1>
<p>Attendance is expected at every class. Attendance for the midterm exams is
compulsory and, without a strong and valid excuse, required to pass the
course even if a student would have otherwise received a passing grade.</p>
<h1 id="rice-honor-code">Rice Honor Code</h1>
<p>In this course, all students will be held to the standards of the Rice
Honor Code, a code that you pledged to honor when you matriculated at
this institution. If you are unfamiliar with the details of this code
and how it is administered, you should consult the Honor System Handbook
at <a href="http://honor.rice.edu/honor-system-handbook/">http://honor.rice.edu/honor-system-handbook/</a>.
This handbook outlines the University’s expectations for the integrity of your
academic work, the procedures for resolving alleged violations of those
expectations, and the rights and responsibilities of students and faculty
members throughout the process.</p>
<h1 id="students-with-a-disability">Students with a disability</h1>
<p>If you have a documented disability or other condition that may affect
academic performance you should: 1) make sure this documentation is on file
with Disability Support Services (Allen Center, Room 111 / <a href="mailto:adarice@rice.edu">adarice@rice.edu</a>
/ x5841) to determine the accommodations you need; and 2) talk with me to
discuss your accommodation needs.</p>
<h1 id="priors-and-clock-models-in-starbeast2-tutorial">Priors and clock models in StarBEAST2 tutorial (2019-03-27)</h1>
<h1 id="step-0-download-the-example-data-set">Step 0: Download the example data set</h1>
<p>The following programs should already be installed:</p>
<ul>
<li>BEAST 2</li>
<li>BEAUti 2</li>
<li>DensiTree</li>
<li>StarBEAST 2</li>
<li>Tracer</li>
</ul>
<p>The first three come with the BEAST 2 package. StarBEAST2 is an add-on for BEAST 2. Tracer is a separate program from BEAST 2, and can be used to inspect the output of any Bayesian program that uses the MCMC algorithm.</p>
<p>Download the <a href="http://www.cs.rice.edu/~ogilvie/assets/canis.zip">example archive</a>. This is a collection of multiple sequence alignments from species of <em>Canis</em> and closely related genera. After downloading the archive, extract it somewhere.</p>
<h1 id="step-1-open-the-starbeast2-template">Step 1: Open the StarBEAST2 template</h1>
<p>Open “BEAUti” (in the BEAST2 folder), which is a GUI application for configuring a BEAST2 analysis. Now select the StarBEAST2 template for multispecies coalescent analyses:</p>
<p><img src="http://www.cs.rice.edu/~ogilvie/assets/open-sb2-template.png" alt="Open StarBEAST2 template" /></p>
<p>The title bar should have changed to “BEAUti 2: StarBeast2”</p>
<h1 id="step-2-import-multiple-sequence-alignments">Step 2: Import multiple sequence alignments</h1>
<p>Now import the multiple sequence alignments you previously downloaded and extracted. Either click the plus button in the bottom left, or use the Import Alignment menu item:</p>
<p><img src="http://www.cs.rice.edu/~ogilvie/assets/import-alignment.png" alt="Import alignment menu option" /></p>
<p>Each locus has its own fasta file in the example data set. Select all of the loci to import, then select “all are nucleotide” when asked to specify the datatype:</p>
<p><img src="http://www.cs.rice.edu/~ogilvie/assets/open.png" alt="Select all fasta files" /></p>
<h1 id="step-3-link-clock-models">Step 3: Link clock models</h1>
<p>All of the loci should appear in the main BEAUti window now. Select all of them, and choose to “Link Clock Models” using the button at the top. This will enable estimating a weighted average clock rate for all loci. After linking, all loci should be sharing the same clock model name:</p>
<p><img src="http://www.cs.rice.edu/~ogilvie/assets/shared-clock-model.png" alt="Select all fasta files" /></p>
<h1 id="step-3-specify-taxon-sets">Step 3: Specify taxon sets</h1>
<p>Open the “Taxon sets” tab. This is where the mapping between the names used for gene sequences and the names of species is constructed. Click the button labelled “Guess” and select “Before Last”:</p>
<p><img src="http://www.cs.rice.edu/~ogilvie/assets/before-last.png" alt="Select all fasta files" /></p>
<p>Click OK. Notice that the taxon names all had an “_x” on the end. This is because BEAST gets mad if the names of species and the names used for gene sequences are the same. Adding this suffix, and removing it in BEAUti to get the species names, is a way around that issue. Your taxon sets should look like this:</p>
<p><img src="http://www.cs.rice.edu/~ogilvie/assets/taxon-sets-done.png" alt="Select all fasta files" /></p>
<h1 id="step-4-estimate-the-clock-rate">Step 4: Estimate the clock rate</h1>
<p>Open the Clock Model tab, and enable “estimate” next to the clock rate. Change the rate to 0.001 - this is the initial value of the rate, and changing it to a value closer to the posterior mode will help our analysis to converge quickly. The Clock Model tab should now look like this:</p>
<p><img src="http://www.cs.rice.edu/~ogilvie/assets/clock-model-done.png" alt="Select all fasta files" /></p>
<h1 id="step-5-specify-the-clock-rate-prior">Step 5: Specify the clock rate prior</h1>
<p>Go to the Priors panel, and expand the “strictClockRate” prior. Change the mean “M” to 0.001, and the standard deviation “S” to 0.1. The prior distribution on clock rate should now look like this:</p>
<p><img src="http://www.cs.rice.edu/~ogilvie/assets/clock-prior-distribution.png" alt="Select all fasta files" /></p>
<h1 id="step-6-save-and-launch">Step 6: Save and launch</h1>
<p>Save your configuration to a new folder, called something like “slow prior”. Launch BEAST, and open the configuration file you just created. Start the analysis, which will take 5 to 10 minutes to complete.</p>
<h1 id="step-7-rinse-and-repeat">Step 7: Rinse and repeat</h1>
<p>Repeat steps 1 through 6, but this time specify an initial clock rate of 0.01, and a clock rate mean “M” of 0.01. Use the same standard deviation “S” of 0.1. Make sure to save your new configuration file in a <strong>different</strong> folder, called something like “fast prior”.</p>
<h1 id="step-8-interrogate-the-results-using-tracer">Step 8: Interrogate the results using Tracer</h1>
<p>Open Tracer, and select Import Trace File from the file menu. Open the “starbeast.log” file from your “slow prior” folder. Then open the “starbeast.log” file from your “fast prior” folder in the same Tracer window. Highlight both trace files:</p>
<p><img src="http://www.cs.rice.edu/~ogilvie/assets/tracer-posterior.png" alt="Posterior density box plots" /></p>
<p>Select different statistics to see if their distributions are different between the analyses. In particular, look at the strictClockRate distributions (which should be the last statistic):</p>
<p><img src="http://www.cs.rice.edu/~ogilvie/assets/tracer-clock-rate.png" alt="Clock rate box plots" /></p>
<h1 id="step-9-look-at-the-trees-in-densitree">Step 9: Look at the trees in DensiTree</h1>
<p>Once you are finished with Tracer, explore the different tree files for both analyses with DensiTree. The gene tree files are named based on the locus, and the species tree files are always called “species.trees”. For example, open the TRSP posterior distribution (<code class="highlighter-rouge">TRSP.trees</code>) in DensiTree, and enable the “Full grid” so you can see the divergence times:</p>
<p><img src="http://www.cs.rice.edu/~ogilvie/assets/trsp-slow.png" alt="Slow TRSP tree" /></p>
<p>You can use DensiTree to calculate the posterior probability of clades. This is the probability that a group of sequences or species are monophyletic (share a single common ancestor to the exclusion of all other sequences or species in the data set), given the data set and the model. To do this is a little convoluted:</p>
<ol>
<li>Select the “Central” display mode (it won’t work with the default).</li>
<li>Enable the “clade toolbar” by selecting “view clade toolbar” in the window menu.</li>
<li>Open the “Clades” “folder” on the right hand side and enable “Show clades”.</li>
</ol>
<p>The size of each circle is somewhat proportional to the posterior probability of the corresponding clade, and the position along the X-axis is the expectation of the age of that clade. Take a note of the posterior probability and heights of some of the clades, for example:</p>
<p><img src="http://www.cs.rice.edu/~ogilvie/assets/lupus-anthus-clade.png" alt="TRSP lupus anthus node" /></p>
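<p>As a rough illustration of what DensiTree is summarizing, the posterior probability of a clade is simply the fraction of posterior tree samples that contain it. The sketch below uses made-up topology strings, not real StarBEAST2 output:</p>

```python
# Hypothetical posterior sample of rooted topologies (toy Newick strings);
# real tree files would need proper Newick parsing.
posterior_sample = [
    "((lupus,anthus),familiaris);",
    "((lupus,anthus),familiaris);",
    "((lupus,familiaris),anthus);",
    "((lupus,anthus),familiaris);",
]

# Substring matching only works because these toy strings are written
# consistently; it is not a general test for clade membership.
clade = "(lupus,anthus)"
clade_pp = sum(clade in tree for tree in posterior_sample) / len(posterior_sample)
print(clade_pp)  # → 0.75
```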
<p>Now compare the clades you can see and their heights in this gene tree with the clades in the corresponding “fast prior” gene tree, and both the “fast” and “slow” species trees.</p>
<h1 id="maximum-parsimony-tutorial-using-paup">Maximum Parsimony tutorial using PAUP (2019-02-20)</h1>
<h1 id="step-1-downloading-and-installing-software">Step 1: Downloading and installing software</h1>
<p>For this tutorial the programs we will use are <a href="http://doua.prabi.fr/software/seaview">SeaView</a>, <a href="https://paup.phylosolutions.com/">PAUP</a>, and
the text editor of your choice. SeaView has many uses, including:</p>
<ul>
<li>Viewing molecular sequences</li>
<li>Algorithmic alignment of molecular sequences</li>
<li>Manually editing and aligning molecular sequences</li>
<li>Estimating phylogenetic trees from molecular sequences</li>
<li>Viewing phylogenetic trees</li>
</ul>
<p>If you are running Windows or macOS, you can download the latest version of
SeaView from the <a href="http://doua.prabi.fr/software/seaview">SeaView web site</a>. If you are running Ubuntu, then
SeaView is available from the package manager. You can install it from the
“Ubuntu Software” GUI, or manually using <code class="highlighter-rouge">apt install seaview</code>.</p>
<p>PAUP is used to infer trees from molecular data, and incorporates many
different methods and models for doing so. These include:</p>
<ul>
<li>Maximum parsimony</li>
<li>Maximum likelihood</li>
<li>Distance based methods like neighbor-joining</li>
<li><a href="https://doi.org/10.1093/bioinformatics/btu530">SVDquartets</a>, which is <a href="https://doi.org/10.1101/523050">statistically consistent with the multispecies coalescent</a></li>
</ul>
<p>If you are running macOS or Linux, please download the latest <strong>command line</strong>
version of PAUP for your platform from the <a href="http://phylosolutions.com/paup-test/">PAUP test-version downloads</a>
web site. Extract PAUP, and make sure the program is executable by opening the
command line, navigating to the directory it was stored in, and running
<code class="highlighter-rouge">chmod +x paup4a164_ubuntu64</code> on Ubuntu or <code class="highlighter-rouge">chmod +x paup4a164_osx</code> on
macOS. If you are running Windows, download the Windows GUI version from the
same web site.</p>
<p>If you do not have a favorite text editor already, I recommend
<a href="https://www.sublimetext.com/3">Sublime Text</a> or <a href="https://code.visualstudio.com/">Visual Studio Code</a>. You can download and install either
program from their respective web sites.</p>
<p>After downloading the software, download the <a href="http://www.cs.rice.edu/~ogilvie/assets/phylogenetics-workshop.zip">workshop materials</a>
archive to your computer, and extract its contents.</p>
<h1 id="step-2-exploring-the-true-tree-and-sequence-data">Step 2: Exploring the true tree and sequence data</h1>
<p>Launch SeaView, and then open the <code class="highlighter-rouge">fz.tree</code> file in the
<code class="highlighter-rouge">phylogenetics-workshop</code> folder. This will show you an ultrametric tree that
was randomly generated for this workshop (using a coalescent model).</p>
<p>Still in SeaView, open the <code class="highlighter-rouge">fz.nexus</code> multiple sequence alignment file. This
is a 100,000 character alignment generated based on the tree you just opened,
and using a Jukes-Cantor model of molecular evolution.</p>
<h1 id="step-3-inferring-the-maximum-parsimony-tree-with-paup">Step 3: Inferring the maximum parsimony tree with PAUP</h1>
<p>We will use PAUP to infer a phylogenetic tree. Open the command line on your
computer, and navigate to the extracted <code class="highlighter-rouge">phylogenetics-workshop</code> folder. On
Windows, run <code class="highlighter-rouge">paup fz.nexus</code>. On macOS or Linux, replace <code class="highlighter-rouge">paup</code> with the
path to the PAUP executable on your computer. For example if you saved it
to the Downloads folder on a Mac, this might be <code class="highlighter-rouge">~/Downloads/paup4a164_osx</code>.
Run the following lines of PAUP code:</p>
<ol>
<li><code class="highlighter-rouge">Set Criterion=Parsimony;</code></li>
</ol>
<p>This tells PAUP that the parsimony score of a tree should be used to judge
its goodness of fit.</p>
<ol start="2">
<li><code class="highlighter-rouge">BandB;</code></li>
</ol>
<p>This command will identify the best fitting tree according to the parsimony
criterion. Normally we have to use some kind of stochastic algorithm like
hill-climbing or MCMC to infer trees, as the number of possible trees is so
large. Because this data set contains only 12 taxa (albeit with 100,000 sites),
we can instead use an exact “branch-and-bound” algorithm.</p>
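<p>To see why exhaustive search quickly becomes infeasible, note that the number of unrooted binary tree topologies for <em>n</em> taxa is the double factorial (2<em>n</em> − 5)!!, which can be computed with a short sketch:</p>

```python
def num_unrooted_topologies(n):
    """Number of unrooted binary tree topologies for n taxa: (2n - 5)!!"""
    count = 1
    for k in range(3, n + 1):
        count *= 2 * k - 5  # each added taxon can attach to any existing branch
    return count

# Even 12 taxa allow hundreds of millions of topologies, but that is
# still within reach of an exact branch-and-bound search.
print(num_unrooted_topologies(12))  # → 654729075
```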
<ol start="3">
<li><code class="highlighter-rouge">SaveTrees file=mp.tree replace=yes;</code></li>
</ol>
<p>Save the inferred tree as a file with the name <code class="highlighter-rouge">mp.tree</code>.</p>
<ol start="4">
<li><code class="highlighter-rouge">Quit;</code></li>
</ol>
<p>Should be self-explanatory.</p>
<h1 id="step-3-exploring-the-inferred-tree">Step 4: Exploring the inferred tree</h1>
<p>Open the inferred tree in SeaView. Make sure the true tree is still open. The
kind of inference we used produces an unrooted tree without branch lengths, so
you may have to reroot it or rotate nodes in SeaView. Experiment with
the “Swap” and “Re-root” options in SeaView so that the trees match.</p>
<p>What if any nodes are different between the truth and the estimated tree
topology?</p>