Jekyll2019-12-08T17:36:06+11:00http://www.cs.rice.edu/~ogilvie/feed.xmlSpecies and Gene EvolutionThis is my Rice University web site, where I will share information about my research and teaching.Huw A. OgilvieCalculating the likelihood for an ultrametric tree (example)2019-12-04T16:00:00+11:002019-12-04T16:00:00+11:00http://www.cs.rice.edu/~ogilvie/comp571/2019/12/04/ultrametric-likelihood-example<p>In this example we will calculate the likelihood <script type="math/tex">P(D|T,h)</script> where <script type="math/tex">D</script> is a single site, <script type="math/tex">T</script> is a rooted tree topology, and <script type="math/tex">h</script> is the set of node heights for the tree topology. Since we are using node heights instead of branch lengths, and the node heights at the tips are all zero, the tree is necessarily ultrametric. The site pattern, topology and node heights correspond to the following tree:</p> <p><img src="http://www.cs.rice.edu/~ogilvie/assets/ultrametric-likelihood-0.png" alt="Example ultrametric tree" /></p> <p>The node heights <script type="math/tex">\tau</script> are given in some unit of time <script type="math/tex">t</script> before present. As long as the substitution rate <script type="math/tex">\mu</script> is constant across the tree (i.e. we are assuming a strict molecular clock), there are three unique branch lengths <script type="math/tex">l = \mu t</script> in expected substitutions per site. In this example we assume a constant rate <script type="math/tex">\mu = 0.1</script>.</p> <p>The branch lengths of humans and chimps in substitutions per site are both 0.1, the branch length of the ancestor of humans and chimps (HC) is 0.2, and the branch length of gorillas is 0.3. We will calculate the likelihood under the Jukes–Cantor model, so we only have to calculate the probability of the state being the same by the end of a branch (e.g. A to A), and the probability of the state being something else (e.g. 
A to C), given the state at the beginning and the branch length.</p> <p>For the human and chimp branches, these will be (to four decimal places):</p> <script type="math/tex; mode=display">P_{xx}(0.1) = \frac{1}{4}\left(1 + 3e^{-\frac{4}{3}0.1}\right) = 0.9064</script> <script type="math/tex; mode=display">P_{xy}(0.1) = \frac{1}{4}\left(1 - e^{-\frac{4}{3}0.1}\right) = 0.0312</script> <p>For the HC branch, these will be:</p> <script type="math/tex; mode=display">P_{xx}(0.2) = \frac{1}{4}\left(1 + 3e^{-\frac{4}{3}0.2}\right) = 0.8245</script> <script type="math/tex; mode=display">P_{xy}(0.2) = \frac{1}{4}\left(1 - e^{-\frac{4}{3}0.2}\right) = 0.0585</script> <p>For the gorilla branch, these will be:</p> <script type="math/tex; mode=display">P_{xx}(0.3) = \frac{1}{4}\left(1 + 3e^{-\frac{4}{3}0.3}\right) = 0.7528</script> <script type="math/tex; mode=display">P_{xy}(0.3) = \frac{1}{4}\left(1 - e^{-\frac{4}{3}0.3}\right) = 0.0824</script> <p>For the tip nodes, the partial likelihoods are 1 for the observed states, and 0 otherwise:</p> <p><img src="http://www.cs.rice.edu/~ogilvie/assets/ultrametric-likelihood-1.png" alt="Tip partial likelihoods" /></p> <p>For each internal node we have to consider the left and right children separately. We will start off by calculating the partial likelihood of state A of the HC internal node. Beginning with the left child (humans), the probability of the child state being A is the probability of the end state being the same (as calculated above), multiplied by the partial likelihood of the child state. This is <script type="math/tex">0.9064 \times 1 = 0.9064</script>. 
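The transition probabilities above, and the rest of this worked example, can be reproduced with a short Python sketch (the helper names <code>p_same</code> and <code>p_diff</code> are mine, and gorilla's observed state C is hardcoded as index 1):

```python
import math

# Jukes-Cantor transition probabilities for a branch of length l
# (in expected substitutions per site)
def p_same(l):  # probability the state at the end of the branch is unchanged
    return 0.25 * (1.0 + 3.0 * math.exp(-4.0 / 3.0 * l))

def p_diff(l):  # probability of one particular different state
    return 0.25 * (1.0 - math.exp(-4.0 / 3.0 * l))

# partial likelihoods at the HC node: both children are tips observed as A,
# each at the end of a branch of length 0.1 (order of states: A, C, G, T)
hc = [p_same(0.1) ** 2] + [p_diff(0.1) ** 2] * 3

# partial likelihoods at the root: left child is the HC node (branch 0.2),
# right child is the gorilla tip observed as C (branch 0.3)
root = []
for k in range(4):
    left = sum((p_same(0.2) if k == j else p_diff(0.2)) * hc[j] for j in range(4))
    right = p_same(0.3) if k == 1 else p_diff(0.3)  # gorilla is C (index 1)
    root.append(left * right)

# equal stationary frequencies of 1/4 under Jukes-Cantor
likelihood = sum(root) / 4.0
```

Rounded to four decimal places, <code>p_same(0.1)</code> and <code>p_diff(0.1)</code> give the 0.9064 and 0.0312 used below, and <code>likelihood</code> matches the 0.0252 computed at the end of this example.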
For child states C, G and T, the probabilities will be <script type="math/tex">0.0312 \times 0 = 0</script>, so the probability for the left child branch integrating over all child states is <script type="math/tex">0.9064 + 0 + 0 + 0 = 0.9064</script>.</p> <p>The right child (chimpanzees) has the same branch length and partial likelihoods, so its probability will also be <script type="math/tex">0.9064</script>, and the partial likelihood of state A for the HC node will be <script type="math/tex">0.9064 \times 0.9064 = 0.8215</script>. We use the product because we want to calculate the probability of the left <strong>and</strong> right subtree states.</p> <p>For state C in the HC node, the probability along the left branch for child state A will be <script type="math/tex">0.0312 \times 1 = 0.0312</script>. The probability for state C will be <script type="math/tex">0.9064 \times 0 = 0</script>, and for states G and T will be <script type="math/tex">0.0312 \times 0 = 0</script>. So the probability for the left branch integrating over child states will be <script type="math/tex">0.0312</script>. Again the right branch will be the same, so the partial likelihood of state C will be <script type="math/tex">0.0312 \times 0.0312 = 0.00097</script>.</p> <p>Because of the equal base frequencies and equal rates assumption in Jukes–Cantor, the partial likelihoods of G and T will be the same as for C.</p> <p><img src="http://www.cs.rice.edu/~ogilvie/assets/ultrametric-likelihood-2.png" alt="Human--chimp partial likelihoods" /></p> <p>Now for state A at the root, the probability along the left branch for child state A will be the probability of the state remaining the same given a branch length of 0.2, multiplied by the partial likelihood of state A for the HC node, or <script type="math/tex">0.8245 \times 0.8215 = 0.6773</script>. 
For child states C, G and T it will be <script type="math/tex">0.0585 \times 0.00097 = 0.000057</script>, which is the probability of the state being different at the end given a branch length of 0.2 multiplied by the partial likelihoods. So the probability along the left branch for state A at the root integrating over the left child states will be <script type="math/tex">0.6773 + 3 \times 0.000057 = 0.6775</script>.</p> <p>For the right child (gorillas) only state C has a non-zero partial likelihood, so we should multiply the above by the probability of a different state given the branch length 0.3 to get the partial likelihood of state A at the root, which will be <script type="math/tex">0.6775 \times 0.0824 = 0.0558</script>.</p> <p><img src="http://www.cs.rice.edu/~ogilvie/assets/ultrametric-likelihood-3.png" alt="Root state A partial likelihood" /></p> <p>For state C at the root, the probability of child state A along the left (HC) branch will be <script type="math/tex">0.0585 \times 0.8215 = 0.0481</script>, the probability of child state C will be <script type="math/tex">0.8245 \times 0.00097 = 0.0008</script>, and the probabilities of child states G or T will be <script type="math/tex">0.0585 \times 0.00097 = 0.000057</script>. So integrating over the child states for the left branch, the probability will be <script type="math/tex">0.0481 + 0.0008 + 2 \times 0.000057 = 0.0490</script>. Again because of the symmetry in Jukes–Cantor, the probability along the left branch will be the same for root states G and T.</p> <p>However for state C at the root, the probability along the right (gorilla) branch will be the probability of the <em>same</em> state at the end given a branch length of 0.3, but for states G and T the probabilities will be for a <em>different</em> state. 
So for state C at the root the partial likelihood will be <script type="math/tex">0.0490 \times 0.7528 = 0.0369</script>, but for states G and T their partial likelihoods will be <script type="math/tex">0.0490 \times 0.0824 = 0.0040</script>.</p> <p><img src="http://www.cs.rice.edu/~ogilvie/assets/ultrametric-likelihood-4.png" alt="Root state partial likelihoods" /></p> <p>Each partial likelihood for a node <script type="math/tex">n</script> is conditioned on the state <script type="math/tex">k</script> at that node <script type="math/tex">P(D|n=k,T,h)</script>, but to calculate the likelihood at a node <script type="math/tex">P(D|T,h)</script> we need to integrate over the probabilities <script type="math/tex">P(D,n=k|T,h)</script> for each state at that node. Following the chain rule, we can convert the conditional likelihoods to joint probabilities by multiplying the partials by the base (stationary) frequencies. For Jukes–Cantor the base frequencies are all equal and hence <script type="math/tex">\frac{1}{4}</script> given there are 4 nucleotide states.</p> <p>So we can calculate the likelihood for the entire tree by summing the root partial likelihoods and dividing by 4. For this tree and site, the likelihood <script type="math/tex">P(D|T,h) = \frac{0.0558 + 0.0369 + 2 \times 0.0040}{4} = 0.0252</script>.</p>Huw A. OgilvieHill climbing and NNI2019-12-02T16:00:00+11:002019-12-02T16:00:00+11:00http://www.cs.rice.edu/~ogilvie/comp571/2019/12/02/hill-climbing<p>The Sankoff algorithm can efficiently calculate the parsimony score of a tree topology. 
Felsenstein’s pruning algorithm can efficiently calculate the probability of a multiple sequence alignment given a tree with branch lengths and a substitution model. But how can the tree with the lowest parsimony score, or highest likelihood, or highest posterior probability be identified?</p> <p>Possibly the simplest algorithm that can do this for most kinds of inference is hill-climbing. This algorithm basically works like this for <strong>maximum likelihood</strong> inference:</p> <ol> <li>Initialize the parameters <script type="math/tex">\theta</script></li> <li>Calculate the likelihood <script type="math/tex">L = P(D\vert\theta)</script></li> <li>Propose a small modification to <script type="math/tex">\theta</script> and call it <script type="math/tex">\theta'</script></li> <li>Calculate the likelihood <script type="math/tex">L' = P(D\vert\theta')</script></li> <li>If <script type="math/tex">L' > L</script>, accept <script type="math/tex">\theta \leftarrow \theta'</script> and <script type="math/tex">L \leftarrow L'</script></li> <li>If stopping criteria are not met, go to 3</li> </ol> <p>You may notice that without <strong>stopping criteria</strong>, the algorithm is an infinite loop. How do we know when to give up? Three obvious criteria that can be used are:</p> <ol> <li>Stop after a certain number of proposals are rejected in a row (without being interrupted by any successful proposals)</li> <li>Stop after running the algorithm for a certain length of time</li> <li>Stop after running the algorithm for a certain number of iterations through the loop</li> </ol> <p>For <strong>maximum <em>a posteriori</em></strong> inference, we also need to calculate the prior probability <script type="math/tex">P(\theta)</script>. 
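The maximum likelihood loop above can be written as a short generic sketch; here the quadratic score function and Gaussian proposal are toy stand-ins for a real likelihood and a real operator, and the rejection-count stopping criterion is criterion 1 from the list above:

```python
import random

# Hill climbing for maximum likelihood, following steps 1-6 above.
def hill_climb(score, propose, theta, max_rejects=1000):
    best = score(theta)                  # step 2
    rejects = 0
    while rejects < max_rejects:         # stopping criterion: rejections in a row
        theta_new = propose(theta)       # step 3: small modification
        s = score(theta_new)             # step 4
        if s > best:                     # step 5: accept only improvements
            theta, best = theta_new, s
            rejects = 0
        else:
            rejects += 1
    return theta, best

random.seed(1)
theta, best = hill_climb(
    score=lambda t: -(t - 3.0) ** 2,     # toy "log-likelihood" peaked at 3
    propose=lambda t: t + random.gauss(0.0, 0.1),
    theta=0.0,
)
```

Because the toy score has a single optimum, the climb reliably ends near 3; the local-optimum problem discussed later only appears when the score surface has several peaks.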
Because the marginal likelihood <script type="math/tex">P(D)</script> does not change, following Bayes’ rule the posterior probability <script type="math/tex">P(\theta\vert D)</script> is proportional to <script type="math/tex">P(D\vert\theta)P(\theta)</script>, which we might call the unnormalized posterior probability. So instead of maximizing the likelihood, we instead maximize the product of the likelihood and prior, which we have to recalculate for each proposal. The algorithm becomes:</p> <ol> <li>Initialize the parameters <script type="math/tex">\theta</script></li> <li>Calculate the unnormalized posterior probability <script type="math/tex">P = P(D\vert\theta)P(\theta)</script></li> <li>Propose a small modification to <script type="math/tex">\theta</script> and call it <script type="math/tex">\theta'</script></li> <li>Calculate the unnormalized posterior probability <script type="math/tex">P' = P(D\vert\theta')P(\theta')</script></li> <li>If <script type="math/tex">P' > P</script>, accept <script type="math/tex">\theta \leftarrow \theta'</script> and <script type="math/tex">P \leftarrow P'</script></li> <li>If stopping criteria are not met, go to 3</li> </ol> <p>For <strong>maximum parsimony</strong> inference, we simply need to calculate the parsimony score of our parameters, so I will describe this as a function <script type="math/tex">f(D,\theta)</script> which returns the parsimony score. 
The algorithm becomes:</p> <ol> <li>Initialize the parameters <script type="math/tex">\theta</script></li> <li>Calculate the parsimony score <script type="math/tex">S = f(D,\theta)</script></li> <li>Propose a small modification to <script type="math/tex">\theta</script> and call it <script type="math/tex">\theta'</script></li> <li>Calculate the parsimony score <script type="math/tex">S' = f(D,\theta')</script></li> <li>If <script type="math/tex">% <![CDATA[ S' < S %]]></script>, accept <script type="math/tex">\theta \leftarrow \theta'</script> and <script type="math/tex">S \leftarrow S'</script></li> <li>If stopping criteria are not met, go to 3</li> </ol> <p>Note that the inequality is reversed in step 5 for maximum parsimony. These are all described for general cases, but for phylogenetic inference <script type="math/tex">\theta</script> will correspond to a tree topology, and possibly branch lengths (for non-ultrametric trees) or node heights (for ultrametric trees). Maximum parsimony is unaffected by branch lengths, so <script type="math/tex">\theta</script> is only the tree topology. Proposing changes to branch lengths or node heights is relatively simple because we can use some kind of uniform, Gaussian or other proposal distribution. But how do we propose a small change to the tree topology?</p> <p>A huge amount of research has gone into tree changing “operators,” but the simplest and most straightforward is nearest-neighbor interchange, or NNI. This works by isolating an internal branch of a tree, which for an unrooted tree always has four connected branches. 
The four nodes at the end of the connected branches may be tips or other internal nodes, because NNI can work on trees of any size.</p> <p>One of the nodes is fixed in place (in this example, humans), and its sister node is exchanged with one of the two other nodes.</p> <p><img src="http://www.cs.rice.edu/~ogilvie/assets/unrooted-nni.png" alt="Unrooted NNI" /></p> <p>For example the NNI move from the tree at the top to the tree in the bottom-right exchanges mouse (M) with chimpanzee (C), causing the sister of humans to change from chimps to mice. For four taxon trees there are only three topologies, and they are all connected by a single NNI move. For five taxon unrooted trees there are fifteen topologies and not all are connected:</p> <p><img src="http://www.cs.rice.edu/~ogilvie/assets/five-taxon-nni-space.png" alt="Five-taxon trees" /></p> <p>In the above example, each gray line represents an NNI move between topologies, and there are (made-up) parsimony scores above each topology. There are two peaks in parsimony score, one for the tree (((A,E),D),(B,C)) where the parsimony score is 1434, and one for the tree (((B,E),D),(A,C)) where the parsimony score is 1435. Since the second peak has a higher (worse) parsimony score, it is a local rather than the global optimum.</p> <p>This illustrates the biggest problem with hill climbing. Because we only accept changes that improve the score, once we reach a peak where all connected points in parameter space (unrooted topologies in this case) are worse, then we can never climb down. Imagine we initialized our hill climbing using the topology indicated by the black arrow. By chance we could have followed the red path to the globally optimal solution… or the blue path to a local optimum.</p> <p>One straightforward way to address this weak point is to run hill climbing <strong>multiple times</strong>. 
The likelihood, unnormalized posterior probability or parsimony scores of the final accepted states for each hill climb can be compared, and the best solution out of all runs accepted, in the hope that it corresponds to the global optimum.</p> <p>What about NNI for <strong>rooted trees</strong>? It works in a very similar way, but we have to pretend that there is an “origin” tip <em>above</em> the root node, and perform the operation on the unrooted equivalent of the rooted tree. Here I use the example of three taxon rooted trees, and in this example I fix the origin.</p> <p><img src="http://www.cs.rice.edu/~ogilvie/assets/rooted-nni.png" alt="Rooted NNI" /></p> <p>For three taxon rooted trees, there is one internal branch. In this example, the “sister” to the origin for the tree on the left is humans, so the NNI operations exchange humans with either chimps (becoming the tree on the right), or with mice (becoming the tree on the bottom).</p> <p>And how do we <strong>initialize</strong> hill climbing in phylogenetics? There are a few ways.</p> <ol> <li>Randomly generate a tree using simulation</li> <li>Permute the taxon labels on a predefined tree</li> <li>Use neighbor-joining if the tree is unrooted</li> <li>Use UPGMA if the tree is rooted</li> </ol> <p>The latter two methods have the advantage of starting closer to the optimal solutions, reducing the time required for a single hill climb. However when running hill climbing multiple times, the first two methods have the advantage of making the different runs more independent of each other, and therefore making it more likely that one of them will find the global optimum.</p>Huw A. Ogilvie
Long branch attraction (in the Felsenstein zone)2019-12-01T16:00:00+11:002019-12-01T16:00:00+11:00http://www.cs.rice.edu/~ogilvie/comp571/2019/12/01/long-branch-attraction<p>Long branch attraction is the phenomenon where two branches which are in truth not sisters are inferred to be sister branches when using maximum parsimony inference. This occurs because, unlike likelihood, parsimony does not take into account branch lengths when computing the parsimony score.</p> <p>Maximum likelihood inference considers all sites when calculating the likelihood, but only so-called “parsimony informative sites” will end up determining the tree inferred using maximum parsimony. These are sites where at least two tips share a state, and at least two other tips share a state which is different from the first state.</p> <p>Consider the case of humans, chimps, rats and mice. In truth, humans and chimps should be sisters, as should rats and mice. The parsimony informative sites that support the true tree topology will therefore be those where humans and chimps share a state, and rats and mice share a state which is different from the human/chimp state (site patterns on the left in the below figure).</p> <p><img src="http://www.cs.rice.edu/~ogilvie/assets/informative-sites.png" alt="Informative site patterns" /></p> <p>The score of those sites given the true topology (top-left in the above figure) will be 1 for equal-cost parsimony. Given one of the two incorrect unrooted topologies (middle-left and bottom-left), the score of those sites will be 2, because at least two mutations along the tree are required to explain the site pattern.</p> <p>For the uninformative sites, e.g. 
if we give mice a different state from every other species (site patterns on the right), at least two mutations will be required for all topologies and the score will always be 2 (see trees on the right). The contribution of these sites is therefore a constant and does not affect the inference.</p> <p>So if the number of parsimony informative site patterns supporting one of the incorrect topologies is greater than the number of informative site patterns supporting the true topology, the best scoring topology will be incorrect and our inferred topology will be wrong.</p> <p><img src="http://www.cs.rice.edu/~ogilvie/assets/felsenstein-zone.png" alt="Felsenstein zone" /></p> <p><em>Felsenstein zone trees with branch lengths in substitutions per site</em></p> <p>How can this be possible? Consider the above-right tree. Because the internal branch is short, and the chimp and mouse branches are also short, the probability of mutation along those three branches is minimal. Chimps and mice are therefore likely to share a state. But because the human and rat branches are long, the probability of mutation is high.</p> <p>Given a lack of mutation elsewhere, if a mutation or mutations in the human and rat branches cause the human and rat states to differ, the site will be uninformative. But if convergent mutations occur, the resulting site will be parsimony informative and support the incorrect topology where humans and rats are sister species (for example, the above site patterns).</p> <p>These sites will contribute a score of 2 to the true topology and a score of 1 to the human-rat topology when using equal-cost parsimony, the inverse of the contribution from parsimony informative sites that support the true human-chimp topology. So if more of the human-rat supporting sites are in a data set than human-chimp supporting sites, the wrong topology will be inferred using maximum parsimony.</p> <p>How likely is this to occur? 
I simulated sequence alignments for a range of branch lengths, beginning with the above-left branch lengths, gradually increasing the human and rat lengths (l1) while decreasing the chimp and mouse lengths (l2), ending with the above-right branch lengths. The internal branch length was always 0.1 substitutions per site. Jukes-Cantor was used as the substitution model, and 1 million sites were simulated per alignment. For each set of branch lengths I counted the percentage of parsimony informative sites supporting the correct topology and the percentage supporting the human-rat or human-mouse topologies.</p> <p><img src="http://www.cs.rice.edu/~ogilvie/assets/pi-site-support.png" alt="Parsimony informative site support" /></p> <p>You can see that when l1 is greater than somewhere between 0.75 and 0.8 or less than somewhere between 0.3 and 0.35, the number of parsimony informative sites supporting the human-rat topology becomes greater than the number supporting the true human-chimp topology. These crossovers mark the border of the Felsenstein zone.</p> <p>For both Dollo and equal rates models of evolution, whether a four-taxon tree is in the Felsenstein zone can be tested analytically rather than by simulation. For details, see Felsenstein’s paper, “Cases in which parsimony or compatibility methods will be positively misleading,” published in Systematic Zoology (now known as Systematic Biology) in 1978.</p>Huw A. Ogilvie
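A minimal version of this kind of simulation can be sketched in Python. The branch lengths here (l1 = 2.0, l2 = 0.01, internal branch 0.1) are my own choices placed deep inside the Felsenstein zone, not the values plotted above; note that a chimp-mouse grouping implies the same unrooted split as a human-rat grouping:

```python
import math
import random

STATES = "ACGT"

def evolve(state, branch_length):
    """Return the state at the end of a branch under Jukes-Cantor."""
    p_same = 0.25 * (1.0 + 3.0 * math.exp(-4.0 / 3.0 * branch_length))
    if random.random() < p_same:
        return state
    return random.choice([s for s in STATES if s != state])

def simulate_site(l1, l2, internal):
    x = random.choice(STATES)      # internal node joining human and chimp
    y = evolve(x, internal)        # internal node joining rat and mouse
    human, chimp = evolve(x, l1), evolve(x, l2)
    rat, mouse = evolve(y, l1), evolve(y, l2)
    return human, chimp, rat, mouse

random.seed(0)
n_hc = n_hr = 0
for _ in range(20000):
    h, c, r, m = simulate_site(2.0, 0.01, 0.1)
    if h == c and r == m and h != r:
        n_hc += 1                  # supports the true human-chimp grouping
    elif h == r and c == m and h != c:
        n_hr += 1                  # supports the wrong human-rat grouping
```

With these branch lengths the informative sites supporting the wrong human-rat split heavily outnumber those supporting the true split, which is exactly the situation that misleads maximum parsimony.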
Likelihood of a tree2019-11-27T16:00:00+11:002019-11-27T16:00:00+11:00http://www.cs.rice.edu/~ogilvie/comp571/2019/11/27/likelihood-of-a-tree<p>The likelihood of a tree is the probability of a multiple sequence alignment or matrix of trait states (commonly known as a character matrix) given a tree topology, branch lengths and substitution model. An efficient dynamic programming algorithm to compute this probability was first developed by <a href="https://doi.org/10.1093/sysbio/22.3.240">Felsenstein in 1973</a>, and is quite similar to the algorithm used to infer unequal-cost parsimony scores developed by <a href="https://www.jstor.org/stable/2100459">Sankoff in 1975</a>.</p> <p>As with the Sankoff algorithm, a vector is associated with each node of the tree. Each element of the vector stores the probability of observing the tip states, given the tree below the associated node and the state corresponding to the element (the first, second, third and fourth elements usually correspond to A, C, G and T for DNA).</p> <p>Those probabilities marginalize over all possible states at every internal node below the root of the subtree. These are known as partial likelihoods, and are in contrast with the vector elements of the Sankoff algorithm, which are calculated only from the states which minimize the total cost. 
We might write the partial likelihood for state <script type="math/tex">k</script> at node <script type="math/tex">n</script> as:</p> <script type="math/tex; mode=display">P_{n,k} = P(D_i|k, T, l, M)</script> <p>where <script type="math/tex">D_i</script> is the tip states at position <script type="math/tex">i</script> of the multiple sequence alignment or character matrix, <script type="math/tex">T</script> is the topology of the subtree under the node, <script type="math/tex">l</script> is the branch lengths of the subtree, and <script type="math/tex">M</script> is the substitution model. I will go over the five key differences between the two algorithms.</p> <p><strong>One.</strong> For the Sankoff algorithm the elements in the vectors at the tips are initialized to either zero for the observed states or infinity otherwise, because only the observed state can occur at the tips. However because partial likelihoods are probabilities not costs, for likelihood they are initialized to 1 for 100% probability (or 0 if working in log space) for the observed states, and 0 for 0% probability (or negative infinity if working in log space).</p> <p><strong>Two.</strong> Because Felsenstein’s likelihood depends on branch lengths and not just topology, the transition probabilities must be recomputed for each branch. For the Jukes-Cantor model just two probabilities are needed because it assumes equal base frequencies and equal rates between all pairs of states. The first is the probability of state <script type="math/tex">k</script> at the parent node and state <script type="math/tex">k'</script> at the child node being the same <strong>conditioned on</strong> <script type="math/tex">k</script>:</p> <script type="math/tex; mode=display">P(k' = k|k) = P_{xx} = \frac{1}{4}(1 + 3 e^{-\frac{4}{3}\mu t})</script> <p>where <script type="math/tex">\mu t</script> is the product of the substitution rate and length of the branch in time, which is the length of the branch in substitutions per site. 
And the second is the probability of the state at the child node being different, again conditioned on the state at the parent node:</p> <script type="math/tex; mode=display">P(k' \ne k|k) = P_{xy} = \frac{1}{4}(1 - e^{-\frac{4}{3}\mu t})</script> <p><strong>Three.</strong> Because the partial likelihoods marginalize over the internal node states, for each child branch the probabilities for all child node states must be summed over rather than finding the minimum cost. Using Jukes-Cantor, when calculating the partial likelihood for state <script type="math/tex">k</script> at node <script type="math/tex">n</script>, for the one case where the state <script type="math/tex">k'</script> at the child node <script type="math/tex">c</script> equals <script type="math/tex">k</script>, the probability is <script type="math/tex">P_{xx} P_{c,k'}</script>. For the three cases where it does not, the probabilities are <script type="math/tex">P_{xy}P_{c,k'}</script>. By summing all four probabilities, we marginalize over the possible states at that child node.</p> <p><strong>Four.</strong> Cost accumulates, but the joint probability of independent variables multiplies. So for parsimony the cost of the left and right subtrees under a node (stored in the vectors associated with the left and right children) and the cost of the mutations along the left and right child branches (if any) are all added together. But for likelihood the left and right marginal probabilities are multiplied. Why are left and right marginal probabilities independent? Because sequences evolve independently along left and right subtrees, conditioned on the state at the root.</p> <p>This also applies when calculating the cost or likelihood of a sequence alignment or character matrix. For maximum parsimony the cost accumulates for each additional site, so the parsimony score of an alignment is the sum of minimum costs for each site. 
But for maximum likelihood the likelihood of each site is a probability and we treat each site as evolving independently, so the likelihood for the alignment is the product of site likelihoods.</p> <p><strong>Five.</strong> For maximum parsimony, the smallest element of the root node vector gives the parsimony score of the tree. But for Felsenstein’s likelihood, we want to marginalize over root states, i.e. we want <script type="math/tex">P(D_i|T,l,M)</script> which does not depend on state <script type="math/tex">k</script> at the root. Given the RNA alphabet <script type="math/tex">\{A,C,G,U\}</script>, we can perform this marginalization by summing over the joint probabilities:</p> <script type="math/tex; mode=display">P(D_i|T,l,M) = P(D_i,k=A|T,l,M) + P(D_i,k=C|T,l,M) + P(D_i,k=G|T,l,M) + P(D_i,k=U|T,l,M)</script> <p>But the partial likelihoods at the root give us <script type="math/tex">P(D_i|k, T, l, M)</script>, where state <script type="math/tex">k</script> is on the right side of the conditional. We can use the chain rule to convert them to joint probabilities:</p> <script type="math/tex; mode=display">P(D_i,k|T,l,M) = P(D_i|k,T,l,M) \cdot P(k)</script> <p>but what is <script type="math/tex">P(k)</script>? 
It is the stationary frequency of the state, which for Jukes-Cantor is always <script type="math/tex">\frac{1}{4}</script>, so for that substitution model we just have to sum the partial likelihoods at the root and divide by four to get the likelihood of the tree.</p> <p>The following code will calculate the likelihood of a tree (in Newick format) for a multiple sequence alignment (MSA in FASTA format), with the paths to the tree and MSA files given as the first and second arguments to the program.</p> <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import ete3
import numpy
import os.path
import sys

# used by read_fasta to turn a sequence string into a vector of integers based
# on the supplied alphabet
def vectorize_sequence(sequence, alphabet):
    sequence_length = len(sequence)
    sequence_vector = numpy.zeros(sequence_length, dtype = numpy.uint8)
    for i, char in enumerate(sequence):
        sequence_vector[i] = alphabet.index(char)
    return sequence_vector

# this is a function that reads in a multiple sequence alignment stored in
# FASTA format, and turns it into a matrix
def read_fasta(fasta_path, alphabet):
    label_order = []
    sequence_matrix = numpy.zeros(0, dtype = numpy.uint8)
    fasta_file = open(fasta_path)
    l = fasta_file.readline()
    while l != "":
        l_strip = l.rstrip() # strip out newline characters
        if l_strip.startswith("&gt;"):
            label = l_strip[1:]
            label_order.append(label)
        else:
            sequence_vector = vectorize_sequence(l_strip, alphabet)
            sequence_matrix = numpy.concatenate((sequence_matrix, sequence_vector))
        l = fasta_file.readline()
    fasta_file.close()
    n_sequences = len(label_order)
    sequence_length = len(sequence_matrix) // n_sequences
    sequence_matrix = sequence_matrix.reshape(n_sequences, sequence_length)
    return label_order, sequence_matrix

# this is a function that reads in a phylogenetic tree stored in newick
# format, and turns it into an ete3 tree object
def read_newick(newick_path):
    newick_file = open(newick_path)
    newick = newick_file.read().strip()
    newick_file.close()
    tree = ete3.Tree(newick)
    return tree

def recurse_likelihood(node, site_i, n_states):
    if node.is_leaf():
        node.partial_likelihoods.fill(0) # reset the leaf likelihoods
        leaf_state = node.sequence[site_i]
        node.partial_likelihoods[leaf_state] = 1
    else:
        left_child, right_child = node.get_children()
        recurse_likelihood(left_child, site_i, n_states)
        recurse_likelihood(right_child, site_i, n_states)
        for node_state in range(n_states):
            left_partial_likelihood = 0.0
            right_partial_likelihood = 0.0
            for child_state in range(n_states):
                if node_state == child_state:
                    left_partial_likelihood += left_child.pxx * left_child.partial_likelihoods[child_state]
                    right_partial_likelihood += right_child.pxx * right_child.partial_likelihoods[child_state]
                else:
                    left_partial_likelihood += left_child.pxy * left_child.partial_likelihoods[child_state]
                    right_partial_likelihood += right_child.pxy * right_child.partial_likelihoods[child_state]
            node.partial_likelihoods[node_state] = left_partial_likelihood * right_partial_likelihood

# nucleotides, obviously
alphabet = "ACGT" # A = 0, C = 1, G = 2, T = 3
n_states = len(alphabet)

# this script requires a newick tree file and fasta sequence file, and
# the paths to those two files are given as arguments to this script
tree_path = sys.argv[1]
root_node = read_newick(tree_path)
msa_path = sys.argv[2]
taxa, alignment = read_fasta(msa_path, alphabet)
site_count = alignment.shape[1]

# the number of taxa, and the number of nodes in a rooted phylogeny with that
# number of taxa
n_taxa = len(taxa)
n_nodes = n_taxa + n_taxa - 1

for node in root_node.traverse():
    # initialize a vector of partial likelihoods that we can reuse for each site
    node.partial_likelihoods = numpy.zeros(n_states)
    # we can precalculate the pxx and pxy values for the branch associated with
    # this node
    node.pxx = (1 / 4) * (1 + 3 * numpy.exp(-(4 / 3) * node.dist))
    node.pxy = (1 / 4) * (1 - numpy.exp(-(4 / 3) * node.dist))
    # add sequences to leaves
    if node.is_leaf():
        taxon = node.name
        taxon_i = taxa.index(taxon)
        node.sequence = alignment[taxon_i]

# this will be the total likelihood of all sites
log_likelihood = 0.0
for site_i in range(site_count):
    recurse_likelihood(root_node, site_i, n_states)
    # need to multiply the partial likelihoods by the stationary frequencies
    # which for Jukes-Cantor is 1/4 for all states
    log_likelihood += numpy.log(numpy.sum(root_node.partial_likelihoods * (1 / 4)))

tree_filename = os.path.split(tree_path)[1]
msa_filename = os.path.split(msa_path)[1]
tree_name = os.path.splitext(tree_filename)[0]
msa_name = os.path.splitext(msa_filename)[0]
print("The log likelihood P(%s|%s) = %f" % (msa_name, tree_name, log_likelihood))
</code></pre></div></div>Huw A. OgilvieEqual-cost parsimony2019-11-26T16:00:00+11:002019-11-26T16:00:00+11:00http://www.cs.rice.edu/~ogilvie/comp571/2019/11/26/equal-cost-parsimony<p>The principle behind maximum parsimony based inference is to explain the data using the smallest cost. In its most basic form, all events are given equal cost, so a nucleotide changing from A to C (a transversion) is given the same cost as a change from C to T (a transition). Likewise the gain of a trait, e.g. flight, is given the same cost as the loss of that trait. In this case finding the explanation with the smallest cost is the same as finding the explanation with the smallest number of events. 
In a phylogenetic context, the explanation is the tree topology, and the events are mutations of molecular sequences or organismal traits.</p> <p>Equal-cost parsimony can be solved using a simple procedure called the Fitch algorithm (<a href="https://doi.org/10.1093/sysbio/20.4.406">Fitch, 1971</a>). The output of this algorithm is the smallest number of events required to explain the pattern of one site or trait for a given tree topology.</p> <p>As an example, let’s consider a genomic position homologous between apes and rodents. At this position the nucleotide observed for humans and chimps is adenine (A), for gorillas and mice it is cytosine (C), and for rats it is guanine (G). We will compute the parsimony score for a given tree topology, in this case one that treats humans and chimps as sisters, and also mice and rats as sisters.</p> <p><img src="http://www.cs.rice.edu/~ogilvie/assets/fitch-0.png" alt="Topology and site pattern" /></p> <p>Like other dynamic programming algorithms for phylogenetic inference, we need to initialize the values at each tip. For the Fitch algorithm, there are two different kinds of values at each node:</p> <ol> <li>a set of most parsimonious states given the site pattern and topology <strong>under that node</strong></li> <li>the minimum number of changes required to explain the site pattern given the topology <strong>under that node</strong></li> </ol> <p>For the tip nodes, each set has a single element corresponding to the observed state, and the minimum number of changes is always zero.</p> <p><img src="http://www.cs.rice.edu/~ogilvie/assets/fitch-1.png" alt="Initial states" /></p> <p>Then we need to recurse through the internal nodes of the tree, always visiting children before parents. The most straightforward way to accomplish this is <a href="https://opendsa-server.cs.vt.edu/ODSA/Books/Everything/html/BinaryTreeTraversal.html">postorder traversal</a>.
However, for this example we will use levelorder traversal, visiting the lowest level of nodes first, then the next lowest, until we get to the root.</p> <p>For each node we first calculate the intersection of the sets of most parsimonious states from the node’s children. For humans and chimps the intersection contains a single state “A”, but for rodents the intersection is empty.</p> <p>When the intersection is non-empty, we add all elements of the intersection to the set of most parsimonious states for a given node. A non-empty intersection also means that no changes are required along either branch leading to the children, as at least one most parsimonious state is present in all three sets (parent and two children).</p> <p>Since no changes are required, we calculate the parsimony score for that node (the minimum number of required changes) by simply adding the parsimony score for the two children. In the case of humans and chimps, the intersection is {“A”} and the sum of parsimony scores is 0.</p> <p>When the intersection is empty, we add all elements of the <em>union</em> to the set of most parsimonious states. For each state in the union, it will either be present in the parent and left child sets, or the parent and right child sets. In both cases we need at least one mutation to explain the pattern, but the mutation will be on the left or right branch respectively. So the parsimony score will be the sum of scores of the children, <em>plus one</em>. In the case of rodents, the union is {C, G} and the parsimony score will be 0 + 0 + 1 = 1.</p> <p><img src="http://www.cs.rice.edu/~ogilvie/assets/fitch-2.png" alt="Level 1" /></p> <p>For the ancestor of humans, chimps and gorillas (Homininae), the intersection of the human and chimp set on the left {A} and the gorilla set {C} is empty, so we use the union {A, C}. 
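<p>The intersection/union rule just described can be sketched in Python using sets. This is an illustrative sketch rather than the course’s own code: the nested-tuple tree encoding and the function name are assumptions.</p>

```python
# A sketch of the Fitch algorithm, assuming a tree encoded as nested tuples
# where each leaf is a one-element set holding its observed state. The
# encoding and the name "fitch" are illustrative, not from the course code.
def fitch(node):
    if isinstance(node, set):  # leaf: observed state, zero changes so far
        return node, 0
    left, right = node
    left_set, left_score = fitch(left)
    right_set, right_score = fitch(right)
    intersection = left_set & right_set
    if intersection:  # non-empty: no change needed on either child branch
        return intersection, left_score + right_score
    # empty: take the union, and require one change on a child branch
    return left_set | right_set, left_score + right_score + 1

# site pattern from the example: humans and chimps A, gorillas and mice C, rats G
apes = (({"A"}, {"A"}), {"C"})
rodents = ({"C"}, {"G"})
states, score = fitch((apes, rodents))
print(states, score)  # the most parsimonious root state is C, with score 2
```

Running this on the example site pattern reproduces the walkthrough below: the root set is {C} and the parsimony score is 2.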
Since the intersection was empty, the parsimony score will be the sum of child scores plus one.</p> <p><img src="http://www.cs.rice.edu/~ogilvie/assets/fitch-3.png" alt="Level 2" /></p> <p>Finally at the root, the intersection of the ape set {A, C} and the rodent set {C, G} is nonempty, as C is present in both. So the most parsimonious state at the root will be C, and since this state is present in all three sets, we do not need to invoke changes and only need to sum the child scores. For this example this sum is 1 + 1 = 2.</p> <p><img src="http://www.cs.rice.edu/~ogilvie/assets/fitch-4.png" alt="Root" /></p> <p>Equal cost parsimony will derive the same score for any rooted tree with the same unrooted topology. In other words, neither the rooting nor the branch lengths affect the score in any way (at least in terms of inference). Given five taxa as in the above example, there are fifteen possible unrooted topologies:</p> <p><img src="http://www.cs.rice.edu/~ogilvie/assets/topologies.png" alt="Root" /></p> <p>I have given the parsimony score for each topology given the site pattern. In this case there are five maximum parsimony solutions, and we cannot distinguish between them. Luckily one of these is the “true in real life” tree topology for these organisms (left middle).</p> <p>The parsimony score of a multiple sequence alignment, or the character matrix of a set of traits, is the sum of parsimony scores for all sites in the alignment or all traits. By sampling enough sites and/or traits we should be able to identify a single optimal tree from its parsimony score.</p>Huw A. OgilvieThe principle behind maximum parsimony based inference is to explain the data using the smallest cost. In its most basic form, all events are given equal cost, so a nucleotide changing from A to C (a transversion) is given the same cost as a change from C to T (a transition). Likewise the gain of a trait, e.g. flight, is given the same cost as the loss of that trait. 
In this case finding the explanation with the smallest cost is the same as finding the explanation with the smallest number of events. In a phylogenetic context, the explanation is the tree topology, and the events are mutations of molecular sequences or organismal traits.Dollo’s law and unequal-cost parsimony2019-11-26T16:00:00+11:002019-11-26T16:00:00+11:00http://www.cs.rice.edu/~ogilvie/comp571/2019/11/26/unequal-cost-parsimony<p>Certain mutations are more surprising than others. DNA is composed of a string of nucleotides, which are either pyrimidines (cytosine or thymine) or purines (adenine or guanine). A single point mutation to DNA is either a <em>transition</em> from one pyrimidine to another or one purine to another, or a <em>transversion</em> from a purine to a pyrimidine or <em>vice versa</em>. Transitions are biochemically easier than transversions, and hence much more commonly occurring in the evolution of genomes.</p> <p><img src="http://www.cs.rice.edu/~ogilvie/assets/purines-pyrimadines.png" alt="Purines and pyrimidines" /></p> <p>Image from Wikipedia user Zephyris</p> <p>This principle also applies to traits. Dollo’s law states that complex characters, once lost from a lineage, are unlikely to be regained (<a href="https://doi.org/10.1002/jez.b.22642">Wright <em>et al</em>. 2015</a>, <a href="https://paleoglot.org/files/Dollo_93.pdf">Dollo 1893</a>). For example, the evolution of flight in bats required the evolution of multiple components like wing membranes, a novel complex of muscles and low-mass bones (<a href="https://doi.org/10.1002/wdev.50">Cooper <em>et al</em>. 2010</a>). Once any one of those components is lost the others are likely to be lost too. Because regaining the trait will require so many components to be regained, it is unlikely.
Therefore we should be more surprised by a transition from flightlessness to flightedness than the reverse.</p> <p><img src="http://www.cs.rice.edu/~ogilvie/assets/bat-wing.jpg" alt="Bat wing skeleton" /></p> <p>Figure 1 from <a href="https://doi.org/10.1002/wdev.50">Cooper et al. (2010)</a> showing the thin elongated metacarpals and phalanges of Seba’s short‐tailed bat.</p> <p>Equal-cost parsimony, for example when using the Fitch algorithm, does not account for this kind of difference in expectations. However unequal-cost parsimony uses a cost matrix to assign different costs to different transitions. For the DNA evolution example, it might look something like this:</p> <table> <thead> <tr> <th style="text-align: right"> </th> <th style="text-align: right">A</th> <th style="text-align: right">C</th> <th style="text-align: right">G</th> <th style="text-align: right">T</th> </tr> </thead> <tbody> <tr> <td style="text-align: right">A</td> <td style="text-align: right">0</td> <td style="text-align: right">5</td> <td style="text-align: right">1</td> <td style="text-align: right">5</td> </tr> <tr> <td style="text-align: right">C</td> <td style="text-align: right">5</td> <td style="text-align: right">0</td> <td style="text-align: right">5</td> <td style="text-align: right">1</td> </tr> <tr> <td style="text-align: right">G</td> <td style="text-align: right">1</td> <td style="text-align: right">5</td> <td style="text-align: right">0</td> <td style="text-align: right">5</td> </tr> <tr> <td style="text-align: right">T</td> <td style="text-align: right">5</td> <td style="text-align: right">1</td> <td style="text-align: right">5</td> <td style="text-align: right">0</td> </tr> </tbody> </table> <p>This cost matrix penalizes a transversion five times more than it penalizes a transition. 
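<p>As a sketch, the cost matrix above can be built from the purine/pyrimidine split rather than written out by hand; the variable names here are illustrative:</p>

```python
import numpy

# build the 4x4 cost matrix above from the purine/pyrimidine classification;
# rows and columns are indexed A = 0, C = 1, G = 2, T = 3
alphabet = "ACGT"
purines = {"A", "G"}

cost = numpy.zeros((4, 4))
for i, x in enumerate(alphabet):
    for j, y in enumerate(alphabet):
        if x == y:
            cost[i, j] = 0  # no change
        elif (x in purines) == (y in purines):
            cost[i, j] = 1  # transition: both purines, or both pyrimidines
        else:
            cost[i, j] = 5  # transversion
```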
For the trait evolution example, it might look something like this:</p> <table> <thead> <tr> <th style="text-align: right">.</th> <th style="text-align: right">+</th> <th style="text-align: right">-</th> </tr> </thead> <tbody> <tr> <td style="text-align: right">+</td> <td style="text-align: right">0</td> <td style="text-align: right">1</td> </tr> <tr> <td style="text-align: right">-</td> <td style="text-align: right">Infinity</td> <td style="text-align: right">0</td> </tr> </tbody> </table> <p>In the above matrix, plus is used to indicate the presence of a trait (e.g. flight) and a minus indicates the absence. This kind of matrix is known as a Dollo model, where only forward transitions (from + to -, i.e. losing the trait) are allowed, and reverse transitions are prohibited. Using this model implies that the trait <em>must</em> have been present in the most recent common ancestor (MRCA) of all species in the tree, so it will be inappropriate to use when the trait was absent from the MRCA.</p> <p>The <a href="https://www.jstor.org/stable/2100459">Sankoff algorithm</a> uses dynamic programming to efficiently calculate the parsimony score for a given tree topology and cost matrix. Let’s use the DNA cost matrix above to demonstrate it.</p> <p>A vector is associated with every node of the tree. The size of the vector is the size of the alphabet for a character, so 2 for a binary trait like flight, 4 for DNA or 20 for proteins. Each element of the vector corresponds to one of the possible states for that character. Each element of the vector stores the parsimony score for the tree topology under a node, given the state at that node corresponding to the element, and the known tip states.</p> <p>To initialize the tip node vectors, set the cost for the elements corresponding to known tip states to zero. The other states are known to be not true, so they should never be considered. 
This could be achieved by setting their cost to infinity, represented here by dots.</p> <p><img src="http://www.cs.rice.edu/~ogilvie/assets/sankoff-1.png" alt="Sankoff" /></p> <p>For each element of each internal node, we have to consider the cost of each possible transition for each child branch. The parsimony score for the element is the minimum possible cost for the left branch, plus the minimum possible cost for the right branch. The cost for each possible transition is the corresponding value from the cost matrix, plus the score in the corresponding child element.</p> <p>Consider the MRCA of humans and chimps. For state A, the cost of transitioning to A in humans will be 0 + 0 = 0, to C will be 5 + ∞ = ∞, to G will be 1 + ∞ = ∞, and to T will be 5 + ∞ = ∞. The minimum for the left branch is therefore 0. Since chimps have the same state as humans in this example, the cost will be the same, and the sum of minimum costs will be 0.</p> <p>Repeat for C, G and T.</p> <p><img src="http://www.cs.rice.edu/~ogilvie/assets/sankoff-2.png" alt="Sankoff" /></p> <p>Now consider the MRCA of humans, chimps and gorillas. For state A, the cost of transitioning to A in the human/chimp MRCA will be 0 + 0 = 0, to C will be 5 + 10 = 15, to G will be 1 + 2 = 3, and to T will be 5 + 10 = 15. So the minimum along the left branch is 0. The cost of transitioning from A to C in gorillas will be 5 + 0 = 5, and from A to other gorilla states will be ∞. Therefore the minimum cost along the right branch is 5, and the parsimony score for state A is 0 + 5 = 5.</p> <p><img src="http://www.cs.rice.edu/~ogilvie/assets/sankoff-3.png" alt="Sankoff" /></p> <p>Repeat the above for the remaining nodes.
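<p>The per-node update just described can be sketched in Python; the function name and explicit vectors are illustrative assumptions, while the cost matrix and scores match the worked example:</p>

```python
import numpy

INF = float("inf")

# the DNA cost matrix from above, indexed A = 0, C = 1, G = 2, T = 3
cost = numpy.array([[0, 5, 1, 5],
                    [5, 0, 5, 1],
                    [1, 5, 0, 5],
                    [5, 1, 5, 0]], dtype=float)

# one Sankoff update: for each parent state, add the cheapest choice of child
# state (transition cost plus child score) for the left and right branches
def sankoff_combine(left_scores, right_scores):
    scores = numpy.zeros(4)
    for parent in range(4):
        left_min = min(cost[parent, c] + left_scores[c] for c in range(4))
        right_min = min(cost[parent, c] + right_scores[c] for c in range(4))
        scores[parent] = left_min + right_min
    return scores

human = numpy.array([0, INF, INF, INF])    # observed state A
chimp = numpy.array([0, INF, INF, INF])    # observed state A
gorilla = numpy.array([INF, 0, INF, INF])  # observed state C

hc = sankoff_combine(human, chimp)         # A = 0, C = 10, G = 2, T = 10
homininae = sankoff_combine(hc, gorilla)   # state A: 0 + 5 = 5, as in the text
```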
Here we are walking the tree postorder, but as with the Fitch algorithm, levelorder would work too.</p> <p><img src="http://www.cs.rice.edu/~ogilvie/assets/sankoff-4.png" alt="Sankoff" /></p> <p>Finally, the parsimony score for the entire tree is the minimum score out of the root states: for this tree and site pattern, 10. As with equal-cost parsimony, the score for an entire multiple sequence alignment or character matrix is the sum of parsimony scores for each position or for each character respectively.</p>Huw A. OgilvieCertain mutations are more surprising than others. DNA is composed of a string of nucleotides, which are either pyrimidines (cytosine or thymine) or purines (adenine or guanine). A single point mutation to DNA is either a transition from one pyrimidine to another or one purine to another, or a transversion from a purine to a pyrimidine or vice versa. Transitions are biochemically easier than transversions, and hence much more commonly occurring in the evolution of genomes.Backward algorithm2019-10-13T16:00:00+11:002019-10-13T16:00:00+11:00http://www.cs.rice.edu/~ogilvie/comp571/2019/10/13/backward-algorithm<p>Like the forward algorithm, we can use the backward algorithm to calculate the marginal likelihood of a hidden Markov model (HMM). Also like the forward algorithm, the backward algorithm is an instance of dynamic programming where the intermediate values are probabilities.</p> <p>Recall the forward matrix values can be specified as:</p> <p>f<sub><em>i</em>,<em>k</em></sub> = P(x<sub>1..<em>i</em></sub>,π<sub><em>i</em></sub>=k|M)</p> <p>That is, the forward matrix contains joint probabilities for the sequence up to the <em>i</em><sup>th</sup> position, and the state at that position being <em>k</em>.
These joint probabilities are not conditional on the previous states; instead they marginalize over the hidden state path leading up to <em>i</em>,<em>k</em>.</p> <p>In contrast, the backward matrix contains probabilities for the sequence <em>after</em> the <em>i</em><sup>th</sup> position, and these probabilities are conditional on the state being <em>k</em> at <em>i</em>:</p> <p>b<sub><em>i</em>,<em>k</em></sub> = P(x<sub><em>i</em>+1..<em>n</em></sub>|π<sub><em>i</em></sub>=k,M)</p> <p>To demonstrate the backward algorithm, we will use the same example sequence and HMM as for the Viterbi and forward algorithm demonstrations:</p> <p><img src="http://www.cs.rice.edu/~ogilvie/assets/cpg-island-hmm-log.png" alt="CG rich island HMM" /></p> <p>The backward matrix probabilities are marginalized over the hidden state path after <em>i</em>. To calculate them, initialize a backward matrix <em>b</em> of the same dimensions as the corresponding forward matrix. We will work in log space, so use negative infinity for the start state other than at the start position <em>i</em> = 0, and for the non-start states at the start position.
The probability of an empty sequence after the last position is 100% regardless of the state at the last position, so fill in zeros for the non-start states at the last column <em>i</em> = <em>n</em>:</p> <table> <thead> <tr> <th style="text-align: right"> </th> <th style="text-align: right">{}</th> <th style="text-align: right">G</th> <th style="text-align: right">G</th> <th style="text-align: right">C</th> <th style="text-align: right">A</th> <th style="text-align: right">C</th> <th style="text-align: right">T</th> <th style="text-align: right">G</th> <th style="text-align: right">A</th> <th style="text-align: right">A</th> </tr> </thead> <tbody> <tr> <td style="text-align: right">start</td> <td style="text-align: right"> </td> <td style="text-align: right">-∞</td> <td style="text-align: right">-∞</td> <td style="text-align: right">-∞</td> <td style="text-align: right">-∞</td> <td style="text-align: right">-∞</td> <td style="text-align: right">-∞</td> <td style="text-align: right">-∞</td> <td style="text-align: right">-∞</td> <td style="text-align: right">-∞</td> </tr> <tr> <td style="text-align: right">CG rich</td> <td style="text-align: right">-∞</td> <td style="text-align: right"> </td> <td style="text-align: right"> </td> <td style="text-align: right"> </td> <td style="text-align: right"> </td> <td style="text-align: right"> </td> <td style="text-align: right"> </td> <td style="text-align: right"> </td> <td style="text-align: right"> </td> <td style="text-align: right">0.0</td> </tr> <tr> <td style="text-align: right">CG poor</td> <td style="text-align: right">-∞</td> <td style="text-align: right"> </td> <td style="text-align: right"> </td> <td style="text-align: right"> </td> <td style="text-align: right"> </td> <td style="text-align: right"> </td> <td style="text-align: right"> </td> <td style="text-align: right"> </td> <td style="text-align: right"> </td> <td style="text-align: right">0.0</td> </tr> </tbody> </table> <p>To calculate the backward 
probabilities for a given non-start hidden state <em>k</em> at the second-to-last position <em>i</em> = <em>n</em> - 1 through to the position of the first character <em>i</em> = 1, gather the following log probabilities for each non-start hidden state <em>k’</em> at position <em>i</em> + 1:</p> <ol> <li>the emission probability e<sub><em>i</em>+1,<em>k’</em></sub> of the observed state (character) at <em>i</em> + 1 given <em>k’</em></li> <li>the hidden state transition probability t<sub><em>k</em>,<em>k’</em></sub> from state <em>k</em> at <em>i</em> to state <em>k’</em> at <em>i</em> + 1</li> <li>the probability <em>b</em><sub><em>i</em>+1,<em>k’</em></sub> of the sequence after <em>i</em> + 1 given state <em>k’</em> at <em>i</em> + 1</li> </ol> <p>The sum of the above log probabilities gives us the probability for the character and hidden state at <em>i</em> + 1, given a particular state at <em>i</em>. Beginning at the second last position <em>i</em> = <em>n</em> - 1, moving from the CG rich state to the CG rich state, the log probabilities are e<sub><em>i</em>+1,<em>k’</em></sub> = -2, t<sub><em>k</em>,<em>k’</em></sub> = -0.5 and <em>b</em><sub><em>i</em>+1,<em>k’</em></sub> = 0 respectively, and their sum is -2.5. The first two are from the HMM, and the last is from the backward matrix. When moving from the CG rich state to the CG poor state, they are -1, -1 and 0 respectively and the sum is -2.</p> <p>Finally, we can calculate <em>b</em><sub><em>i</em>,<em>k</em></sub> by marginalizing over both possible transitions. 
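<p>In Python, one backward update in log space looks something like the sketch below; the dictionary layout and function name are illustrative assumptions, while the log probabilities are the example HMM values used in the text:</p>

```python
import math

states = ("CG rich", "CG poor")

# log emission probabilities of the final character "A", and log transition
# probabilities between the hidden states, as used in the text
emit_A = {"CG rich": -2.0, "CG poor": -1.0}
trans = {("CG rich", "CG rich"): -0.5, ("CG rich", "CG poor"): -1.0,
         ("CG poor", "CG rich"): -1.0, ("CG poor", "CG poor"): -0.5}

# b[i,k] = log of the sum over k' of exp(e[i+1,k'] + t[k,k'] + b[i+1,k'])
def backward_step(emit_next, b_next):
    b = {}
    for k in states:
        b[k] = math.log(sum(math.exp(emit_next[k2] + trans[(k, k2)] + b_next[k2])
                            for k2 in states))
    return b

b_n = {"CG rich": 0.0, "CG poor": 0.0}  # empty suffix: log(1) = 0
b_prev = backward_step(emit_A, b_n)     # CG rich: about -1.5, CG poor: about -1.3
```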
For the CG rich hidden state at the second-to-last position:</p> <p><em>b</em><sub><em>n</em>-1,<em>CG rich</em></sub> = log(P(x<sub><em>n</em></sub>|π<sub><em>n</em>-1</sub>=CG rich,M)) = log(e<sup>-2.5</sup> + e<sup>-2</sup>) = -1.5</p> <p>Likewise for the CG poor state at position <em>n</em> - 1, the marginal log probability is:</p> <p><em>b</em><sub><em>n</em>-1,<em>CG poor</em></sub> = log(P(x<sub><em>n</em></sub>|π<sub><em>n</em>-1</sub>=CG poor,M)) = log(e<sup>-3</sup> + e<sup>-1.5</sup>) = -1.3</p> <p>We can now update the matrix:</p> <table> <thead> <tr> <th style="text-align: right"> </th> <th style="text-align: right">{}</th> <th style="text-align: right">G</th> <th style="text-align: right">G</th> <th style="text-align: right">C</th> <th style="text-align: right">A</th> <th style="text-align: right">C</th> <th style="text-align: right">T</th> <th style="text-align: right">G</th> <th style="text-align: right">A</th> <th style="text-align: right">A</th> </tr> </thead> <tbody> <tr> <td style="text-align: right">start</td> <td style="text-align: right"> </td> <td style="text-align: right">-∞</td> <td style="text-align: right">-∞</td> <td style="text-align: right">-∞</td> <td style="text-align: right">-∞</td> <td style="text-align: right">-∞</td> <td style="text-align: right">-∞</td> <td style="text-align: right">-∞</td> <td style="text-align: right">-∞</td> <td style="text-align: right">-∞</td> </tr> <tr> <td style="text-align: right">CG rich</td> <td style="text-align: right">-∞</td> <td style="text-align: right">-11.2</td> <td style="text-align: right">-9.9</td> <td style="text-align: right">-8.6</td> <td style="text-align: right">-7.0</td> <td style="text-align: right">-5.7</td> <td style="text-align: right">-4.1</td> <td style="text-align: right">-2.9</td> <td style="text-align: right">-1.5</td> <td style="text-align: right">0.0</td> </tr> <tr> <td style="text-align: right">CG poor</td> <td style="text-align: right">-∞</td> <td style="text-align:
right">-11.5</td> <td style="text-align: right">-10.1</td> <td style="text-align: right">-8.5</td> <td style="text-align: right">-7.2</td> <td style="text-align: right">-5.6</td> <td style="text-align: right">-4.3</td> <td style="text-align: right">-2.6</td> <td style="text-align: right">-1.3</td> <td style="text-align: right">0.0</td> </tr> </tbody> </table> <p>At the start position <em>i</em> = 0, the only valid hidden state is the start state. Therefore at that position we only need to calculate the probability of going from the start state to the CG rich or CG poor states. For moving to the CG rich state, the log probabilities are e<sub><em>i</em>+1,<em>k’</em></sub> = -1.0, t<sub><em>k</em>,<em>k’</em></sub> = -0.7 and <em>b</em><sub><em>i</em>+1,<em>k’</em></sub> = -11.2. For moving to the CG poor state, they are -2.0, -0.7 and -11.5 respectively. The sums are -12.9 and -14.2 respectively, and the log sum of exponentials is -12.7. We will use this to complete the backward matrix:</p> <table> <thead> <tr> <th style="text-align: right"> </th> <th style="text-align: right">{}</th> <th style="text-align: right">G</th> <th style="text-align: right">G</th> <th style="text-align: right">C</th> <th style="text-align: right">A</th> <th style="text-align: right">C</th> <th style="text-align: right">T</th> <th style="text-align: right">G</th> <th style="text-align: right">A</th> <th style="text-align: right">A</th> </tr> </thead> <tbody> <tr> <td style="text-align: right">start</td> <td style="text-align: right">-12.7</td> <td style="text-align: right">-∞</td> <td style="text-align: right">-∞</td> <td style="text-align: right">-∞</td> <td style="text-align: right">-∞</td> <td style="text-align: right">-∞</td> <td style="text-align: right">-∞</td> <td style="text-align: right">-∞</td> <td style="text-align: right">-∞</td> <td style="text-align: right">-∞</td> </tr> <tr> <td style="text-align: right">CG rich</td> <td style="text-align: right">-∞</td> <td 
style="text-align: right">-11.2</td> <td style="text-align: right">-9.9</td> <td style="text-align: right">-8.6</td> <td style="text-align: right">-7.0</td> <td style="text-align: right">-5.7</td> <td style="text-align: right">-4.1</td> <td style="text-align: right">-2.9</td> <td style="text-align: right">-1.5</td> <td style="text-align: right">0.0</td> </tr> <tr> <td style="text-align: right">CG poor</td> <td style="text-align: right">-∞</td> <td style="text-align: right">-11.5</td> <td style="text-align: right">-10.1</td> <td style="text-align: right">-8.5</td> <td style="text-align: right">-7.2</td> <td style="text-align: right">-5.6</td> <td style="text-align: right">-4.3</td> <td style="text-align: right">-2.6</td> <td style="text-align: right">-1.3</td> <td style="text-align: right">0.0</td> </tr> </tbody> </table> <p>Because the only valid hidden state for the start position is the start state, the probability P(x<sub><em>i</em>+1..<em>n</em></sub>|π<sub><em>i</em></sub>=k,M) can be simplified to P(x<sub><em>i</em>+1..<em>n</em></sub>|M). Because the sequence after the start position is the entire sequence, it can be further simplified to P(x|M). In other words, this probability is our marginal likelihood! While this is slightly different from the marginal likelihood of -12.6 derived using the forward algorithm, that is a rounding error caused by our limited precision of one decimal place.</p> <p>Why do we need two dynamic programming algorithms to compute the marginal likelihood? We don’t! But by combining probabilities from the two matrices, we can derive the posterior probability of each hidden state <em>k</em> at each position <em>i</em>, marginalized over all paths through <em>k</em> at <em>i</em>. How does this work?
Let’s use Bayes’ rule to demonstrate:</p> <p>P(π<sub><em>i</em></sub>=<em>k</em>|x,M) = P(x|π<sub><em>i</em></sub>=<em>k</em>,M) × P(π<sub><em>i</em></sub>=<em>k</em>|M) / P(x|M)</p> <p>If two variables <em>a</em> and <em>b</em> are independent, their joint probability P(<em>a</em>,<em>b</em>) is simply the product of their probabilities P(<em>a</em>) × P(<em>b</em>). Normally the two segments of the sequence x<sub>1..<em>i</em></sub> and x<sub><em>i</em>+1..<em>n</em></sub> are not independent because we are using a hidden Markov model. Under our model, the distribution of characters at a given site is dependent on the hidden state at that site, which in turn is dependent on the hidden state at the previous site.</p> <p>But by conditioning on the hidden state at a given site <em>i</em>, the sequence after that site x<sub><em>i</em>+1..<em>n</em></sub> is independent of the sequence up to and including <em>i</em>. This is because the hidden state at <em>i</em> is fixed rather than depending on the previous hidden state, or the observed character at <em>i</em>. In other words, while P(x<sub>1..<em>i</em></sub>|M) and P(x<sub><em>i</em>+1..<em>n</em></sub>|M) are not independent, P(x<sub>1..<em>i</em></sub>|π<sub><em>i</em></sub>=<em>k</em>,M) and P(x<sub><em>i</em>+1..<em>n</em></sub>|π<sub><em>i</em></sub>=<em>k</em>,M) are! Therefore:</p> <p>P(π<sub><em>i</em></sub>=<em>k</em>|x,M) = P(x<sub>1..<em>i</em></sub>|π<sub><em>i</em></sub>=<em>k</em>,M) × P(x<sub><em>i</em>+1..<em>n</em></sub>|π<sub><em>i</em></sub>=<em>k</em>,M) × P(π<sub><em>i</em></sub>=<em>k</em>|M) / P(x|M)</p> <p>By applying the <a href="https://en.wikipedia.org/wiki/Chain_rule_(probability)">chain rule</a>, we can take the third term of the expression on the right side of our equation, and fold it into the first term of that expression. 
This changes the conditional probability to a joint probability:</p> <p>P(π<sub><em>i</em></sub>=<em>k</em>|x,M) = P(x<sub>1..<em>i</em></sub>,π<sub><em>i</em></sub>=<em>k</em>|M) × P(x<sub><em>i</em>+1..<em>n</em></sub>|π<sub><em>i</em></sub>=<em>k</em>,M) / P(x|M)</p> <p>On the right side of the equation, the first term now corresponds to <em>f</em><sub><em>i</em>,<em>k</em></sub>, the second term to <em>b</em><sub><em>i</em>,<em>k</em></sub>, and the third to <em>b</em><sub><em>0</em>,<em>start</em></sub>. This makes it possible to replace every term of the right side expression with matrix coordinates:</p> <p>P(π<sub><em>i</em></sub>=<em>k</em>|x,M) = <em>f</em><sub><em>i</em>,<em>k</em></sub> × <em>b</em><sub><em>i</em>,<em>k</em></sub> / <em>b</em><sub><em>0</em>,<em>start</em></sub></p> <p>Now we can “decode” our posterior distribution of hidden states. We need to refer back to the previously calculated forward matrix, shown below.</p> <table> <thead> <tr> <th style="text-align: right"> </th> <th style="text-align: right">{}</th> <th style="text-align: right">G</th> <th style="text-align: right">G</th> <th style="text-align: right">C</th> <th style="text-align: right">A</th> <th style="text-align: right">C</th> <th style="text-align: right">T</th> <th style="text-align: right">G</th> <th style="text-align: right">A</th> <th style="text-align: right">A</th> </tr> </thead> <tbody> <tr> <td style="text-align: right">start</td> <td style="text-align: right">0.0</td> <td style="text-align: right">-∞</td> <td style="text-align: right">-∞</td> <td style="text-align: right">-∞</td> <td style="text-align: right">-∞</td> <td style="text-align: right">-∞</td> <td style="text-align: right">-∞</td> <td style="text-align: right">-∞</td> <td style="text-align: right">-∞</td> <td style="text-align: right">-∞</td> </tr> <tr> <td style="text-align: right">CG rich</td> <td style="text-align: right">-∞</td> <td style="text-align: right">-1.7</td> <td style="text-align:
right">-3.0</td> <td style="text-align: right">-4.3</td> <td style="text-align: right">-6.6</td> <td style="text-align: right">-7.3</td> <td style="text-align: right">-9.6</td> <td style="text-align: right">-10.2</td> <td style="text-align: right">-12.5</td> <td style="text-align: right">-14.1</td> </tr> <tr> <td style="text-align: right">CG poor</td> <td style="text-align: right">-∞</td> <td style="text-align: right">-2.7</td> <td style="text-align: right">-4.2</td> <td style="text-align: right">-5.6</td> <td style="text-align: right">-5.9</td> <td style="text-align: right">-8.1</td> <td style="text-align: right">-8.7</td> <td style="text-align: right">-11.0</td> <td style="text-align: right">-11.6</td> <td style="text-align: right">-12.9</td> </tr> </tbody> </table> <p>As an example, let’s solve the posterior probability that the hidden state of the fourth character is CG rich:</p> <p>P(π<sub><em>4</em></sub>=<em>CG rich</em>|x,M) = <em>f</em><sub><em>4</em>,<em>CG rich</em></sub> × <em>b</em><sub><em>4</em>,<em>CG rich</em></sub> / <em>b</em><sub><em>0</em>,<em>start</em></sub> = e<sup>-7.0</sup> × e<sup>-6.6</sup> / e<sup>-12.7</sup> = 41%</p> <p>Since we only have two states, given a 41% posterior probability of the CG rich state, the probability of the CG poor state should be 59%, but the rounding errors caused by our lack of precision are causing serious problems:</p> <p>P(π<sub><em>4</em></sub>=<em>CG poor</em>|x,M) = <em>f</em><sub><em>4</em>,<em>CG poor</em></sub> × <em>b</em><sub><em>4</em>,<em>CG poor</em></sub> / <em>b</em><sub><em>0</em>,<em>start</em></sub> = e<sup>-7.2</sup> × e<sup>-5.9</sup> / e<sup>-12.7</sup> = 67%</p> <p>Using the above method with a high degree of precision, the posterior probabilities are more precisely calculated as 37% and 63% respectively. 
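<p>The decoding step can be sketched directly from the two matrices; the values below are the rounded forward and backward log probabilities for position 4 from the tables above:</p>

```python
import math

# forward and backward log probabilities at position i = 4, and the log
# marginal likelihood b[0, start], all rounded to one decimal place
f4 = {"CG rich": -7.0, "CG poor": -7.2}
b4 = {"CG rich": -6.6, "CG poor": -5.9}
log_marginal = -12.7

posterior = {k: math.exp(f4[k] + b4[k] - log_marginal) for k in f4}
# CG rich comes out near 0.41 and CG poor near 0.67; they sum to more than
# one only because of the one-decimal-place rounding discussed above
```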
The posterior probabilities can be shown as a graph in order to clearly communicate your results:</p> <p><img src="http://www.cs.rice.edu/~ogilvie/assets/cpg-posterior.png" alt="CG rich island HMM" /></p> <p>This gives us a result that reflects the uncertainty of our inference given the limited data at hand. In my opinion, this presentation is more honest than the black-and-white maximum <em>a posteriori</em> result derived using Viterbi’s algorithm.</p> <p>For another perspective on HMMs, including the Viterbi and Forward-Backward algorithms, consult Chapter 10 of <a href="http://bioinformaticsalgorithms.com/">Bioinformatics Algorithms</a> (2nd or 3rd Edition) by Compeau and Pevzner.</p>Huw A. OgilvieLike the forward algorithm, we can use the backward algorithm to calculate the marginal likelihood of a hidden Markov model (HMM). Also like the forward algorithm, the backward algorithm is an instance of dynamic programming where the intermediate values are probabilities.COMP571/BIOC571 (Fall 2019)2019-08-29T15:00:00+10:002019-08-29T15:00:00+10:00http://www.cs.rice.edu/~ogilvie/comp571/2019/08/29/comp571-bioc571<p><strong>Important note:</strong> The information contained in the course syllabus, other than the absence policies, may be subject to change with reasonable advance notice, as deemed appropriate by the instructor.</p> <h1 id="who">Who</h1> <p>Instructor:</p> <ul> <li>Huw A. 
Ogilvie</li> <li>Duncan Hall 3098</li> <li><a href="mailto:hao3@rice.edu">hao3@rice.edu</a></li> </ul> <p>TA:</p> <ul> <li>Zhen Cao</li> <li>Duncan Hall 3061</li> <li><a href="mailto:zc36@rice.edu">zc36@rice.edu</a></li> </ul> <h1 id="where-and-when">Where and when</h1> <p>Distribution of class materials and assignment submission will be conducted via <a href="https://canvas.rice.edu/">Canvas</a>.</p> <p>Seminars will be held in Duncan Hall <strong>1046</strong>, on Tuesdays and Thursdays, between 2:30–3:45 PM.</p> <p>One scheduled office hour will be held each week, at 1:00 PM on Thursdays in Duncan Hall 3061. Individual appointments outside this time are welcome.</p> <h1 id="intended-audience">Intended audience</h1> <p>The students who should take COMP571/BIOC571 are generally studying computer science, biology or genomics, and wish to learn how to apply algorithms and statistical models to important problems in biology and genomics.</p> <h1 id="course-objectives-and-learning-outcomes">Course objectives and learning outcomes</h1> <p>The primary objective of the course is to teach the theory behind methods in biological sequence analysis, including sequence alignment, sequence motifs, and phylogenetic tree reconstruction. By the end of the course, students are expected to understand and be able to write basic implementations of the algorithms which power those methods.</p> <h1 id="course-materials">Course materials</h1> <p>The main material for this course will be the course blog. But, if you wish to purchase a textbook, I highly recommend <strong>Bioinformatics Algorithms</strong> by Compeau &amp; Pevzner. This is the recommended text for COMP416 “Genome-Scale Algorithms”, and I will give references to relevant chapters in the third edition.</p> <h1 id="software-for-the-course">Software for the course</h1> <p>Algorithms and statistics will be demonstrated using Python. 
Don’t worry if you are not fluent in Python, as no programs will have to be written from scratch.</p> <p>The <a href="http://www.numpy.org/">NumPy</a> library for scientific computing will be used with Python. To install NumPy, first install the latest official distribution of Python 3. This can be downloaded for <a href="https://www.python.org/downloads/mac-osx/">macOS</a> or for <a href="https://www.python.org/downloads/windows/">Windows</a> from Python.org, and should already be included with your operating system if you are using Linux.</p> <p>Then simply use the Python package manager pip to install NumPy from the command line, by running <code class="highlighter-rouge">pip3 install numpy</code>.</p> <h1 id="schedule">Schedule</h1> <p>The course is organized around four themes:</p> <ol> <li>Models and algorithms used for sequence alignment</li> <li>Hidden Markov Models in computational biology</li> <li>Phylogenetic parsimony and likelihood</li> <li>Tree search methods</li> </ol> <p>Each theme will have a corresponding homework assignment. 
Themes 1 and 2 will be covered in the first midterm, and themes 3 and 4 in the second midterm.</p> <p><em>The schedule below may change subject to Rice University policy</em></p> <table> <thead> <tr> <th>Week</th> <th>Tuesday class</th> <th>Thursday class</th> <th>Homework</th> </tr> </thead> <tbody> <tr> <td>08/26/2019</td> <td>No class</td> <td>Introduction, genomes, central dogma and homology</td> <td> </td> </tr> <tr> <td>09/02/2019</td> <td>Empirical substitution matrices<sup>1</sup></td> <td>Global alignment<sup>1</sup></td> <td> </td> </tr> <tr> <td>09/09/2019</td> <td>Local alignment<sup>1</sup></td> <td>BLAST and BLAT<sup>1</sup></td> <td> </td> </tr> <tr> <td>09/16/2019</td> <td>PSSMs<sup>1</sup></td> <td>Pseudocounts<sup>1</sup></td> <td>#1 issued</td> </tr> <tr> <td>09/23/2019</td> <td>Hidden Markov models<sup>2</sup></td> <td>Viterbi algorithm<sup>2</sup></td> <td> </td> </tr> <tr> <td>09/30/2019</td> <td>Forward algorithm<sup>2</sup></td> <td>Backward algorithm<sup>2</sup></td> <td>#1 due</td> </tr> <tr> <td>10/07/2019</td> <td>Applications of HMMs<sup>2</sup></td> <td>Midterm review<sup>1,2</sup></td> <td>#2 issued</td> </tr> <tr> <td>10/14/2019</td> <td>Midterm recess</td> <td>Midterm exam<sup>1,2</sup></td> <td> </td> </tr> <tr> <td>10/21/2019</td> <td>Phylogenetic trees<sup>3</sup></td> <td>Post-midterm review<sup>3</sup></td> <td>#2 due</td> </tr> <tr> <td>10/28/2019</td> <td>Equal-cost parsimony<sup>3</sup></td> <td>Unequal-cost parsimony<sup>3</sup></td> <td> </td> </tr> <tr> <td>11/04/2019</td> <td>Likelihood of two sequences<sup>3</sup></td> <td>Felsenstein’s pruning algorithm<sup>3</sup></td> <td> </td> </tr> <tr> <td>11/11/2019</td> <td>The Felsenstein zone<sup>3</sup></td> <td>Hill climbing and MCMC<sup>4</sup></td> <td>#3 issued</td> </tr> <tr> <td>11/18/2019</td> <td>UPGMA and neighbor joining<sup>4</sup></td> <td>Molecular clocks<sup>4</sup></td> <td>#4 issued</td> </tr> <tr> <td>11/25/2019</td> <td>Course review<sup>4</sup></td> 
<td>Thanksgiving recess</td> <td>#3 due</td> </tr> <tr> <td>12/02/2019</td> <td>No class</td> <td>Final exam<sup>3,4</sup></td> <td>#4 due</td> </tr> </tbody> </table> <p>Superscript numbers refer to the theme(s) for that day’s class or midterm. Assignments will be both issued and due before midnight on Sundays.</p> <h1 id="grade-policies">Grade policies</h1> <ul> <li>First in-class midterm: 25%</li> <li>Second in-class midterm: 25%</li> <li>Four homework assignments: 12.5% each</li> </ul> <p>Students with a strong and valid excuse for not attending a midterm will be allowed to pick from one of the following options:</p> <ul> <li>Sit the midterm on a different day or time</li> <li>Adjust their grading to increase the contribution of the corresponding homework assignments to match the midterm’s contribution</li> <li>Adjust their grading to double the contribution of the alternate midterm</li> </ul> <p>Students with a strong and valid excuse for being unable to submit a homework assignment will be allowed to pick from one of the following options:</p> <ul> <li>Submit the homework assignment on a later day and time</li> <li>Adjust their grading to increase the contribution of the other homework assignments to match the assignment contribution</li> <li>Adjust their grading to increase the contribution of the corresponding midterm to match the assignment contribution</li> </ul> <p>For both assignments and midterms the strength and validity of excuses, and which of the above options are made available, will be solely the instructor’s purview. Without a strong and valid excuse, a penalty of 10 percentage points per day (which is equivalent to 1.25 points off the final course percent per day) will be applied to any assignment submitted after the deadline.</p> <h1 id="absence-policies">Absence policies</h1> <p>Attendance is expected at every class. 
Attendance for the midterm exams is compulsory and, without a strong and valid excuse, required to pass the course even if a student would have otherwise received a passing grade.</p> <h1 id="rice-honor-code">Rice Honor Code</h1> <p>In this course, all students will be held to the standards of the Rice Honor Code, a code that you pledged to honor when you matriculated at this institution. If you are unfamiliar with the details of this code and how it is administered, you should consult the Honor System Handbook at <a href="http://honor.rice.edu/honor-system-handbook/">http://honor.rice.edu/honor-system-handbook/</a>. This handbook outlines the University’s expectations for the integrity of your academic work, the procedures for resolving alleged violations of those expectations, and the rights and responsibilities of students and faculty members throughout the process.</p> <h1 id="students-with-a-disability">Students with a disability</h1> <p>If you have a documented disability or other condition that may affect academic performance you should: 1) make sure this documentation is on file with Disability Support Services (Allen Center, Room 111 / <a href="mailto:adarice@rice.edu">adarice@rice.edu</a> / x5841) to determine the accommodations you need; and 2) talk with me to discuss your accommodation needs.</p>Huw A. OgilvieImportant note: The information contained in the course syllabus, other than the absence policies, may be subject to change with reasonable advance notice, as deemed appropriate by the instructor.Priors and clock models in StarBEAST2 tutorial2019-03-27T16:00:00+11:002019-03-27T16:00:00+11:00http://www.cs.rice.edu/~ogilvie/oeb125/2019/03/27/clock-priors<h1 id="step-0-download-the-example-data-set">Step 0: Download the example data set</h1> <p>The following programs should already be installed:</p> <ul> <li>BEAST 2</li> <li>BEAUti 2</li> <li>DensiTree</li> <li>StarBEAST 2</li> <li>Tracer</li> </ul> <p>The first three come with the BEAST 2 package. 
StarBEAST2 is an add-on for BEAST 2. Tracer is a separate program from BEAST 2, and can be used to inspect the output of any Bayesian program that uses the MCMC algorithm.</p> <p>Download the <a href="http://www.cs.rice.edu/~ogilvie/assets/canis.zip">example archive</a>. This is a collection of multiple sequence alignments from species of <em>Canis</em> and closely related genera. After downloading the archive, extract it somewhere.</p> <h1 id="step-1-open-the-starbeast2-template">Step 1: Open the StarBEAST2 template</h1> <p>Open “BEAUti” (in the BEAST2 folder), which is a GUI application for configuring a BEAST2 analysis. Now select the StarBEAST2 template for multispecies coalescent analyses:</p> <p><img src="http://www.cs.rice.edu/~ogilvie/assets/open-sb2-template.png" alt="Open StarBEAST2 template" /></p> <p>The title bar should have changed to “BEAUti 2: StarBeast2”</p> <h1 id="step-2-import-multiple-sequence-alignments">Step 2: Import multiple sequence alignments</h1> <p>Now import the multiple sequence alignments you previously downloaded and extracted. Either click the plus button in the bottom left, or use the Import Alignment menu item:</p> <p><img src="http://www.cs.rice.edu/~ogilvie/assets/import-alignment.png" alt="Import alignment menu option" /></p> <p>Each locus has its own fasta file in the example data set. Select all of the loci to import, then select “all are nucleotide” when asked to specify the datatype:</p> <p><img src="http://www.cs.rice.edu/~ogilvie/assets/open.png" alt="Select all fasta files" /></p> <h1 id="step-3-link-clock-models">Step 3: Link clock models</h1> <p>All of the loci should appear in the main BEAUti window now. Select all of them, and choose to “Link Clock Models” using the button at the top. This will enable estimating a weighted average clock rate for all loci. 
After linking, all loci should be sharing the same clock model name:</p> <p><img src="http://www.cs.rice.edu/~ogilvie/assets/shared-clock-model.png" alt="Shared clock model" /></p> <h1 id="step-3-specify-taxon-sets">Step 4: Specify taxon sets</h1> <p>Open the “Taxon sets” tab. This is where the mapping between the names used for gene sequences and the names of species is constructed. Click the button labelled “Guess” and select “Before Last”:</p> <p><img src="http://www.cs.rice.edu/~ogilvie/assets/before-last.png" alt="Guess taxon sets using Before Last" /></p> <p>Click OK. Notice that the taxon names all had an “_x” on the end. This is because BEAST gets mad if the names of species and the names used for gene sequences are the same. Adding this suffix, and removing it in BEAUti to get the species names, is a way around that issue. Your taxon sets should look like this:</p> <p><img src="http://www.cs.rice.edu/~ogilvie/assets/taxon-sets-done.png" alt="Completed taxon sets" /></p> <h1 id="step-4-estimate-the-clock-rate">Step 5: Estimate the clock rate</h1> <p>Open the Clock Model tab, and enable “estimate” next to the clock rate. Change the rate to 0.001: this is the initial value of the rate, and changing it to a value closer to the posterior mode will help our analysis to converge quickly. The Clock Model tab should now look like this:</p> <p><img src="http://www.cs.rice.edu/~ogilvie/assets/clock-model-done.png" alt="Clock model with estimated rate" /></p> <h1 id="step-5-specify-the-clock-rate-prior">Step 6: Specify the clock rate prior</h1> <p>Go to the Priors panel, and expand the “strictClockRate” prior. Change the mean “M” to 0.001, and the standard deviation “S” to 0.1. The prior distribution on clock rate should now look like this:</p> <p><img src="http://www.cs.rice.edu/~ogilvie/assets/clock-prior-distribution.png" alt="Clock rate prior distribution" /></p> <h1 id="step-6-save-and-launch">Step 7: Save and launch</h1> <p>Save your configuration to a new folder, called something like “slow prior”. Launch BEAST, and open the configuration file you just created. Start the analysis, which will take 5 to 10 minutes to complete.</p> <h1 id="step-7-rinse-and-repeat">Step 8: Rinse and repeat</h1> <p>Repeat steps 1 through 7, but this time specify an initial clock rate of 0.01, and a clock rate mean “M” of 0.01. Use the same standard deviation “S” of 0.1. Make sure to save your new configuration file in a <strong>different</strong> folder, called something like “fast prior”.</p> <h1 id="step-8-interrogate-the-results-using-tracer">Step 9: Interrogate the results using Tracer</h1> <p>Open Tracer, and select Import Trace File from the file menu. Open the “starbeast.log” file from your “slow prior” folder. Then open the “starbeast.log” file from your “fast prior” folder in the same Tracer window. Highlight both trace files:</p> <p><img src="http://www.cs.rice.edu/~ogilvie/assets/tracer-posterior.png" alt="Posterior density box plots" /></p> <p>Select different statistics to see if their distributions are different between the analyses. In particular, look at the strictClockRate distributions (which should be the last statistic):</p> <p><img src="http://www.cs.rice.edu/~ogilvie/assets/tracer-clock-rate.png" alt="Clock rate box plots" /></p> <h1 id="step-9-look-at-the-trees-in-densitree">Step 10: Look at the trees in DensiTree</h1> <p>Once you are finished with Tracer, explore the different tree files for both analyses with DensiTree. The gene tree files are named based on the locus, and the species tree files are always called “species.trees”. 
For example, open the TRSP posterior distribution (<code class="highlighter-rouge">TRSP.trees</code>) in DensiTree, and enable the “Full grid” so you can see the divergence times:</p> <p><img src="http://www.cs.rice.edu/~ogilvie/assets/trsp-slow.png" alt="Slow TRSP tree" /></p> <p>You can use DensiTree to calculate the posterior probability of clades. This is the probability that a group of sequences or species are monophyletic (share a single common ancestor to the exclusion of all other sequences or species in the data set), given the data set and the model. Doing this is a little convoluted:</p> <ol> <li>Select the “Central” display mode (it won’t work with the default).</li> <li>Enable the “clade toolbar” by selecting “view clade toolbar” in the window menu.</li> <li>Open the “Clades” “folder” on the right hand side and enable “Show clades”.</li> </ol> <p>The size of each circle is somewhat proportional to the posterior probability of the corresponding clade, and the position along the X-axis is the expectation of the age of that clade. Take a note of the posterior probability and heights of some of the clades, for example:</p> <p><img src="http://www.cs.rice.edu/~ogilvie/assets/lupus-anthus-clade.png" alt="TRSP lupus anthus node" /></p> <p>Now compare the clades you can see and their heights in this gene tree with the clades in the corresponding “fast prior” gene tree, and both the “fast” and “slow” species trees.</p>Huw A. OgilvieStep 0: Download the example data setMaximum Parsimony tutorial using PAUP2019-02-20T16:00:00+11:002019-02-20T16:00:00+11:00http://www.cs.rice.edu/~ogilvie/phylogenetics-workshop/2019/02/20/paup-tutorial<h1 id="step-1-downloading-and-installing-software">Step 1: Downloading and installing software</h1> <p>For this tutorial the programs we will use are <a href="http://doua.prabi.fr/software/seaview">SeaView</a>, <a href="https://paup.phylosolutions.com/">PAUP</a>, and the text editor of your choice. 
SeaView has many uses, including:</p> <ul> <li>Viewing molecular sequences</li> <li>Algorithmic alignment of molecular sequences</li> <li>Manually editing and aligning molecular sequences</li> <li>Estimating phylogenetic trees from molecular sequences</li> <li>Viewing phylogenetic trees</li> </ul> <p>If you are running Windows or macOS, you can download the latest version of SeaView from the <a href="http://doua.prabi.fr/software/seaview">SeaView web site</a>. If you are running Ubuntu, then SeaView is available from the package manager. You can install it from the “Ubuntu Software” GUI, or manually using <code class="highlighter-rouge">apt install seaview</code>.</p> <p>PAUP is used to infer trees from molecular data, and incorporates many different methods and models for doing so. These include:</p> <ul> <li>Maximum parsimony</li> <li>Maximum likelihood</li> <li>Distance-based methods like neighbor-joining</li> <li><a href="https://doi.org/10.1093/bioinformatics/btu530">SVDquartets</a>, which is <a href="https://doi.org/10.1101/523050">statistically consistent with the multispecies coalescent</a></li> </ul> <p>If you are running macOS or Linux, please download the latest <strong>command line</strong> version of PAUP for your platform from the <a href="http://phylosolutions.com/paup-test/">PAUP test-version downloads</a> web site. Extract PAUP, and make sure the program is executable by opening the command line, navigating to the directory it was stored in, and running <code class="highlighter-rouge">chmod +x paup4a164_ubuntu64</code> on Ubuntu or <code class="highlighter-rouge">chmod +x paup4a164_osx</code> on macOS. If you are running Windows, download the Windows GUI version from the same web site.</p> <p>If you do not have a favorite text editor already, I recommend <a href="https://www.sublimetext.com/3">Sublime Text</a> or <a href="https://code.visualstudio.com/">Visual Studio Code</a>. 
You can download and install either program from their respective web sites.</p> <p>After downloading the software, download the <a href="http://www.cs.rice.edu/~ogilvie/assets/phylogenetics-workshop.zip">workshop materials</a> archive to your computer, and extract its contents.</p> <h1 id="step-2-exploring-the-true-tree-and-sequence-data">Step 2: Exploring the true tree and sequence data</h1> <p>Launch SeaView, and then open the <code class="highlighter-rouge">fz.tree</code> file in the <code class="highlighter-rouge">phylogenetics-workshop</code> folder. This will show you an ultrametric tree that was randomly generated for this workshop (using a coalescent model).</p> <p>Still in SeaView, open the <code class="highlighter-rouge">fz.nexus</code> multiple sequence alignment file. This is a 100,000 character alignment generated based on the tree you just opened, and using a Jukes-Cantor model of molecular evolution.</p> <h1 id="step-3-inferring-the-maximum-parsimony-tree-with-paup">Step 3: Inferring the maximum parsimony tree with PAUP</h1> <p>We will use PAUP to infer a phylogenetic tree. Open the command line on your computer, and navigate to the extracted <code class="highlighter-rouge">phylogenetics-workshop</code> folder. On Windows, run <code class="highlighter-rouge">paup fz.nexus</code>. On macOS or Linux, replace <code class="highlighter-rouge">paup</code> with the path to the PAUP executable on your computer. For example if you saved it to the Downloads folder on a Mac, this might be <code class="highlighter-rouge">~/Downloads/paup4a164_osx</code>. Run the following lines of PAUP code:</p> <ol> <li><code class="highlighter-rouge">Set Criterion=Parsimony;</code></li> </ol> <p>This tells PAUP that the parsimony score of a tree should be used to judge its goodness of fit.</p> <ol> <li><code class="highlighter-rouge">BandB;</code></li> </ol> <p>This command will identify the best fitting tree according to the parsimony criterion. 
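To make the parsimony criterion concrete, the score assigned to a single tree at a single site can be computed (for equal costs) with Fitch's small-parsimony algorithm. The sketch below is a minimal illustration only, not PAUP's implementation; the nested-tuple tree encoding and the taxon names are hypothetical.

```python
def fitch_score(node, states):
    """Return (state set, substitution count) for one site under Fitch's
    equal-cost parsimony. `node` is a taxon name (str) or a (left, right)
    tuple for an internal node; `states` maps taxon name -> observed base."""
    if isinstance(node, str):
        return {states[node]}, 0          # tip: observed state, zero changes
    left_set, left_cost = fitch_score(node[0], states)
    right_set, right_cost = fitch_score(node[1], states)
    if left_set & right_set:              # children agree: keep the intersection
        return left_set & right_set, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1  # one change needed

# Hypothetical example: one site on the rooted tree ((human, chimp), gorilla)
tree = (("human", "chimp"), "gorilla")
site = {"human": "A", "chimp": "A", "gorilla": "C"}
states, score = fitch_score(tree, site)   # score is 1: a single change suffices
```

Summing this score over every site gives the parsimony score of a tree; branch-and-bound then explores tree space exhaustively while pruning any partial tree whose score already exceeds that of the best complete tree found so far.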
Normally we have to use some kind of stochastic algorithm like hill-climbing or MCMC to infer trees, as the number of possible trees is so large. Because this data set is relatively small (100,000 sites and 12 taxa), we can instead use an exact “branch-and-bound” algorithm.</p> <ol> <li><code class="highlighter-rouge">SaveTrees file=mp.tree replace=yes;</code></li> </ol> <p>Save the inferred tree as a file with the name <code class="highlighter-rouge">mp.tree</code>.</p> <ol> <li><code class="highlighter-rouge">Quit;</code></li> </ol> <p>Should be self-explanatory.</p> <h1 id="step-3-exploring-the-inferred-tree">Step 4: Exploring the inferred tree</h1> <p>Open the inferred tree in SeaView. Make sure the true tree is still open. The kind of inference we used produces an unrooted tree without branch lengths, so you may have to reroot it or rotate nodes in SeaView. Experiment with the “Swap” and “Re-root” options in SeaView so that the trees match.</p> <p>Which nodes, if any, differ between the true tree and the estimated tree topology?</p>Huw A. OgilvieStep 1: Downloading and installing software