# comp571

## PAM (Dayhoff) matrices

The point accepted mutation (PAM) substitution model, also known as the Dayhoff substitution model, is an amino acid substitution model derived from empirica...

## COMP571 (Fall 2022)

Important note: The information contained in the course syllabus, other than the absence policies, may be subject to change with reasonable advance notice, a...

## Backward algorithm

Like the forward algorithm, we can use the backward algorithm to calculate the marginal likelihood of a hidden Markov model (HMM). Also like the forward algo...

## Forward algorithm

The Viterbi algorithm identifies a single path of hidden Markov model (HMM) states. This is the path which maximizes the joint probability of the observed da...

## Viterbi algorithm

The Viterbi algorithm is used to efficiently infer the most probable “path” of the unobserved random variable in an HMM. In the CpG islands case, this is the...

## COMP571 (Fall 2020, a.k.a. COVID times)

Important note: The information contained in the course syllabus, other than the absence policies, may be subject to change with reasonable advance notice, a...

## Calculating the likelihood for an ultrametric tree (example)

In this example we will calculate the likelihood $$P(D|T,h)$$ where $$D$$ is a single site, $$T$$ is a rooted tree topology, and $$h$$ is the node heights fo...

## Hill climbing and NNI

The Sankoff algorithm can efficiently calculate the parsimony score of a tree topology. Felsenstein’s pruning algorithm can efficiently calculate the probabi...

## Long branch attraction (in the Felsenstein zone)

Long branch attraction is the phenomenon where two branches which are in truth not sisters are inferred to be sister branches when using maximum parsimony in...

## Likelihood of a tree

The likelihood of a tree is the probability of a multiple sequence alignment or matrix of trait states (commonly known as a character matrix) given a tree to...

## Dollo’s law and unequal-cost parsimony

Certain mutations are more surprising than others. DNA is composed of a string of nucleotides, which are either pyrimadines (cytosine or thymine) or purines ...

## Equal-cost parsimony

The principle behind maximum parsimony based inference is to explain the data using the smallest cost. In its most basic form, all events are given equal cos...

## COMP571/BIOC571 (Fall 2019)

Important note: The information contained in the course syllabus, other than the absence policies, may be subject to change with reasonable advance notice, a...

## Neighbor joining

Neighbor joining is similar to UPGMA/WPGMA, but infers unrooted trees. As a consequence, and unlike UPGMA/WPGMA, it does not require that the multiple sequen...

## UPGMA and WPGMA trees

Two related methods for infer phylogenetic trees from multiple sequence alignments (MSAs) are the Unweighted Pair Group Method with Arithmetic Mean (UPMGA) a...

## GTR and nested models

The most commonly used nucleotide substitution models for phylogenetic reconstruction belong to the general time reversible (GTR) family. The main reason for...

## Tree comparisons

The crudest way to compare two trees is to somehow quantify the difference between them as a single number. There are many methods of quantifying the differe...

## Phylogenetic trees and data structures

Bifurcating trees have found enormous utility in biology. Depending on what the tips of a tree represent, they can be used as a model to help understand vari...

## Hidden Markov models

Hidden Markov models (HMMs) are a class of Markov models where the same states of a random variable (e.g. the four nucleotides of DNA) can be generated by di...

## Markov models

Markov processes model the change in random variables along a time dimension, and obey the Markov property. Informally, the Markov property requires that the...

## Pseudocounts

When computing PSSMs, averaging does not take into account the amount of information available in a multiple sequence alignment (MSA) used to construct a PSS...

## Bayesian inference and MCMC

Bayesian statistics is the practice of updating the probability of the value of some parameter θ of model M being the true value, based on observations (D fo...

## Probability and likelihood distributions

The probability of a value can mean its probability of being the true value of some parameter θ, conditional on some model M or ensemble of models. It can al...

## Position-specific score matrices

Global and local alignment tools like BLAST are typically used to search for and align homologous sequences which are related by common descent. However we m...

## Burrows–Wheeler transform

Hash tables are a very good general solution for providing quick access to values (e.g. k-mer positions) based on keys (e.g. k-mers). There is an even more e...

To access elements in an array or matrix, programming languages typically implement one of two addressing schemes. One of the schemes uses one-based numberin...

## Indexed MegaBLAST, BLAT, and SSAHA

BLAST and FASTA are based on building an index from the query sequences, and scanning the database sequences for k-mers which are present in the query index....

## BLAST and FASTA

Pure Smith–Waterman is unpopular because of its poor scaling. Given two sequences of unequal lengths n and m, we have to compute nm intermediate values. We t...

## PAM vs BLOSUM score matrices

Amino acids, nucleotides or any other evolutionary character are replaced by others at some rate. For example, imagine an evolutionary sequence with three po...

## Expected values of score matrices

There are many ways to compute score matrices for amino acid alignment. Typically score matrices are based on the log-odds of amino acid replacements, and so...

## Smith–Waterman

The Smith–Waterman algorithm is basically the Needleman–Wunsch, but with a few modifications to turn it into a local alignment algorithm. These are:

## Needleman–Wunsch

Global alignment algorithms are used to align two globally similar sequences. Commonly global alignment is used to identify first whether two protein, DNA or...