comp571

PAM (Dayhoff) matrices

7 minute read

The point accepted mutation (PAM) substitution model, also known as the Dayhoff substitution model, is an amino acid substitution model derived from empirica...

COMP571 (Fall 2022)

6 minute read

Important note: The information contained in the course syllabus, other than the absence policies, may be subject to change with reasonable advance notice, a...

Backward algorithm

6 minute read

Like the forward algorithm, we can use the backward algorithm to calculate the marginal likelihood of a hidden Markov model (HMM). Also like the forward algo...

Forward algorithm

5 minute read

The Viterbi algorithm identifies a single path of hidden Markov model (HMM) states. This is the path which maximizes the joint probability of the observed da...

Viterbi algorithm

9 minute read

The Viterbi algorithm is used to efficiently infer the most probable “path” of the unobserved random variable in an HMM. In the CpG islands case, this is the...

COMP571 (Fall 2020, a.k.a. COVID times)

6 minute read

Important note: The information contained in the course syllabus, other than the absence policies, may be subject to change with reasonable advance notice, a...

Calculating the likelihood for an ultrametric tree (example)

5 minute read

In this example we will calculate the likelihood \(P(D|T,h)\) where \(D\) is a single site, \(T\) is a rooted tree topology, and \(h\) is the node heights fo...

Hill climbing and NNI

5 minute read

The Sankoff algorithm can efficiently calculate the parsimony score of a tree topology. Felsenstein’s pruning algorithm can efficiently calculate the probabi...

Long branch attraction (in the Felsenstein zone)

3 minute read

Long branch attraction is the phenomenon where two branches which are in truth not sisters are inferred to be sister branches when using maximum parsimony in...

Likelihood of a tree

7 minute read

The likelihood of a tree is the probability of a multiple sequence alignment or matrix of trait states (commonly known as a character matrix) given a tree to...

Dollo’s law and unequal-cost parsimony

4 minute read

Certain mutations are more surprising than others. DNA is composed of a string of nucleotides, which are either pyrimadines (cytosine or thymine) or purines ...

Equal-cost parsimony

4 minute read

The principle behind maximum parsimony based inference is to explain the data using the smallest cost. In its most basic form, all events are given equal cos...

COMP571/BIOC571 (Fall 2019)

4 minute read

Important note: The information contained in the course syllabus, other than the absence policies, may be subject to change with reasonable advance notice, a...

Neighbor joining

5 minute read

Neighbor joining is similar to UPGMA/WPGMA, but infers unrooted trees. As a consequence, and unlike UPGMA/WPGMA, it does not require that the multiple sequen...

UPGMA and WPGMA trees

5 minute read

Two related methods for infer phylogenetic trees from multiple sequence alignments (MSAs) are the Unweighted Pair Group Method with Arithmetic Mean (UPMGA) a...

GTR and nested models

2 minute read

The most commonly used nucleotide substitution models for phylogenetic reconstruction belong to the general time reversible (GTR) family. The main reason for...

Tree comparisons

6 minute read

The crudest way to compare two trees is to somehow quantify the difference between them as a single number. There are many methods of quantifying the differe...

Phylogenetic trees and data structures

3 minute read

Bifurcating trees have found enormous utility in biology. Depending on what the tips of a tree represent, they can be used as a model to help understand vari...

Hidden Markov models

10 minute read

Hidden Markov models (HMMs) are a class of Markov models where the same states of a random variable (e.g. the four nucleotides of DNA) can be generated by di...

Markov models

6 minute read

Markov processes model the change in random variables along a time dimension, and obey the Markov property. Informally, the Markov property requires that the...

Pseudocounts

7 minute read

When computing PSSMs, averaging does not take into account the amount of information available in a multiple sequence alignment (MSA) used to construct a PSS...

Bayesian inference and MCMC

8 minute read

Bayesian statistics is the practice of updating the probability of the value of some parameter θ of model M being the true value, based on observations (D fo...

Probability and likelihood distributions

3 minute read

The probability of a value can mean its probability of being the true value of some parameter θ, conditional on some model M or ensemble of models. It can al...

Position-specific score matrices

7 minute read

Global and local alignment tools like BLAST are typically used to search for and align homologous sequences which are related by common descent. However we m...

Burrows–Wheeler transform

15 minute read

Hash tables are a very good general solution for providing quick access to values (e.g. k-mer positions) based on keys (e.g. k-mers). There is an even more e...

Arrays and addressing

1 minute read

To access elements in an array or matrix, programming languages typically implement one of two addressing schemes. One of the schemes uses one-based numberin...

Indexed MegaBLAST, BLAT, and SSAHA

3 minute read

BLAST and FASTA are based on building an index from the query sequences, and scanning the database sequences for k-mers which are present in the query index....

BLAST and FASTA

6 minute read

Pure Smith–Waterman is unpopular because of its poor scaling. Given two sequences of unequal lengths n and m, we have to compute nm intermediate values. We t...

PAM vs BLOSUM score matrices

5 minute read

Amino acids, nucleotides or any other evolutionary character are replaced by others at some rate. For example, imagine an evolutionary sequence with three po...

Expected values of score matrices

3 minute read

There are many ways to compute score matrices for amino acid alignment. Typically score matrices are based on the log-odds of amino acid replacements, and so...

Smith–Waterman

3 minute read

The Smith–Waterman algorithm is basically the Needleman–Wunsch, but with a few modifications to turn it into a local alignment algorithm. These are:

Needleman–Wunsch

16 minute read

Global alignment algorithms are used to align two globally similar sequences. Commonly global alignment is used to identify first whether two protein, DNA or...

Recursion

1 minute read

Recursion is where a single function inside a computer program calls itself. This turns out to be an extremely versatile tool in the computer science toolbox...

COMP571/BIOC571

4 minute read

Important note: The information contained in the course syllabus, other than the absence policies, may be subject to change with reasonable advance notice, a...