Comp/Stat 470: From sequence to structure: Module II
Hidden Markov Models
The classic paper is A tutorial on hidden Markov models and
selected applications in speech recognition by Lawrence Rabiner,
Proceedings of the IEEE, 77(2):257-286, 1989.
What you need to know about HMMs:
- What HMMs are.
- The forward and backward algorithms.
- The Viterbi algorithm and smoothing/posterior decoding.
- How to train HMMs.
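To make the items above concrete, here is a minimal Python sketch of the forward, backward, posterior decoding, and Viterbi computations on a toy two-state HMM. The state names, all probabilities, and the function names are illustrative assumptions, not taken from Rabiner's tutorial or from any gene finder. The sketch multiplies raw probabilities, so it is only suitable for short sequences; longer ones would need scaling or log-space arithmetic.

```python
# Toy two-state HMM over DNA letters. All numbers are made up for illustration.
states = ["coding", "noncoding"]
start = {"coding": 0.5, "noncoding": 0.5}
trans = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
emit = {
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def forward(obs):
    """f[t][s] = P(obs[0..t], state at t is s)."""
    f = [{s: start[s] * emit[s][obs[0]] for s in states}]
    for x in obs[1:]:
        prev = f[-1]
        f.append({s: emit[s][x] * sum(prev[r] * trans[r][s] for r in states)
                  for s in states})
    return f


def backward(obs):
    """b[t][s] = P(obs[t+1..] | state at t is s)."""
    b = [{s: 1.0 for s in states}]
    # Walk from the last symbol back to the second one, prepending rows.
    for x in reversed(obs[1:]):
        nxt = b[0]
        b.insert(0, {s: sum(trans[s][r] * emit[r][x] * nxt[r] for r in states)
                     for s in states})
    return b


def posterior(obs):
    """Posterior decoding: P(state at t is s | obs) for every position t."""
    f, b = forward(obs), backward(obs)
    total = sum(f[-1].values())  # P(obs)
    return [{s: f[t][s] * b[t][s] / total for s in states}
            for t in range(len(obs))]


def viterbi(obs):
    """Most probable single state path for obs."""
    v = [{s: start[s] * emit[s][obs[0]] for s in states}]
    back = []
    for x in obs[1:]:
        prev = v[-1]
        # Best predecessor for each state at this position.
        ptr = {s: max(states, key=lambda r: prev[r] * trans[r][s])
               for s in states}
        v.append({s: prev[ptr[s]] * trans[ptr[s]][s] * emit[s][x]
                  for s in states})
        back.append(ptr)
    path = [max(states, key=lambda s: v[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]
```

Calling viterbi("CGCGCG") returns one state label per input letter, while posterior("CGCGCG") returns a per-position distribution over states. The two decodings answer different questions: the single best path versus the most likely state at each position.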
Computational genefinding using HMMs
Check the genefinding web site for the latest list of gene finders, data sets, and bibliography.
Reviews of genefinding
on computational gene finding, D. Haussler, 1998.
A short introduction to signal and content sensors and integrated gene
finding approaches based on the hidden Markov model. It is a little dated now, but still serves as an excellent introduction to the difficulties in finding genes ab initio in
the human genome.
- Computational prediction of eukaryotic protein-coding genes, M. Q. Zhang, Nature Reviews Genetics (2002), Vol. 3, 698-709.
An excellent assessment, up to 2002, of available approaches to
identifying protein-coding genes. It explains why exon boundaries are
difficult to detect accurately, and identifies the need to incorporate
more biological knowledge, and to build more specialized computational
approaches for identifying protein-coding genes.
- Comparative Genomics:
genome-wide analysis in metazoan eukaryotes, Birney et al.,
Nature Reviews Genetics (2003) Vol. 4, 251-262.
An excellent, more recent review (2003) of gene and regulatory region prediction using multiple
genomes. Sets out the basics of comparative genomics.
- JIGSAW, GeneZilla, and GlimmerHMM: puzzling out the
features of human genes in the ENCODE regions, J. E. Allen,
W. H. Majoros, M. Pertea and S. L. Salzberg, Genome Biology, 2006;7.
A 2006 review paper that compares three state-of-the-art gene finders and offers prescriptions for how the next generation of gene finders should be designed.
Methods of genefinding
- Prediction of complete gene structure in human genomic DNA, C. Burge and S. Karlin, J. Mol. Biol. (1997), Vol. 268, 78-94.
The original GENSCAN paper, which lays out the HMM approach to genefinding. GENSCAN is still a widely used
standard for comparison of gene-finding programs. Open source versions of GENSCAN are not readily available.
- Gene finding in novel genomes, I. Korf, BMC Bioinformatics. 2004 May 14;5:59.
This paper introduces SNAP, an ab initio gene finder, and
demonstrates that a simplified version of the HMM used by
GENSCAN, tuned for a specific genome, outperforms the more general GENSCAN model.
SNAP is downloadable here.
- GeneWise and
Genomewise, Ewan Birney, Michele Clamp and Richard Durbin, Genome
Research, Volume 14, pages 988-995, 2004.
How do we incorporate extra knowledge into genefinders? GeneWise and
Genomewise demonstrate one approach. GeneWise predicts gene structure
using similar protein sequences, and Genomewise provides a gene
structure final parse across cDNA- and EST-defined spliced structure.
Both algorithms are used by the Ensembl annotation system. The
GeneWise algorithm is a principled combination of hidden Markov models.
Check out the update Ensembl 2005.
- decoding algorithms for generalized hidden Markov model gene finders,
W. H. Majoros, M. Pertea, A. L. Delcher and S. L. Salzberg,
BMC Bioinformatics. 2005 Jan 24;6(1):16.
This paper introduces methods for efficiently incorporating homology information to assist in gene prediction.
- An empirical analysis of training protocols for probabilistic gene finders, Majoros WH, Salzberg SL, BMC Bioinformatics. 2004 Dec 21;5(1):206.
A systematic account of how to set up training data for HMM gene finders.
- JIGSAW: integration of multiple sources of evidence for gene prediction, J. E. Allen and S. L. Salzberg, Bioinformatics, Volume 21, No. 18, pp 3596-603, 2005.
An open source program that integrates an ab initio gene finder with other evidence obtained from homology and ESTs.
- GeneMark: web software for gene finding in prokaryotes, eukaryotes and viruses,
J. Besemer and M. Borodovsky, Nucleic Acids Research, Web Server Issue W451-4, 2005.
A state-of-the-art gene finder which you can access through a web interface.
- Applications of generalized pair hidden Markov models to alignment and gene finding problems, L. Pachter, M. Alexandersson and M. Cawley, J. Comp. Biol. (2002), Vol. 9, No. 2, 389-399.
The original SLAM paper. Introduces pair HMMs for the gene finding task.
and mouse gene structure: comparative analysis and application to exon
prediction, S. Batzoglou, L. Pachter, J. P. Mesirov, B. Berger and
E. S. Lander, Genome Res. 2000 Jul;10(7):950-8.
Uses identification of orthologous genes and dynamic programming to make predictions of genes across species.
- Accurate identification of novel human genes through simultaneous gene prediction in human, mouse, and rat, Dewey C, Wu JQ, Cawley S, Alexandersson M, Gibbs R, Pachter L, Genome Res. 2004 Apr;14(4):661-4.
Version 2 of the SLAM paper with more detailed computational results and experimental validation.
- Efficient implementation of a generalized pair hidden Markov model for comparative gene finding, W. H. Majoros, M. Pertea and S. Salzberg, Bioinformatics, 21(9):1782-8, 2005.
This paper describes an open-source pair GHMM for gene finding.
- Gene finding in the chicken genome, Eyras E, Reymond A, Castelo R, Bye JM, Camara F, Flicek P, Huckle EJ, Parra G, Shteynberg DD, Wyss C, Rogers J, Antonarakis SE, Birney E, Guigo R, Brent MR, BMC Bioinformatics, Volume 6, No. 1, 2005.
The application of Twinscan, Ensembl and SGP2 to the problem of predicting genes on the chicken genome using the human genome as a reference.
Evaluating genefinding programs
- GFPE: gene-finding program evaluation, Wang J, Kraemer E, Bioinformatics. 2003 Sep 1;19(13):1712-3.
- Evaluation of gene finding programs on mammalian sequences, Rogic et al., Genome Research (2001), Vol. 11, 817-832.
- Comparison of various algorithms for recognizing short coding sequences of human genes,
Gao F, Zhang CT, Bioinformatics. 2004 Mar 22;20(5):673-81. Epub 2004 Feb 5.
Other approaches to genefinding
The chromosome 22 page is here.
Fourier characteristic of coding sequences, C. Yin and S. S. Yau,
J. Computational Biology, Volume 12, No. 9, pp 1153-65, 2005.
- SVM classification of human intergenic and gene sequences,
Qiao YH, Liu JL, Zhang CG, Xu XH, Zeng YJ, Math Biosci. 2005 Jun;195(2):168-78.
I will cover two families of supervised learning methods:
discriminative models exemplified by support vector machines, and
characteristic models exemplified by naive Bayes.
I will also cover techniques for feature selection in the
context of microarray data analysis.
A Matlab based toolbox to experiment with these methods is
Support vector machines
SVMs for molecular classifications of cancer from gene expression data
classification of cancer: class discovery and class prediction by gene
expression monitoring, Golub TR, Slonim DK, Tamayo P, Huard C,
Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA,
Bloomfield CD, Lander ES, Science. 1999 Oct 15;286(5439):531-7.
analysis of microarray gene expression data by using support vector
machines, Michael P. S. Brown, William Noble Grundy, David Lin,
Nello Cristianini, Charles Walsh Sugnet, Terrence S. Furey, Manuel
Ares, Jr., and David Haussler, PNAS 2000; 97: 262-267.
vector machine classification and validation of cancer tissue samples
using microarray expression data, Furey TS, Cristianini N, Duffy
N, Bednarski DW, Schummer M, Haussler D, Bioinformatics. 2000
- Tissue classification with gene expression profiles,
Ben-Dor A, Bruhn L, Friedman N, Nachman I, Schummer M, Yakhini Z,
Comput Biol. 2000;7(3-4):559-83.
cancer diagnosis using tumor gene expression signatures, Sridhar
Ramaswamy, Pablo Tamayo, Ryan Rifkin, Sayan Mukherjee, Chen-Hsiang
Yeang, Michael Angelo, Christine Ladd, Michael Reich, Eva Latulippe,
Jill P. Mesirov, Tomaso Poggio, William Gerald, Massimo Loda, Eric
S. Lander, and Todd R. Golub, PNAS 2001 98: 15149-15154;
- Selection bias in gene extraction on the basis of
microarray gene-expression data, Christophe Ambroise and Geoffrey
J. McLachlan, PNAS 2002 99: 6562-6566.
of multiple cancer types by multicategory support vector machines
using gene expression data, Lee Y, Lee CK, Bioinformatics. 2003
Learning Bayesian networks
Learning Bayesian networks: theory
tutorial on learning with Bayesian networks, D. Heckerman,
- Learning Bayesian
network structure from massive data sets: the sparse candidate
algorithm, N. Friedman, I. Nachman and D. Pe'er, Proc. Fifteenth
Conf. on Uncertainty in Artificial Intelligence (UAI) 1999.
- Being Bayesian
about network structure: a Bayesian approach to structure discovery in
Bayesian networks, N. Friedman and D. Koller, Machine Learning,
Learning probabilistic relational models, L. Getoor, N. Friedman,
D. Koller, and A. Pfeffer. Invited contribution to the book Relational
Data Mining, S. Dzeroski and N. Lavrac, Eds., Springer-Verlag, 2001.
- Bayes nets fundamentals, a compiled list of introductory material on Bayesian networks.
Background reading on genetic networks
- Modeling and simulation of genetic
regulatory networks, H. De Jong, Journal of Computational Biology,
regulatory networks in Saccharomyces cerevisiae, Lee, T. I.,
Rinaldi, N. J., Robert, F., Odom, D. T., Bar-Joseph, Z., Gerber,
G. K., Hannett, N. M., Harbison, C. T., Thompson, C. M., Simon, I., et
al. (2002) Science 298, 799-804.
mapping of the yeast genetic interaction network, Tong AH, Lesage
G, Bader GD, Ding H, Xu H, Xin X, Young J, Berriz GF, Brost RL, Chang
M, Chen Y, Cheng X, Chua G, Friesen H, Goldberg DS, Haynes J,
Humphries C, He G, Hussein S, Ke L, Krogan N, Li Z, Levinson JN, Lu H,
Menard P, Munyana C, Parsons AB, Ryan O, Tonikian R, Roberts T, Sdicu
AM, Shapiro J, Sheikh B, Suter B, Wong SL, Zhang LV, Zhu H, Burd CG,
Munro S, Sander C, Rine J, Greenblatt J, Peter M, Bretscher A, Bell G,
Roth FP, Brown GW, Andrews B, Bussey H, Boone C, Science. 2004 Feb
Inferring regulatory networks from genomic and proteomic data
- Inferring cellular networks using probabilistic graphical models, N. Friedman, Science 303(5659):799-805, 2004.
- Revealing modularity
and organization in the yeast molecular network by integrated analysis
of highly heterogeneous genomewide data, Tanay A, Sharan R, Kupiec
M, Shamir R, Proc Natl Acad Sci U S A. 2004 Mar 2;101(9):2981-6. Epub
2004 Feb 18.
- Using protein-protein interactions for refining gene
networks estimated from microarray data by Bayesian networks,
Nariai N, Kim S, Imoto S, Miyano S, Pac Symp Biocomput. 2004;:336-47.
Networks: Discovering Regulatory Modules and their Condition Specific
Regulators from Gene Expression Data, E. Segal, M. Shapira,
A. Regev, D. Pe'er, D. Botstein, D. Koller, N. Friedman Nature
Genetics, 2003 June, 34(2): 166-76. Supplement to
- Genome-wide Discovery of
Transcriptional Modules from DNA Sequence and Gene Expression,
E. Segal, R. Yelensky, D. Koller Bioinformatics, 2003; 19 Suppl 1.
- Discovering Molecular Pathways from Protein
Interaction and Gene Expression Data, E. Segal, H. Wang, D. Koller
Bioinformatics, 2003; 19 Suppl 1.
- Decomposing Gene Expression into Cellular Processes,
E. Segal, A. Battle, D. Koller
In Proceedings of the 8th Pacific Symposium on Biocomputing (PSB), Kaua'i, January 2003.
- Rich Probabilistic Models for Gene Expression,
E. Segal, B. Taskar, A. Gasch, N. Friedman, D. Koller
Bioinformatics, 2001; 17 Suppl 1:S243-252.
gene networks from gene expression data by combining Bayesian network
model with promoter element detection, Tamada Y, Kim S, Bannai H,
Imoto S, Tashiro K, Kuhara S, Miyano S, Bioinformatics. 2003 Oct;19
Boolean Networks: a rule-based uncertainty model for gene regulatory
networks, Shmulevich I, Dougherty ER, Kim S, Zhang W,
Bioinformatics. 2002 Feb;18(2):261-74.
- Minreg: inferring an active regulator set, Pe'er D, Regev A, Tanay A,
Bioinformatics. 2002;18 Suppl 1:S258-67.
subnetworks from perturbed expression profiles, Pe'er D, Regev A,
Elidan G, Friedman N, Bioinformatics. 2001;17 Suppl 1:S215-24.
- Using Bayesian networks to analyze expression
data, Friedman N, Linial M, Nachman I, Pe'er D, J Comput
Biol. 2000;7(3-4):601-20. Check here if the paper is unavailable from the PubMed link.
Bayesian network software
- Bayes Net
Toolbox for Matlab, Kevin Murphy, MIT.
Bayes net software and WinMine, both from Microsoft.
- Genie and Smile (C++ software for building and learning bayesian networks from data), University of Pittsburgh.
Explorer: A Probabilistic Network Learning Toolkit for Biomedical
Discovery, C. Aliferis, I. Tsamardinos,
A. Statnikov. International Conference on Mathematics and Engineering
Techniques in Medicine and Biological Sciences (METMBS), 2003. (Matlab under Win 32)
from Glymour's group at CMU.
Other useful software and data for exploring metabolic networks and gene expression data
Last modified: Sun Jan 4 21:33:49 CST 2009