Abstract

Understanding protein-protein interactions is vital to the study of cellular signaling pathways and cell regulation. Interaction networks mediate significant cellular processes, from replication to programmed cell death, necessitating an analysis of the key proteins involved in these pathways. Identifying novel protein-protein interaction zones is the first of many steps toward identifying new interaction partners in the complex network of intracellular signaling. The goal of this project is to develop feature selection methods for identifying salient geometric and chemical properties of protein interaction zones and then use these features to distinguish between interaction and non-interaction zones via machine learning approaches such as support vector machines (SVMs) and k-nearest neighbors (kNN).

Problem Statement

Given (1) a crystallographic/NMR protein structure, (2) a set of known protein-protein interaction zones as partial protein structures (positive control), and (3) a set of partial protein structures known not to be interaction zones (negative control), can we predict novel protein-protein interaction zones?

Proposal

Understanding protein-protein interactions is vital to the study of cellular signaling pathways and cell regulation. Interaction networks mediate significant cellular processes, from replication to programmed cell death, necessitating an analysis of the key proteins involved in these pathways. Identifying novel protein-protein interaction zones is the first of many steps toward identifying new interaction partners in the complex network of intracellular signaling.

The goal of this project is to develop feature selection methods for identifying salient geometric and chemical properties of protein interaction zones and then utilize these features to distinguish between interaction and non-interaction zones via machine learning approaches such as support vector machines (SVMs) and k-nearest neighbors (kNN).
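
As a rough illustration of the classification step, the sketch below applies k-nearest neighbors to a handful of surface patches. Everything in it is an assumption made for illustration: the two features (mean solvent accessibility and a normalized hydropathy score), their values, the labels, and the choice of k = 3 are hypothetical and are not data or settings from this project. An SVM would normally come from an existing library rather than being written by hand, so only kNN is sketched here.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Return the majority label among the k training points nearest to query.

    train: list of (feature_vector, label) pairs
    query: feature vector with the same length as the training vectors
    """
    # Sort training points by Euclidean distance to the query and keep k.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical patches: (mean accessibility, normalized hydropathy) -> label.
training_patches = [
    ((0.80, 0.70), "interaction"),
    ((0.90, 0.60), "interaction"),
    ((0.70, 0.80), "interaction"),
    ((0.20, 0.10), "non-interaction"),
    ((0.10, 0.30), "non-interaction"),
    ((0.30, 0.20), "non-interaction"),
]

print(knn_predict(training_patches, (0.75, 0.65)))
print(knn_predict(training_patches, (0.15, 0.20)))
```

With real data each patch would carry many more features, and those features would need to be scaled to comparable ranges before computing distances.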

Because proteins are inherently complex molecules, the space of possible protein features is extremely large. Selecting salient features that can be used as markers for predicting the interaction capacity of protein surface regions is crucial to the development of sensitive and specific classifiers. Possible features of interest include, but are by no means limited to:

  • residue/atom surface accessibility
  • residue hydropathy (hydrophobicity/hydrophilicity)
  • secondary structure elements, such as α-helices and β-sheets
  • residue side-chain mobility (flexibility), measured via B-factor
  • residue side-chain polar contacts with water, known ligands, and other side-chains
  • domain fold
  • surface cavities and protrusions
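
As one example of how a feature from the list above might be extracted, the sketch below computes a mean B-factor per residue from fixed-width ATOM records. The column positions follow the PDB format (chain ID in column 22, residue number in columns 23-26, temperature factor in columns 61-66). The records themselves are synthetic: the helper `make_atom_line` and every coordinate and B-factor value are hypothetical and are not taken from a real structure.

```python
from collections import defaultdict

def make_atom_line(serial, name, res_name, chain, res_seq, x, y, z, b_factor):
    """Build a synthetic fixed-width ATOM record (illustration only)."""
    return ("ATOM  {:5d} {:<4s} {:3s} {:1s}{:4d}    "
            "{:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}").format(
        serial, name, res_name, chain, res_seq, x, y, z, 1.00, b_factor)

def mean_b_factor_per_residue(pdb_lines):
    """Average the temperature factor over the atoms of each residue."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of B-factors, atom count]
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        # Python slices are 0-based, so PDB column 61-66 becomes [60:66].
        key = (line[21], int(line[22:26]), line[17:20].strip())
        totals[key][0] += float(line[60:66])
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}

# Hypothetical records: two atoms of one residue, one atom of the next.
records = [
    make_atom_line(1, " N  ", "GLY", "A", 84, 11.104, 6.134, -6.504, 20.00),
    make_atom_line(2, " CA ", "GLY", "A", 84, 11.639, 6.071, -5.147, 30.00),
    make_atom_line(3, " CA ", "LYS", "A", 85, 13.559, 8.630, -3.095, 45.50),
]

b_factors = mean_b_factor_per_residue(records)
for residue, value in sorted(b_factors.items()):
    print(residue, value)
```

In practice the lines would be read from a downloaded structure file, and an established PDB parser would handle details this sketch ignores, such as alternate locations and HETATM records.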

These features only scratch the surface of the information available from the crystallographic structure file alone (PDB file: www.pdb.org). Many other features, such as evolutionary conservation and sequence homology, are interesting and valuable as well, but are outside the scope of this project.

Although many diverse types of proteins contain protein-protein interaction domains, protein kinases will be the focus of this work. Kinases are interesting because they are capable of "selective promiscuity": they interact with multiple partners, but with extremely high specificity for only those partners. Understanding the structural basis for this "selective promiscuity" is a goal of this work.

The complexity of kinase interaction networks

An SH3 domain-mediated interaction network identified from a peptide array target screen (taken from the website of Shawn Li, University of Western Schulich).

Project Outline

The basic outline for the progression of this project is as follows:

Hypothesis

I hypothesize that the identification of protein interfaces will rely not on the existence of only a handful of crucial features, but rather on a collection of these features and their spatial distribution. Comparison against alternative approaches for protein interface prediction, such as ProMate (Neuvirth et al. 2004; ProMate web interface), will provide a valuable benchmark for sensitivity and specificity.
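
For reference, sensitivity is the fraction of true interaction zones that a classifier recovers, TP / (TP + FN), and specificity is the fraction of non-interaction zones it correctly rejects, TN / (TN + FP). A minimal sketch of that computation follows; the label vectors are invented for illustration and are not results from this project or from ProMate.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels (1 = interaction zone).

    Assumes both classes occur at least once in y_true.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical labels: 4 known interaction zones, 6 known non-interaction zones.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity = {sens:.3f}, specificity = {spec:.3f}")
```

Reporting both numbers matters here because non-interaction surface typically outnumbers interaction surface, so overall accuracy alone can look good for a classifier that rarely predicts an interface.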

Timeline

February 1st-15th: Finalize selection of protein features to investigate
February 16th-29th: Select families of proteins to include in the study
March 1st-15th: Rate classification power of SVMs on the dataset with and without kernels as appropriate. Assess ability of Principal Component Analysis (to reduce dimensionality) in combination with k-nearest neighbors techniques for classification.
March 16th-30th: Benchmark sensitivity and specificity of model to existing alternative techniques
April: Write final report

Protein Feature Gallery

Crystal structure of human tyrosine-protein kinase c-Src, in complex with AMP-PNP; PDB ID: 2SRC

All images created in PyMOL by Drew Bryant, 2008



Secondary Structure

Solvent Accessibility + B-factor (amino acid mobility)

Side-Chain Polar Contacts

Surface Polarity


Contact

email: drew.h.bryant/AT/gmail.com