Questions

The aim of the CONTRA project is to better understand discriminatively-trained probabilistic models of sequences (known as conditional random fields or CRFs) and their application to a variety of problems in computational biology.

More specifically, most current applications of probabilistic sequence models in computational biology use generative probabilistic models, such as hidden Markov models (HMMs) and stochastic/probabilistic context-free grammars (SCFGs/PCFGs). Generative models treat sequences as the result of simulating a stochastic process: in HMMs, the process transitions from one state of a finite state automaton to the next; in SCFGs/PCFGs, the process randomly picks the next production rule to apply to the current partial parse tree. While generative models are intuitive and allow convenient parameter training via maximum joint likelihood techniques, they also make many strong assumptions regarding the stochastic nature of the data they attempt to fit.
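To make the generative view concrete, here is a minimal sketch of how an HMM "simulates" a sequence: pick a state, emit a symbol from that state, move to the next state, and repeat. The two-state model and all of its probabilities are hypothetical values chosen for illustration; they are not parameters from any CONTRA tool.

```python
import random


def sample_hmm(initial, transitions, emissions, length, rng=None):
    """Sample (states, symbols) of the given length from an HMM.

    Each distribution is a dict mapping an outcome to its probability.
    """
    rng = rng or random.Random()

    def draw(dist):
        # Draw one outcome from a {outcome: probability} dictionary.
        outcomes = list(dist)
        weights = [dist[o] for o in outcomes]
        return rng.choices(outcomes, weights=weights, k=1)[0]

    states, symbols = [], []
    state = draw(initial)
    for _ in range(length):
        states.append(state)
        symbols.append(draw(emissions[state]))  # emit from the current state
        state = draw(transitions[state])        # then move to the next state
    return states, symbols


# Hypothetical two-state model over DNA (illustrative numbers only).
INITIAL = {"AT_rich": 0.5, "GC_rich": 0.5}
TRANSITIONS = {
    "AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
    "GC_rich": {"AT_rich": 0.1, "GC_rich": 0.9},
}
EMISSIONS = {
    "AT_rich": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "GC_rich": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}

states, symbols = sample_hmm(INITIAL, TRANSITIONS, EMISSIONS, 20, random.Random(0))
print("".join(symbols))
```

The strong assumptions mentioned above are visible in the sketch: each symbol depends only on the current state, and each state depends only on the previous one.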

In the CONTRA project, we consider the application of discriminative probabilistic models, an alternative to generative models, to various problems in computational biology for which generative models have previously been proposed. Unlike generative models, discriminative models are trained by maximizing conditional likelihood (i.e., CONditional TRAining). By applying discriminative techniques to these problems, we hope to

  1. gain an understanding of when generative or discriminative techniques are appropriate for a given problem,
  2. learn about the biology of the processes we model by examining the model estimated by the learning algorithm, and
  3. provide a sound mathematical foundation for new tools which deal with problems in computational biology.
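The difference between joint and conditional training described above can be written out explicitly. The notation here is an assumption made for illustration: x^{(i)} is an input sequence, y^{(i)} its annotation (for example, an alignment or parse), \theta and w are parameter vectors, and F is a feature vector.

```latex
% Generative training: maximize the joint likelihood of inputs and annotations.
\hat{\theta}_{\mathrm{joint}}
  = \arg\max_{\theta} \sum_{i=1}^{m} \log P\bigl(x^{(i)}, y^{(i)}; \theta\bigr)

% Discriminative (conditional) training: maximize the likelihood of the
% annotations given the inputs.
\hat{\theta}_{\mathrm{cond}}
  = \arg\max_{\theta} \sum_{i=1}^{m} \log P\bigl(y^{(i)} \mid x^{(i)}; \theta\bigr)

% A conditional random field defines the conditional distribution directly
% as a log-linear model over the feature vector F(x, y):
P(y \mid x; w)
  = \frac{\exp\bigl(w^{\top} F(x, y)\bigr)}
         {\sum_{y'} \exp\bigl(w^{\top} F(x, y')\bigr)}
```

Because the conditional objective never models how x itself is generated, it does not need the independence assumptions that a generative model places on the input sequences.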


Unsurprisingly, the best alignment tool depends on the situation. In our cross-validation tests, CONTRAlign achieved, on average, statistically significantly higher pairwise alignment accuracy than other current tools. However,

  • for any particular alignment task, any of a number of different alignment tools (see links page) may perform the best;
  • for particular alignment situations, other alignment algorithms may be better. For instance, when using an alignment to identify a small region of local similarity within two long sequences, you're best off using a local alignment tool (CONTRAlign is fundamentally a global aligner).
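The global versus local distinction in the second point can be illustrated with a small dynamic-programming sketch. This is textbook Needleman-Wunsch (global) and Smith-Waterman (local) scoring with a simple linear gap penalty, not CONTRAlign's model; the function name and the scoring values are assumptions chosen for illustration.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best global or local alignment score under a linear gap penalty."""
    # Row 0: global alignment pays for leading gaps, local alignment does not.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                # A local alignment may restart anywhere, so scores never drop below 0.
                score = max(score, 0)
                best = max(best, score)
            curr.append(score)
        prev = curr
    # Global: score of aligning both sequences end to end.
    # Local: best score of any pair of substrings.
    return best if local else prev[-1]


# Two sequences that share only a short motif in the middle.
a = "CCCCCCGATTACACCCCCC"
b = "TTTGATTACATTT"
print("global:", alignment_score(a, b))
print("local: ", alignment_score(a, b, local=True))
```

In this example the global score is dragged down by the unrelated flanking regions, while the local score reflects only the shared motif, which is why a local aligner is the better choice for this kind of task.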

In general, no matter which tool you use, be cautious about trusting an alignment reported by any automatic alignment program, especially when the sequences have low percent identity (<30%)!
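As a rough way to check which regime a pair of sequences falls in, percent identity can be computed from an existing pairwise alignment. The sketch below assumes one common convention: identical residues divided by the number of columns in which neither sequence has a gap. Other conventions (for example, dividing by the length of the shorter sequence) give different numbers, and the function name is hypothetical.

```python
def percent_identity(row_a, row_b, gap="-"):
    """Percent identity over the gap-free columns of a pairwise alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    aligned = identical = 0
    for x, y in zip(row_a, row_b):
        if x == gap or y == gap:
            continue  # skip columns where either sequence has a gap
        aligned += 1
        if x.upper() == y.upper():
            identical += 1
    return 100.0 * identical / aligned if aligned else 0.0


# Four gap-free columns, three of them identical.
print(percent_identity("AC-GT", "ACTGA"))  # 75.0
```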


CONTRAlign should do as well as any other aligner in most pairwise alignment situations. CONTRAlign is based on a global alignment model, so if you think your sequences may share only a region of local similarity, try using the "double-affine gap scoring" option with CONTRAlign.


Currently, there is not. Parameter learning in CONTRAlign relies on the existence of a training set of gold standard alignments. While structural methods are used to obtain gold standard protein sequence alignments, currently no equivalent procedure exists for DNA sequences.

We eventually plan to incorporate the unsupervised learning methods used in PROBCONS for nucleotide sequence alignment. Preliminary tests of this strategy using the BRAliBASE 2 RNA alignment benchmark database suggest that such an approach holds much promise for developing high-accuracy alignment algorithms without reference data.


Please cite:

Do, C.B., Gross, S.S., and Batzoglou, S. (2006) CONTRAlign: Discriminative Training for Protein Sequence Alignment. In Proceedings of the Tenth Annual International Conference on Computational Molecular Biology (RECOMB 2006).
