CONTRAlign: CONditional TRAining for Protein Sequence Alignment

Description

CONTRAlign is an extensible and fully automatic parameter learning framework for pairwise protein sequence alignment based on pair conditional random fields. The framework supports feature-rich alignment models that generalize well to previously unseen sequences, and it avoids overfitting by controlling model complexity through regularization.
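As a rough illustration of what training a pair conditional random field with regularization means, the generic form of the objective can be written as follows. The symbols are assumptions introduced here, not notation from CONTRAlign itself: x and y are the two sequences, a is an alignment, f is a feature vector, w is the parameter vector, m is the number of training alignments, and C is a regularization strength. The squared L2 penalty is one common choice; the description above says only that regularization is used.

```latex
\[
P(a \mid x, y; w) =
  \frac{\exp\big(w^{\top} f(x, y, a)\big)}
       {\sum_{a'} \exp\big(w^{\top} f(x, y, a')\big)}
\]
\[
\hat{w} = \arg\max_{w} \; \sum_{i=1}^{m}
  \log P\big(a^{(i)} \mid x^{(i)}, y^{(i)}; w\big) \;-\; C \, \lVert w \rVert_2^2
\]
```

The first term rewards parameters that make the example alignments probable; the penalty term keeps the weights small, which is how model complexity is controlled.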

When given as few as 20 example alignments, CONTRAlign simultaneously learns both a substitution matrix and gap penalties, yielding accuracies competitive with modern alignment tools. This stands in stark contrast to traditional methods of parameter estimation, which typically require several orders of magnitude more training data.
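To make the two kinds of learned parameters concrete, here is a minimal Python sketch of how a substitution matrix and gap penalties combine to score one fixed alignment. It is not CONTRAlign's code or feature set: the function name alignment_score, the affine gap convention, and all numeric values are hypothetical and chosen only for illustration.

```python
from typing import Dict, Optional, Tuple


def alignment_score(row_x: str, row_y: str,
                    subst: Dict[Tuple[str, str], float],
                    gap_open: float, gap_extend: float) -> float:
    """Score a gapped pairwise alignment under a simple affine gap model.

    Convention assumed here: the first column of a gap run costs gap_open,
    and every further column of the same run costs gap_extend.
    """
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have equal length")
    total = 0.0
    prev_gap: Optional[str] = None  # row holding the gap in the previous column
    for cx, cy in zip(row_x, row_y):
        if cx == "-" and cy == "-":
            raise ValueError("a column cannot contain two gaps")
        if cx == "-" or cy == "-":
            which = "x" if cx == "-" else "y"
            # Continue an existing gap run in the same row, or open a new one.
            total += gap_extend if prev_gap == which else gap_open
            prev_gap = which
        else:
            total += subst[(cx, cy)]  # substitution score for an aligned pair
            prev_gap = None
    return total


# Hypothetical toy parameters: +1 for identical residues, -1 otherwise.
toy_subst = {(a, b): (1.0 if a == b else -1.0) for a in "ACDE" for b in "ACDE"}

score = alignment_score("AC-DE", "ACCD-", toy_subst,
                        gap_open=-3.0, gap_extend=-0.5)
print(score)
```

In a learning framework such as the one described above, the entries of the substitution table and the two gap values would be fitted from example alignments rather than set by hand.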