StickWRLD :: The Research Institute at Nationwide Children's Hospital, Columbus, Ohio

StickWRLD: Visual Analytics Over Many Data Types

Visualizing multiple-sequence alignments

Many powerful tools have been created to detect and describe the similarities among multiple nucleic acid or protein sequences. Frequently these take the form of a sequence consensus, expressing either simple most-popular positional identities, positional identities with allowances for varying positions, or some type of statistical description of the positional frequency characteristics of the defining sequence family. Although some of these tools provide intuitively interpretable descriptions of the consensus itself, they typically give the viewer no information about regions of the sequence that might have inter-positional dependencies and that therefore do not obey strict consensus behavior. Such non-consensus behavior may be related to inter- or intra-molecular structural interactions, to phylogenetic relationships, or to less well understood biological causes. Recognizing these interactions can be useful for developing better motif and family descriptions, or for defining possible structural constraints on molecular families with unknown structure.
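As a rough illustration of the simplest consensus descriptions mentioned above, the following Python sketch computes per-position residue frequencies and a most-popular consensus for a small alignment. The function names and the toy alignment are hypothetical and are not part of MAVL or StickWRLD.

```python
from collections import Counter

def positional_frequencies(alignment):
    """Return, for each alignment column, a dict of residue -> relative frequency."""
    n = len(alignment)
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same aligned length")
    return [
        {res: count / n
         for res, count in Counter(seq[i] for seq in alignment).items()}
        for i in range(length)
    ]

def majority_consensus(alignment):
    """Build a consensus string from the most frequent residue in each column."""
    freqs = positional_frequencies(alignment)
    return "".join(max(col, key=col.get) for col in freqs)

# Hypothetical toy alignment (illustrative only, not real sequence data).
toy_alignment = ["ACGT", "ACGA", "ACGT", "TGGA", "TGGT"]
print(majority_consensus(toy_alignment))
```

In this toy family the first column is A exactly when the second column is C, and T exactly when the second is G. Neither the consensus string nor the per-column frequencies can express that dependency.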

We have developed MAVL (Multiple Alignment Variation Linker) and StickWRLD. MAVL is our Web-based application for detecting and displaying correlations in biomolecular sequence families. MAVL examines all pairs of positions across a collection of pre-aligned sequences and identifies any residue pairs that occur with unexpected frequency. It then constructs either a static, downloadable VRML graph of the alignment properties or a Java-powered interactive interface to the alignment. MAVL/StickWRLD can be used through its Web interface. This research has produced several publications in Nucleic Acids Research, including an invited cover. Additional information is available on the MAVL/StickWRLD website.
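The pairwise scan described above can be sketched in Python as follows. The text does not say which statistic MAVL uses, so this sketch assumes a simple residual: the observed joint frequency of a residue pair minus the frequency expected if the two positions were independent. The function name, the threshold value, and the toy alignment are all hypothetical.

```python
from collections import Counter
from itertools import combinations, product

def unexpected_pairs(alignment, threshold=0.1):
    """Flag residue pairs whose joint frequency departs from independence.

    Returns tuples (pos_i, residue_i, pos_j, residue_j, residual), where
    residual = observed - expected. This statistic is an assumption made
    for illustration, not MAVL's documented method.
    """
    n = len(alignment)
    length = len(alignment[0])
    # Residue counts for each single column.
    singles = [Counter(seq[i] for seq in alignment) for i in range(length)]
    hits = []
    for i, j in combinations(range(length), 2):
        # Joint counts of the residues seen together at columns i and j.
        joint = Counter((seq[i], seq[j]) for seq in alignment)
        # Include pairs that never co-occur, so under-representation is found too.
        for a, b in product(singles[i], singles[j]):
            observed = joint[(a, b)] / n
            expected = (singles[i][a] / n) * (singles[j][b] / n)
            residual = observed - expected
            if abs(residual) >= threshold:
                hits.append((i, a, j, b, residual))
    return hits

# Hypothetical toy alignment (illustrative only, not real sequence data).
toy_alignment = ["ACGT", "ACGA", "ACGT", "TGGA", "TGGT"]
for hit in unexpected_pairs(toy_alignment):
    print(hit)
```

On the toy alignment only the first two columns are flagged: A with C and T with G are over-represented, while A with G and T with C are under-represented. A real analysis would need a statistic and threshold that account for sample size and alignment depth.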

Visual Analytics for Protein Structural Constraints

The high cost and difficulty of determining protein structures experimentally have led to numerous methods that attempt, with varying levels of success, to predict protein structures from their sequences. Many of these methods focus on determining structural constraints from apparently co-evolved mutations within the protein sequences.

Despite compelling logic – structurally stabilizing interactions between amino-acid pairs should encourage the change of the paired partner if one of the pair changes properties – the pursuit of such constraints from patterns of co-evolution has met with limited success. Many protein families contain multiple pairs of amino acids with quite highly correlated co-evolved mutations, which nonetheless have no structural contact. The inescapable consensus has been that there is simply too little information contained in the correlated mutations to widely use them for making predictions about structural constraints.

In pursuing an unrelated topic examining correlated mutations, my laboratory has made an observation that challenges this consensus. We observed that in many regions where there are structural contacts, many sequence families show an abundance of weakly co-evolved mutations, and that, in our visualization tool, this abundance of weakly correlating residue identities takes on a distinct visual pattern. The pattern, derived entirely from the multiple-sequence family alignment, is sufficiently distinct that it appears to be unmistakable, even to the untrained eye. We have shown these visualizations to art students who had no information about what they were looking at or what we were looking for, and they gravitated directly to portions of the visualization that were related to structural contacts. Because of the consistent occurrence of these visualizable patterns, and because the users who have tested our visualization method consistently identify the same pattern regions in any given visualization, we are confident that this method generalizes across protein families. More importantly, by studying how humans identify the patterns, we expect that an algorithm can be trained to make the same predictions.

Through the aims of this proposal we will validate that these patterns are present and indicative of structural relationships across a broad spectrum of protein sequence families, determine optimal visualization parameters to allow the best prediction of constraints, and develop a pattern-mining application that can learn how users make their predictions. We will incorporate our findings into an online, Web-accessible application for determining these constraints, and make the database of predictions available online for additional study.

Project Members:

- William C. Ray, PhD
- Wolfgang Rumpf, PhD
- Maya Sugembong
- Nicholas Callahan
