In human genetics, we tend to equate the strength of statistical evidence with either likelihood ratios (LRs) or p-values. However, both LRs and p-values have highly undesirable properties as evidence measures, in human genetics and in many other settings. Among other deficiencies, they can erroneously indicate diminished evidence as more data are considered. This violates a fundamental assumption underlying standard human genetic linkage and association designs, in which we first scan the genome for our best signals and then follow up at those genomic positions with additional data, assuming that the largest LRs or smallest p-values correspond to positions with the strongest evidence, and that following up with more data will strengthen true positive signals. We are developing a coherent theoretical approach to formalizing statistical evidence measures, using a set of minimal requirements that any evidence measure should meet and drawing heavily on an analogy with the development of the thermometer, which rested on advances in thermodynamics. We expect that measures of evidence that come closer to meeting these requirements will do a better job of finding and characterizing genes and will have greater utility for other disciplines as well.
We have formulated a quantitative trait (QT) analog of the posterior probability of linkage (PPL), the QT-PPL, which directly measures the evidence that a QT is linked to (and/or in linkage disequilibrium, LD, with) a genetic marker or location. The new QT-PPL is based on a classical single-locus QT likelihood with the trait parameters (allele frequency, genotypic means and variances) integrated out. We have shown using simulations that the QT-PPL is robust to two key modeling violations (multiple trait loci and non-normality in the form of excess kurtosis) and is inherently ascertainment corrected, and we have illustrated the advantages of the QT-PPL over other QT linkage methods for accumulating linkage evidence across multiple sets of data. We are extending this research to further our understanding of quantitative heterogeneity by assessing more physiological processes, and in more depth, than has been customary in human genetics. By examining these processes we will be able to address critical phenotypic issues, including, for example, the relationship between clinical diagnosis and quantitative measures in the context of treatment. We will accomplish this analysis by using existing, and developing new, multivariate and mixed quantitative/categorical linkage and association methods.
Although large-scale whole-genome linkage/association approaches have provided information concerning genetic etiology, results have been difficult to find and to replicate. Clearly no one study design or method of analysis will be uniquely successful. Geneticists have long appreciated that differences in ancestry may introduce differences in genetics (heterogeneity) when multiple pedigrees from different subpopulations are combined into larger samples. This heterogeneity may reduce power because the effects of a specific genetic factor may depend on ancestry-related differences in genetic background. Furthermore, combining samples across multiple studies is only powerful to the extent that the genetic etiology of disease is consistent across those subpopulations. Aside from combining samples, the other longstanding approach in the genetics literature has been simply to exclude non-Caucasian samples, an increasingly untenable approach given the growing proportion of non-Caucasians in the US population. Most often, analysis is performed twice, once with the entire family collection and then again excluding the few non-Caucasian families (often too few for an independent analysis). Rather than limiting or excluding heterogeneity, we present a complementary genetic approach that capitalizes on it and allows for the direct study of non-Caucasian samples. Mapping by admixture linkage disequilibrium (MALD) maps genes using affected individuals only (cases rather than families) from a population that is composed of a mixture of ancestries (e.g., African Americans). While population substructure is a weakness for traditional genetic analysis, it is the foundation of MALD. In essence, we statistically infer the ancestry of each chromosomal segment as coming from one subpopulation or the other.
Thus, at each genetic location, a single African American person with autism will carry 0, 1, or 2 DNA segments of West African ancestry (and the complementary number of European segments). Locations relevant to autism etiology will be observed as a statistical excess of ancestral DNA from either population. We are casting this statistical genetic problem into the PPL framework.
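The idea of looking for a local excess of one ancestry among cases can be illustrated with a small case-only calculation. The sketch below is not the PPL-based formulation described above; it is a simplified z-score under stated assumptions: local ancestry has already been inferred as 0, 1, or 2 copies per person per location, loci are treated as independent, and each person's genome-wide ancestry proportion is estimated from their own row. The function names `ancestry_excess_z` and `simulate_cases` and all numeric settings are hypothetical.

```python
import math
import random


def ancestry_excess_z(copies):
    """Case-only z-score per locus for an excess (or deficit) of ancestry A.

    copies[i][j] is the number (0, 1, or 2) of ancestry-A segments carried
    by case i at locus j. Assumption: loci are treated as independent.
    """
    n_cases = len(copies)
    n_loci = len(copies[0])
    # Each person's genome-wide proportion of ancestry A.
    per_person = [sum(row) / (2.0 * n_loci) for row in copies]
    expected = sum(per_person) / n_cases
    # Binomial variance of the locus-level proportion under no local excess.
    var = sum(p * (1.0 - p) for p in per_person) / (2.0 * n_cases ** 2)
    sd = math.sqrt(var)
    z = []
    for j in range(n_loci):
        observed = sum(row[j] for row in copies) / (2.0 * n_cases)
        z.append((observed - expected) / sd)
    return z


def simulate_cases(n_cases, n_loci, risk_locus, shift, seed=0):
    """Toy data: ancestry A is over-represented by `shift` at one locus."""
    rng = random.Random(seed)
    copies = []
    for _ in range(n_cases):
        p = rng.uniform(0.5, 0.8)  # hypothetical individual admixture proportion
        row = []
        for j in range(n_loci):
            q = min(1.0, p + shift) if j == risk_locus else p
            row.append(int(rng.random() < q) + int(rng.random() < q))
        copies.append(row)
    return copies


copies = simulate_cases(n_cases=400, n_loci=100, risk_locus=37, shift=0.15, seed=1)
z = ancestry_excess_z(copies)
top = max(range(len(z)), key=lambda j: z[j])
print("locus with largest ancestry excess:", top)
```

In real admixed genomes, neighboring locations share ancestry blocks, so the independence assumption here would need to be replaced by a model of ancestry along the chromosome.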
The logical consequence of the high cost and difficulty of determining protein structures experimentally has been the production of numerous methods which, with varying levels of success, attempt to use protein sequences to predict their resultant structures. The focus of many of these has been the determination of structural constraints from apparently co-evolved mutations within the protein sequences. Despite compelling logic (structurally stabilizing interactions between amino-acid pairs should encourage a change in one partner when the other changes properties), the pursuit of such constraints from patterns of co-evolution has met with limited success. Many protein families contain multiple pairs of amino acids with quite highly correlated co-evolved mutations which nonetheless have no structural contact. The prevailing consensus has been that there is simply too little information contained in the correlated mutations to use them widely for making predictions about structural constraints. In pursuing an unrelated topic examining correlated mutations, my laboratory has made an observation that challenges this consensus. We observed that in many regions where there are structural contacts, many sequence families show an abundance of weakly co-evolved mutations, and, using our visualization tool, this abundance of weakly correlating residue identities takes on a distinct visual pattern. The pattern, derived entirely from the multiple-sequence family alignment, is sufficiently distinct that it appears to be unmistakable, even to the untrained eye. We have shown these visualizations to art students with no information at all about what they were looking at, or what we were looking for, and had them gravitate directly to portions of the visualization that were related to structural contacts.
Because of the consistent occurrence of these visualizable patterns, and because the users who have tested our visualization method consistently identify the same pattern regions in any given visualization, we are confident that this method generalizes across protein families. More importantly, by studying how humans identify the patterns, we expect that an algorithm can be trained to make the same predictions. Through the aims of this proposal we will validate that these patterns are present and indicative of structural relationships across a broad spectrum of protein sequence families, determine optimal visualization parameters to allow the best prediction of constraints, and develop a pattern-mining application which can learn how users make their predictions. We will incorporate our findings into an online, WWW-accessible application for determining these constraints, and make the database of predictions available online for additional study.
The BCMM houses the data-coordinating center (DCC) for two large autism genetics research consortia. Dr. Vieland serves as Director of the DCC and as a principal investigator on both projects. The first is the Autism Genetics Cooperative (AGC), which encompasses 10 institutions from the United States. The second is an international consortium of consortia, of which the AGC is one, called the Autism Genome Project (AGP). Scientists from over 50 research centers work together on the project. Initiated by the National Alliance for Autism Research, NAAR (now Autism Speaks), the AGP is funded by international, private and public partners. Dr. Vieland also works on three additional autism research projects (two funded by the NIH and one by the Canadian government), studying the genetics of autism from different perspectives. Among other components of these studies, a collection of “extended pedigrees” is being assembled. That is, rather than focusing on nuclear families with two individuals (typically children) with autism, as is frequently done, this project is studying families who have at least three individuals with a diagnosis of autism, Asperger's syndrome, or ASD on at least two different branches of their family tree. For example, three brothers who each have a child with autism would form a larger family that meets criteria for this study. Because these families are rare they have been difficult to locate, but they do exist. Studying just a few of these families can be a much more powerful approach to finding genes than studies of large numbers of nuclear families. In this study we aim to find as many extended pedigrees in North America as possible. We believe that conducting a study of such families may reveal important clues about the genetics of autism.
Using in vitro and in silico approaches, we assessed the relationships between TCR signaling, synapse formation, TCR downregulation, and antigen quality. We found that a peptide that exhibits many hallmarks of a weak agonist (shorter half-life than the wild-type ligand, poor receptor downregulation, decreased p23 to p21 phospho-ζ ratio, and inefficient cSMAC formation) could stimulate T cells to proliferate more than the wild-type agonist ligand. Results from a computational model suggested that this is because the immunological synapse regulates the kinetics of signaling differently as antigen quality is varied. Although, in general, TCR signaling decreases for a ligand that binds TCR with a shorter half-life, our data suggest that in some cases, the inability to induce cSMAC formation can enhance the ability of some peptides to signal by attenuating TCR downregulation. These results suggest that the quality of a T cell antigen is determined by a complex interplay between many factors.
Recent work suggests that calcium fluxes can be stimulated by a few agonists, and that this response is analog in nature. In contrast, many other downstream responses require a larger threshold number of agonists (i.e., a digital response). We have previously demonstrated that RasGRP-Sos crosstalk is the basis for optimal Ras-ERK activation in lymphocytes. Here we describe the results of synergistic computational and experimental studies which demonstrate that a positive feedback mechanism involved in the Sos-catalyzed activation of Ras results in a threshold for efficient downstream signaling during membrane-proximal signaling events. Our results also show that clonal cells subjected to the same amount of stimulation partition into two subpopulations (characterized by strong and weak ERK activation) because this feedback regulation leads to a bistable response. Our computational analysis predicts that a consequence of this positive feedback regulation is an emergent hysteretic behavior that allows stimulated T cells to sustain a high level of signaling even when the stimulus falls below the original activation threshold. This feature could confer stability of T cell activation to fluctuations in the stimulus or TCR levels.
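The link between positive feedback, bistability, and hysteresis can be seen in a one-variable toy model. The sketch below is not the Ras/Sos model used in the studies described above; it assumes a generic textbook rate law, dx/dt = s + x^2/(1 + x^2) - r*x, where `s` stands in for the stimulus and `x` for an activated signaling species. The value r = 0.55 and the sweep range are hypothetical choices that make this toy equation bistable for intermediate `s`.

```python
def rate(x, s, r=0.55):
    """Toy rate law: basal stimulus + cooperative positive feedback - decay."""
    return s + x * x / (1.0 + x * x) - r * x


def relax(x, s, t_total=200.0, dt=0.05):
    """Forward-Euler integration toward a steady state at fixed stimulus s."""
    for _ in range(int(t_total / dt)):
        x += dt * rate(x, s)
    return x


def sweep(stimuli, x0):
    """Step the stimulus through `stimuli`, carrying the state along."""
    x, states = x0, []
    for s in stimuli:
        x = relax(x, s)
        states.append(x)
    return states


stimuli = [i * 0.005 for i in range(25)]          # 0.000 ... 0.120
up = sweep(stimuli, x0=0.0)                       # slowly increase the stimulus
down = sweep(stimuli[::-1], x0=up[-1])[::-1]      # then slowly decrease it

for s, lo_branch, hi_branch in zip(stimuli, up, down):
    print(f"s={s:.3f}  increasing: x={lo_branch:.3f}  decreasing: x={hi_branch:.3f}")
```

For intermediate stimulus values the two sweeps settle on different states: once the system has switched on, it stays on at stimulus levels that were too low to activate it in the first place, which is the hysteretic behavior described above.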
Schizophrenia and bipolar disorders are complex diseases with evidence for multiple susceptibility genes. We know from biology in general, and neurobiology in particular, that single genes do not function in a vacuum; they are parts of intricately connected and carefully balanced networks. Yet, largely due to necessity, our attempts to dissect the genetics of schizophrenia have relied on examining one gene at a time, ignoring the importance of the genetic context in which a given gene functions. In this research we seek to move beyond the one-gene-at-a-time approach to understanding illness, to acknowledge the biological complexity that we are trying to understand, and to apply multidisciplinary methods to leverage existing findings into an entry point that can be used to begin to reveal the genetic architecture of this complex disorder. Building on past success, we are using a novel genetic analysis method to map genes that interact with known risk alleles. This will be followed by fine-mapping and candidate gene evaluation using both statistical and biological methods. The biology of newly identified susceptibility genes will be further characterized in vitro and in vivo. This work involves close collaboration with Linda Brzustowicz’s laboratory at Rutgers University.
Specific language impairment (SLI) is an idiopathic, complex neurodevelopmental disorder consisting of clinically depressed language ability despite normal hearing, education and intelligence. Approximately 5-7 percent of school-age children meet these criteria, and they represent the largest portion of children receiving special education services within the nation’s public school system. These services cost on average an additional $2,400 per year per child, and lost educational opportunities further reduce tax revenue. SLI is consistently heritable, whether defined categorically or quantitatively (Bishop et al. 1995; Tomblin and Buckwalter 1998). To date, two genome-wide scans for SLI have been conducted (Bartlett et al. 2002; SLI Consortium 2002). In a previous funding period we found compelling evidence for a susceptibility allele on 13q21-22 within our families selected for SLI (Bartlett et al. 2002), with a PPL of 53 percent (LOD=3.92). This finding was followed up with a replication PPL of 17 percent (LOD=2.62) (Bartlett et al. 2004); joint analysis of both datasets shows strong evidence for linkage to 13q21 with a PPL of 96.9 percent (LOD=7.86). We are executing a series of complementary approaches to assist in understanding the highly complex interrelationships between language and reading measures and how those relationships may underlie SLI. To this end we are employing new multivariate approaches which are expected to 1) improve our understanding of how quantitative language and reading measures, some perhaps representing different etiologic mechanisms, relate to SLI as a diagnosis and 2) increase power to detect novel loci. Our strong linkage signal on 13q21 provides a solid foundation for a series of linkage/association analyses using univariate and multivariate phenotypes, including both categorical affection status and quantitative measures of language and related skills, in order to dissect the multiple, heterogeneous pathways to an SLI diagnosis.
Recurrent urinary tract infections in women are typically thought to be caused by repeated re-introduction of infectious bacteria. Nationwide Children's Hospital investigator Dr. Sheryl Justice has proposed that recurrent infections are instead caused by a residual population of the original colonizing bacteria which become quiescent and evade the host's immune response, persisting intracellularly within the bladder epithelium until some event induces their reactivation, whereupon a new cycle of infection begins. Preliminary evidence from mouse models suggests that the primary determinant of survival is a morphological change (filamentation) undergone by a subset of the bacterial population, which apparently protects them from phagocytosis and possibly other host-immune effectors. The signals that mediate the morphological fates of the individual bacteria are so far unknown. In this collaborative project we are developing a computer-vision and object-recognition application which will help to quantitate intracellular bacterial colony architecture, organization and composition. The computational problem involves volume visualization and shape analysis within reconstructed 3D models obtained by optical microscopy. It is analogous to a visual attempt to quantitate the composition and organization of a plate of mixed rice and spaghetti.
Detection of different extracellular stimuli leading to functionally distinct outcomes is ubiquitous in cell biology, and is often mediated by differential regulation of positive and negative feedback loops that are part of the signaling network. In some instances, these cellular responses are stimulated by small numbers of molecules, and so stochastic effects could be important. Therefore, we studied the influence of stochastic fluctuations on a simple signaling model with dueling positive and negative feedback loops. The class of models we have studied is characterized by single deterministic steady states for all parameter values, but the stochastic response is bimodal, a behavior that is distinctly different from models studied in the context of gene regulation. For example, when positive and negative regulation are roughly balanced, a unique deterministic steady state with an intermediate value for the amount of a downstream signaling product is found. Yet, for small numbers of signaling molecules, stochastic effects result in a bimodal distribution for this quantity, with neither mode corresponding to the deterministic solution; i.e., cells are in "on" or "off" states, not in some intermediate state. For a large number of molecules, the stochastic solution converges to the mean-field result. When fluctuations are important, we find that signal output scales with control parameters "anomalously" compared to mean-field predictions. The necessary and sufficient conditions for the phenomenon we report are quite common. Our findings are therefore expected to be of broad relevance, and suggest that stochastic effects can enable binary cellular decisions.
With classical computational chemistry techniques, the time required for docking small molecules to large molecules (e.g., repressor molecules to enzymes) is dominated by electrostatic force calculations. This computation is expensive due both to the unavoidable mathematical complexity of the calculation and to the rotational and translational degrees of freedom that must be afforded the system to produce a realistic simulation of the small molecule's trajectory as it approaches, and ultimately docks with, the large molecule. These factors conspire to produce a system where many time-intensive instructions must be executed at each timepoint of a docking simulation, and where the expensive results can be neither pre-computed nor cached for re-use, due to the relative movement of the molecules between timepoints.
Despite the necessity for numerous, expensive calculations for physically correct simulations of docking, we propose that computational screens that accept or reject potential binding candidates from a library can be accomplished with fewer, less time-consuming calculations. Using an extension of the FlatWorld surface mapping proposed by Ray (Rustbelt RNA 1998), we produce topologically planar descriptions of macromolecular and candidate ligand surfaces. These descriptions capture local topography and electrostatic configuration as projected on a plane. The molecule is effectively "skinned." We propose that these dramatically simplified descriptions contain sufficient detail to be used as a rapid binding pre-screen that eliminates completely non-viable binding candidates, while removing at least 3 degrees of freedom which must otherwise be explored for the calculation.
Statistical methods are used to situate disease genes on maps of the human genome, to characterize the action of those genes on clinically relevant phenotypes, and to understand relationships among genes in influencing features of diseases. The Posterior Probability of Linkage (PPL) framework treats trait model parameters as nuisance parameters and integrates them out of the likelihood rather than fixing them at arbitrary values. This framework has been successfully applied to genetic data analysis for various genetic diseases, such as schizophrenia (SZ), autism, and autoimmune thyroid disease, among others. We have implemented this framework in the software package KELVIN, for use in linkage and/or LD (association) mapping with pedigrees, “trios,” and/or case-control data, with a flexible set of models for handling genetic complexity, including heterogeneity, gene-gene interactions, imprinting, and sex or age effects. New functionality for handling other interesting aspects of genetic architecture is implemented on an ongoing basis.
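The phrase "integrates them out of the likelihood" can be made concrete with a deliberately simplified calculation. The sketch below is not KELVIN's implementation: it assumes that likelihood ratios have already been computed on a grid of trait-parameter values, that the grid points carry equal weight, and that a prior probability of linkage is supplied by the user. The function name `posterior_probability` and the numbers in the example are hypothetical.

```python
def posterior_probability(likelihood_ratios, prior):
    """Posterior probability from a likelihood ratio averaged over nuisance parameters.

    likelihood_ratios: LR values evaluated on a grid of trait-parameter
        settings (equal weights assumed, i.e. a uniform prior on the grid).
    prior: assumed prior probability of linkage, strictly between 0 and 1.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    # Averaging over the grid approximates integrating the parameters out.
    averaged_lr = sum(likelihood_ratios) / len(likelihood_ratios)
    # Bayes' rule for the two hypotheses "linked" versus "not linked".
    return prior * averaged_lr / (prior * averaged_lr + (1.0 - prior))


# Hypothetical LR values on a four-point grid and a hypothetical prior of 2%.
print(posterior_probability([0.5, 3.0, 40.0, 12.0], prior=0.02))
```

Because the result is a probability rather than a maximized statistic, an averaged likelihood ratio below 1 moves the posterior below the prior, so the measure can register evidence against linkage as well as for it.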
Kelvin’s approach to modeling genetic architecture is computationally intensive and requires continuous software development and extensive evaluation via simulations, which in turn require considerable computational resources. The primary computational challenge is integration of moderately high-dimensional functions lacking a closed-form solution. While approximate techniques such as MCMC are one possibility, we prefer a very different and more directly mathematical (rather than stochastic) approach, DCUHRE, a sub-region adaptive algorithm using an embedded family of fully symmetric multiple quadrature rules. These rules have the property that they integrate polynomials, up to a certain degree based on the number of points in the rule, exactly over hypercubes. DCUHRE also adapts to the integrand by iteratively subdividing the domain of integration into subregions based upon error estimates of the integral approximations in current subregions. This results in evaluating the integrand more frequently at points where significant contributions to the integral occur and, by the same token, forgoing unnecessary evaluations elsewhere. Meaningful error estimates are embedded in the calculation and can be used to adjust stopping criteria. We are actively adapting DCUHRE to the idiosyncratic aspects of working with genetic likelihoods. This work also presupposes efficient methods for constructing, storing, and evaluating genetic likelihoods represented as high-order polynomials. Ongoing software engineering focuses on extending the complexity of the calculations that Kelvin can do in order to handle problems ranging from multipoint likelihood calculations in very large and complex pedigrees to analyses involving multiple interacting loci.
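The subdivision strategy described above can be sketched in a few lines. The code below is not DCUHRE and makes several simplifying assumptions: it pairs a product two-point Gauss rule with the midpoint rule in place of DCUHRE's embedded fully symmetric rules, takes the difference between the two as a (conservative) error estimate, and always halves the longest edge of the subregion with the largest estimated error. The names `box_estimate` and `adaptive_integrate` are hypothetical.

```python
import heapq
import itertools
import math

GAUSS_NODE = 1.0 / math.sqrt(3.0)  # two-point Gauss-Legendre node on [-1, 1]


def box_estimate(f, lo, hi):
    """Return (integral estimate, error estimate) for one hyperrectangle."""
    dim = len(lo)
    center = [(a + b) / 2.0 for a, b in zip(lo, hi)]
    half = [(b - a) / 2.0 for a, b in zip(lo, hi)]
    volume = math.prod(2.0 * h for h in half)
    coarse = volume * f(center)  # midpoint rule
    total = 0.0
    for signs in itertools.product((-1.0, 1.0), repeat=dim):
        point = [c + s * GAUSS_NODE * h for c, s, h in zip(center, signs, half)]
        total += f(point)
    fine = volume * total / 2 ** dim  # product two-point Gauss rule
    return fine, abs(fine - coarse)


def adaptive_integrate(f, lo, hi, tol=1e-3, max_boxes=50000):
    """Globally adaptive integration of f over the box [lo, hi]."""
    est, err = box_estimate(f, lo, hi)
    # Max-heap on the error estimate; the counter breaks ties between boxes.
    heap = [(-err, 0, est, list(lo), list(hi))]
    total_err, counter = err, 1
    while total_err > tol and len(heap) < max_boxes:
        neg_err, _, _, blo, bhi = heapq.heappop(heap)
        total_err += neg_err  # remove the popped box's error contribution
        axis = max(range(len(blo)), key=lambda k: bhi[k] - blo[k])
        mid = (blo[axis] + bhi[axis]) / 2.0
        for a, b in ((blo[axis], mid), (mid, bhi[axis])):
            nlo, nhi = list(blo), list(bhi)
            nlo[axis], nhi[axis] = a, b
            e, r = box_estimate(f, nlo, nhi)
            heapq.heappush(heap, (-r, counter, e, nlo, nhi))
            counter += 1
            total_err += r
    total_est = sum(item[2] for item in heap)
    return total_est, total_err


def peak(x):
    """A sharply peaked two-dimensional test integrand."""
    return math.exp(-50.0 * ((x[0] - 0.3) ** 2 + (x[1] - 0.7) ** 2))


estimate, error = adaptive_integrate(peak, [0.0, 0.0], [1.0, 1.0], tol=1e-3)
print(estimate, error)
```

Most subdivisions end up clustered around the peak, which mirrors the behavior described above: the integrand is evaluated more often where it contributes most to the integral, and the accumulated error estimate provides the stopping criterion.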
Many powerful tools have been created to detect and describe the similarities between multiple nucleic acid or multiple protein sequences. Frequently these take the form of a sequence consensus, expressing either simple most-popular positional identities, positional identities with allowances for varying positions, or some type of statistical description of the positional frequency characteristics of the defining sequence family. Although some provide intuitively interpretable descriptions of the consensuses themselves, they typically do not give the viewer any information about regions of the sequence that might have inter-positional dependencies, and that therefore do not obey a strict consensus behavior. Such non-consensus behavior may be related to inter- or intra-molecular structural interactions, to phylogenetic relationships, or to less understood biological causes. Recognizing these interactions can be useful for developing better motif and family descriptions, or for defining possible structural constraints on molecular families with unknown structure.
We have developed MAVL (Multiple Alignment Variation Linker) and StickWRLD. MAVL is our Web-based application for detecting and displaying correlations in biomolecular sequence families. MAVL examines all positional pairs in each of a collection of pre-aligned sequences, determines any pairs that occur with unexpected frequency, and constructs either a static, downloadable VRML graph of the alignment properties or a Java-powered interactive interface to the alignment. MAVL/StickWRLD can be used through its Web interface. This research has produced several publications in Nucleic Acids Research, including an invited cover. Additional information is available on the MAVL/StickWRLD Web site.
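The notion of positional pairs that "occur with unexpected frequency" can be illustrated with a simple observed-minus-expected calculation. The sketch below is not MAVL's actual statistic; it assumes that the expected frequency of a residue pair is the product of the two single-column frequencies and flags pairs whose observed frequency departs from that by at least a chosen amount. The function name `unexpected_pairs`, the `min_residual` threshold, and the toy alignment are hypothetical.

```python
from collections import Counter
from itertools import combinations


def unexpected_pairs(alignment, min_residual=0.2):
    """Flag residue pairs whose co-occurrence departs from independence.

    alignment: list of equal-length, pre-aligned sequences.
    Returns (col_i, col_j, residue_i, residue_j, residual) tuples, where
    residual = observed pair frequency - product of the column frequencies.
    Only pairs seen at least once are examined in this sketch.
    """
    n = len(alignment)
    length = len(alignment[0])
    column_counts = [Counter(seq[i] for seq in alignment) for i in range(length)]
    hits = []
    for i, j in combinations(range(length), 2):
        pair_counts = Counter((seq[i], seq[j]) for seq in alignment)
        for (a, b), count in pair_counts.items():
            observed = count / n
            expected = (column_counts[i][a] / n) * (column_counts[j][b] / n)
            residual = observed - expected
            if abs(residual) >= min_residual:
                hits.append((i, j, a, b, round(residual, 3)))
    return sorted(hits, key=lambda h: -abs(h[4]))


toy_alignment = ["ACGT", "ACGA", "GCCT", "GCCA", "ACGT", "GCCA"]
for hit in unexpected_pairs(toy_alignment):
    print(hit)
```

In the toy alignment, columns 0 and 2 always change together while the fully conserved column 1 carries no pairing information, so only the column 0 and column 2 pairs are reported. A consensus sequence alone would not reveal that dependency.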