
Funded by NSF and NIH

Bioinformatics Summer Institute

Faculty Projects for the 2008 Summer Bioinformatics Institute

We will be adding new projects as they become available.

John Carlis

Although relatively small for a database management system, the data sets produced by gene expression microarray experiments pose analytical and visualization challenges. Microarray data sets consist of gene expression levels. Of particular interest to biologists are genes that are differentially expressed, that is, genes whose expression levels are significantly altered up or down. However, biologically relevant factors in the form of annotations are needed to make sense of the data. Gene Ontology (GO) is a consortium that provides annotations using a consistent terminology for genes and gene products. GO organizes biologically relevant factors under three principles: cellular component, biological process, and molecular function. A gene product can perform one or more molecular functions in one or more biological processes. In addition, it may be active in one or more cellular components. Several tools have been developed to facilitate biologists' analysis of microarray gene expression data sets. However, they are deficient in taking into account the biologically relevant annotations contained in GO.

The goals of this project are twofold: First, we intend to combine HIV microarray gene expression experimental data and GO data in a relational database to form a data set that will facilitate the execution of novel analyses. Second, we intend to explore and develop novel visualization techniques which will allow biologists to further analyze these combined data sets.
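As a rough illustration of the first goal, the sketch below joins an expression table with a GO annotation table in an in-memory SQLite database and counts differentially expressed genes per GO term. The table names, column names, gene identifiers, fold-change values, and the threshold are all assumptions made for this example; they are not the project's schema or data.

```python
import sqlite3


def build_example_db():
    """Create an in-memory database with two hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE expression (
            gene_id TEXT, condition TEXT, log2_fold_change REAL
        );
        CREATE TABLE go_annotation (
            gene_id TEXT, go_term TEXT, aspect TEXT
        );
        """
    )
    # Invented placeholder rows, for illustration only.
    conn.executemany(
        "INSERT INTO expression VALUES (?, ?, ?)",
        [
            ("geneA", "infected", 2.3),
            ("geneB", "infected", -1.5),
            ("geneC", "infected", 0.2),
        ],
    )
    conn.executemany(
        "INSERT INTO go_annotation VALUES (?, ?, ?)",
        [
            ("geneA", "immune response", "biological_process"),
            ("geneA", "nucleus", "cellular_component"),
            ("geneB", "immune response", "biological_process"),
            ("geneC", "nucleus", "cellular_component"),
        ],
    )
    return conn


def differentially_expressed_by_term(conn, threshold=1.0):
    """Count genes per GO term whose |log2 fold change| meets the threshold."""
    query = """
        SELECT a.go_term, a.aspect, COUNT(DISTINCT e.gene_id) AS n_genes
        FROM expression AS e
        JOIN go_annotation AS a ON a.gene_id = e.gene_id
        WHERE ABS(e.log2_fold_change) >= ?
        GROUP BY a.go_term, a.aspect
        ORDER BY n_genes DESC, a.go_term
    """
    return conn.execute(query, (threshold,)).fetchall()


conn = build_example_db()
summary = differentially_expressed_by_term(conn)
```

Keeping expression values and annotations in separate tables lets the same annotations be reused across experiments; the join is what turns a list of altered genes into a summary organized by biological factor.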

John Crow

Semantic web technologies for distributed research informatics
A highly visible area of bioinformatics involves the development of software tools used directly by researchers to explore their data sets and to look up specialized reference information. For the most part, these tools attempt to link your data to existing information. But where does the underlying information come from?

A distributed information model views its world as a community of autonomous information providers and consumers, in which the software tool a researcher is using is a consumer of information. Semantic web technologies are useful in distributed information models. In this project we will explore the use of semantic web technologies, specifically RDF, OWL, and RDF query engines, in the creation of information utilities describing human SNPs. Information providers and consumers will be created, and the roles of ontologies, metadata, and queries examined. Due to the nature of this effort, a good background in Java, Ruby, or Perl is required.
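The provider and consumer roles can be sketched with a toy triple store. RDF represents statements as subject, predicate, object triples, and a query engine matches patterns against them. The sketch below is plain Python rather than a real RDF library or query language, and every identifier in it (the `snp:`, `gene:`, and `ex:` names) is hypothetical.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def match(triples: List[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> List[Triple]:
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# Two independent "providers" publish statements about SNPs (made-up data).
provider_a = [
    Triple("snp:example1", "ex:locatedInGene", "gene:exampleA"),
    Triple("snp:example2", "ex:locatedInGene", "gene:exampleB"),
]
provider_b = [
    Triple("snp:example1", "ex:hasAllele", "A/G"),
    Triple("snp:example3", "ex:locatedInGene", "gene:exampleA"),
]

# A "consumer" merges both sources and asks which SNPs lie in one gene.
merged = provider_a + provider_b
snps_in_gene_a = [
    t.subject
    for t in match(merged, predicate="ex:locatedInGene", obj="gene:exampleA")
]
```

Because both providers use the same predicate name, the consumer can combine their statements without knowing how either source stores its data; agreeing on such shared vocabulary is the role ontologies play.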

Kevin Dorfman

Graphical User Interface for Brownian Dynamics Simulations of Polymers and Biomolecules
Brownian dynamics is a powerful method for simulating the motion of polymers and biomolecules (such as DNA) as they move through complicated geometries, such as a gel. Our group is interested in developing new methods for separating DNA in very small scale structures, and Brownian dynamics is one of the tools that we use to theoretically investigate possible separation techniques. The goal of this project is to develop a graphical user interface (GUI) that will allow users who are unfamiliar with the simulation code to still take advantage of the method. The interface will allow the user to construct the biomolecule and the surrounding environment, input the force fields governing the motion, and then visualize the dynamical results.
Required Skills: Familiarity with some structured programming language.
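For readers new to the method, a single Brownian dynamics time step for a bead-spring chain can be sketched as follows. This is a generic textbook-style update (deterministic drift from the forces plus a random thermal displacement), not the group's simulation code; the Hookean spring model, the parameter names, and the reduced units are assumptions made for this example.

```python
import math
import random


def spring_forces(positions, k_spring):
    """Hookean spring forces between consecutive beads of a linear chain."""
    forces = [[0.0, 0.0, 0.0] for _ in positions]
    for i in range(len(positions) - 1):
        for d in range(3):
            f = k_spring * (positions[i + 1][d] - positions[i][d])
            forces[i][d] += f        # bead i is pulled toward bead i+1
            forces[i + 1][d] -= f    # equal and opposite force on bead i+1
    return forces


def brownian_step(positions, dt, k_spring=1.0, kT=1.0, zeta=1.0, rng=random):
    """Advance all beads by one step: drift F*dt/zeta plus thermal noise.

    zeta is the bead friction coefficient; the noise standard deviation
    sqrt(2*kT*dt/zeta) corresponds to a diffusion coefficient kT/zeta.
    """
    forces = spring_forces(positions, k_spring)
    noise_scale = math.sqrt(2.0 * kT * dt / zeta)
    return [
        [x + f * dt / zeta + noise_scale * rng.gauss(0.0, 1.0)
         for x, f in zip(pos, frc)]
        for pos, frc in zip(positions, forces)
    ]


# Example: a three-bead chain advanced for 100 steps with a seeded generator.
chain = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
rng = random.Random(0)
for _ in range(100):
    chain = brownian_step(chain, dt=0.01, rng=rng)
```

A GUI of the kind described would collect the inputs this function takes (initial bead positions, force parameters, time step) and then display the resulting trajectory.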

Modeling Basement Membrane
The basement membrane forms the scaffolding for tissues in the body, and failure of the basement membrane is associated with a number of diseases. We are interested in developing simple computer models of basement membrane based on known composition and various hypothesized structures. These models will be used to perform simulated mechanical tests on model membranes to predict the macroscopic mechanical behavior, which can then be compared to experimental studies. The intern for this project will be involved in coding Monte Carlo models of basement membrane, running the code, and evaluating the results. The project is a collaboration between V. Barocas (Biomed Eng), K. Dorfman (Chem Eng), and Y. Segal (Medicine); it is funded by NIH (R21 GM082823).
Required Skills: Familiarity with some structured programming language. No prior knowledge of simulation methods is required.

Implementation of ESPResSo for DNA Electrophoresis
The simulation package ESPResSo (http://espressowiki.mpip-mainz.mpg.de/wiki/index.php/Main_Page) allows one to perform molecular dynamics simulations of bead-spring models of polymers. Most notably, it permits the use of advanced methods for implementing hydrodynamic interactions between the polymer and the fluid and long-range electrostatic interactions. We are interested in using this package to complement the Brownian dynamics method discussed above. The goal of this project will be to understand how this simulation package operates and develop scripts that will allow us to utilize it for simulating DNA electrophoresis.
Required Skills: Familiarity with some structured programming language.

Lynda Ellis

Encoding Metabolic Logic
Prediction of microbial metabolism is important for annotating genome sequences and for understanding the fate of chemicals in the environment.

A metabolic Pathway Prediction System has been developed that is freely available on the world wide web (http://umbbd.msi.umn.edu/predict/). It recognizes the organic functional groups found in a compound and predicts transformations based on metabolic rules. These rules are based on reactions catalogued in the University of Minnesota Biocatalysis/Biodegradation Database (UM-BBD). The rule-based nature of the Pathway Prediction System makes it transparent, expandable, and adaptable. Join with us to expand the UM-BBD and its predictive system; learn metabolic logic, and user interface design. Requires knowledge of college-level organic chemistry and computer programming (Java and/or Perl).
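The rule-based idea (recognize functional groups, then apply transformation rules) can be sketched in a few lines. The rules and rule names below are toy examples invented for illustration; they are not taken from the UM-BBD, and the sketch represents a compound simply as a set of functional-group names rather than as a chemical structure.

```python
# Each hypothetical rule: (rule name, functional group consumed, group produced).
TOY_RULES = [
    ("toy_rule_1", "primary alcohol", "aldehyde"),
    ("toy_rule_2", "aldehyde", "carboxylic acid"),
]


def predict(compound, rules):
    """Apply every matching rule once; return (rule name, product) pairs."""
    products = []
    for name, reactant_group, product_group in rules:
        if reactant_group in compound:
            product = (compound - {reactant_group}) | {product_group}
            products.append((name, frozenset(product)))
    return products


def predict_chain(compound, rules, max_steps=5):
    """Follow the first applicable rule repeatedly to build one pathway."""
    chain = [frozenset(compound)]
    for _ in range(max_steps):
        step = predict(chain[-1], rules)
        if not step:
            break
        chain.append(step[0][1])
    return chain


start = frozenset({"primary alcohol", "aromatic ring"})
pathway = predict_chain(start, TOY_RULES)
```

Because each prediction names the rule that produced it, a user can trace why a transformation was proposed, and new rules can be added to the list without changing the prediction code. This is one way to read the transparency and expandability described above.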

Yiannis Kaznessis

Design of genetic regulatory networks
We want to learn to command cells to make specific proteins. These proteins can be catalysts, synthesizing specialty chemicals like pharmaceuticals, or sensor proteins for biological weapons like anthrax, or therapeutic proteins like insulin. To do this we need to understand dynamic gene regulation (when and how DNA gives rise to protein) and to design gene networks (genes influencing the expression of other genes) that perform the tasks at hand, in response to our signals. We have written code that simulates gene networks and can be used to design novel gene circuits, such as the oscillator and the digital clock. The student’s challenge would be to construct interesting designs of genetic circuits that perform specific tasks. Two recent examples of successful designs have been a switch and an oscillator. Future applications of genetic circuits might include biosensors, targeted drug delivery, molecular machines, and biochemical factories.

Design of Antimicrobial Peptides
Antimicrobial peptides (AMPs), molecules produced by the immune systems of animals and plants, are being considered as potential novel antibiotic candidates to combat emerging drug-resistant bacterial strains. The peptides are known to kill bacterial cells by direct membrane attack. Most known AMPs are also toxic, killing mammalian cells, again by direct membrane attack. The mechanism of action of these peptides is not yet clear. We work towards designing and implementing computational solutions to fill the void. We use molecular dynamics simulations of peptides in mammalian and bacterial model membranes to determine the structural characteristics responsible for activity and toxicity. We use this knowledge in designing new peptides that retain their antimicrobial activity but are not toxic.

Develop Wikigene
The goal of this project is to develop Wikigene, using information stored in an SQL database, as a community-wide effort to catalogue the thermodynamic and kinetic constants of gene regulatory interactions. A working knowledge of SQL, HTML, and Java and an understanding of wikis are necessary.

Synthetic Biology
The University of Minnesota Bioinformatics Summer Institute will participate in the Summer 2008 International Genetically Engineered Machine Competition (www.igem2008.com). iGEM is an undergraduate Synthetic Biology competition. Student teams are given a kit of biological parts at the beginning of the summer. Working at their own schools over the summer, they use these parts and new parts of their own design to build biological systems and operate them in living cells. During the first weekend of November, they present their work at the iGEM Competition Jamboree at MIT and have a chance to win prizes. They add their new parts to the Registry of Standard Biological Parts for the students in the next year's competition.

Vipin Kumar

Data mining is useful for discovering interesting patterns in large data sets. These patterns can be formulated as rules, clusters, or sets of items that frequently occur together. Although we have applied data mining to various areas, recently we have been engaged in a number of projects that involve molecular biology, medicine, or both. For instance, we have used the patterns discovered by data mining to identify groups of genes that are similarly expressed under a specific set of conditions and to discover connections between genetic variation and disease.
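To make "sets of items that frequently occur together" concrete, the sketch below counts how often each single item and each pair of items appears across a collection of transactions and keeps those that meet a minimum support. It is a brute-force illustration rather than an efficient algorithm such as Apriori, and it is not the group's software. Here each transaction is imagined as the set of genes up-regulated under one condition; the gene names and the threshold are made up.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return itemsets (as sorted tuples) seen in at least min_support
    transactions, considering itemsets of up to max_size items."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    return {s: c for s, c in counts.items() if c >= min_support}


# Hypothetical data: genes up-regulated in each of four conditions.
conditions = [
    {"g1", "g2", "g3"},
    {"g1", "g2"},
    {"g2", "g3"},
    {"g1", "g2"},
]
frequent = frequent_itemsets(conditions, min_support=3)
```

In this toy data the pair (g1, g2) appears in three of the four conditions, which is the kind of co-occurrence that would suggest the two genes are similarly expressed.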

Three potential projects are listed below. All projects require some knowledge of programming (e.g., Perl, Python, C/C++, Java, or MATLAB). Students should contact us for more details before making a final selection of a particular project, but should feel free to simply choose to work in Prof. Kumar's group if they are flexible with respect to project assignment.

Data Mining for Connecting SNPs and Disease
One of the important potential benefits of the genetic revolution is the possibility of personalized medicine, i.e., using detailed genomic information about a person for the detection, treatment, or prevention of disease. The recent availability of individual genomic information, typically in the form of Single Nucleotide Polymorphisms (SNPs), offers one route for making this possibility a reality. In particular, the increasing availability of SNP data has created opportunities for discovering important connections between disease and genomic factors. Although there has been some success in finding such connections with currently available techniques, these approaches have a number of limitations and are most useful for finding connections involving only one or two SNPs. This project will investigate the use of data mining techniques to find more general patterns that capture connections between SNPs and disease, including patterns that may involve a relatively large number of SNPs and patterns that show variation from patient to patient, either because of missing data or natural variation.

Connecting Cognitive Performance with Brain and Genetic Characteristics
The data available for this project comes from a study of normal adolescent brain development in 200 healthy subjects, age 9-23. The data produced by this study is very rich, consisting of MRI brain data, the results of neurocognitive tests, and Single Nucleotide Polymorphism (SNP) data. The MRI data contains information about various locations in the brain of a subject, but has been processed to generate volumes, thickness, and surface areas of specific neuroanatomical regions that will be the focus of the proposed analyses. The neurocognitive data consists of scores and times from a neurocognitive battery of tests that measure the performance of an individual on general intellectual ability, motor speed and coordination, recognition, attention and short-term memory, and executive function. The SNP data records, for each individual, the presence or absence of a set of SNPs that have been predetermined to have potential significance in adolescent brain development. However, the complexity and multi-modal nature of this data pose significant challenges for data analysis and data mining techniques. The goal of this project is to apply a variety of data mining and data analysis techniques to this data to find interesting relationships among the three sets of data that could serve as the basis for hypotheses for additional investigation.

Vipin Kumar and Judith Berman
Discovery of Transcription Modules from Gene Expression Data
The analysis of large microarray gene expression datasets holds out the promise of improvements in, for example, identification of the relationships among groups of genes within a specific genome, prediction of the functions of anonymous genes, construction of functional networks from these relationships, and differential analysis across genomes of related, but distinct organisms. Much of this task can be formulated as a problem of finding patterns in gene expression data. Using data mining, we plan to attack this problem in a systematic manner, thus ensuring the correctness and completeness of the results. This project will focus on a type of pattern known as transcription modules, which are mathematical models of transcription factors that can be represented as a submatrix of genes and conditions in the gene expression matrix. The resulting modules can be organized in a hierarchy that provides insight into the key functional components of the organism. Hierarchies of transcription modules from two different organisms can also be compared to provide additional biological insight. This project will focus on a variety of tasks related to better generating, using, and comparing hierarchies of transcription modules. The primary source of data will be gene expression data for S. cerevisiae and C. albicans. Although the project will focus on two species of yeast, the computational techniques will have direct application for gene expression data from any organism.

Nathan Springer

Understanding the mechanisms of intra-specific regulatory variation
The goal of this research project is to understand the molecular basis of phenotypic variation within a species. In other words, why do different strains or breeds of a species, e.g., a poodle and a pit bull, exhibit different phenotypes? This variation can be either quantitative (affecting the amount of a gene product) or qualitative (affecting the nature of the gene product). My lab uses allele-specific expression assays to study the prevalence and mechanisms of quantitative variation. We are interested in understanding how novel gene expression states arise and how they are maintained. We are gathering data on the relative expression of two alleles in a heterozygote for a set of 500 genes. The project would involve the construction of a database for data handling and analysis. In addition, the student would use bioinformatics tools to characterize the genes that are being assayed and would have opportunities for lab work.

Nevin Young

Assembling a genome sequence and discovering novel genes
The foundation for most bioinformatics research is genome sequence. Efficiently assembling a coherent genome sequence out of thousands of short "reads" remains a challenging informatics problem. Sequencing projects for many complex organisms are now underway, including a project at the University of Minnesota to sequence the model plant Medicago truncatula. In this project, we direct and manage sequencing data coming from several international sequencing centers and use that data to synthesize a reference genome sequence. Using the assembled genome sequence, we and our partners carry out gene annotation (the discovery of genes and gene features along the sequence) and present the sequence through a rich web interface. Of particular interest is the discovery of genes that have not been previously described. Potentially, these genes hold the key to novel biological functions. In addition to laboratory-based experiments to explore function, we take advantage of diverse and powerful software to reveal the properties of these novel genes.
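The core idea of assembling a sequence from overlapping reads can be illustrated with a toy greedy merger: repeatedly find the pair of reads with the longest suffix-to-prefix overlap and join them. This is a teaching sketch only. It ignores sequencing errors, repeats, and reverse complements, it is not how the Medicago truncatula project or production assemblers work, and the reads and the minimum overlap length are invented for the example.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b,
    or 0 if no such overlap is at least min_len long."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair of reads until no overlaps remain."""
    reads = list(reads)
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # remaining reads do not overlap; leave them as contigs
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three made-up reads that tile a 13-base sequence.
example_reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
contigs = greedy_assemble(example_reads)
```

The nested loop compares every read with every other read, which is fine for three reads but far too slow for thousands; that scaling problem is part of what makes real assembly challenging.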

Comparative Genomics of Plants and Disease Resistance Genes
An important problem in genomics and bioinformatics is tracing how the genomes of related species correspond to one another and how they have rearranged and diversified. This challenge requires an integrated approach to the analysis of heterogeneous genomic data. Our lab is working on the integration of biological and bioinformatic data in order to compare the evolution of multiple plant genomes, focusing on whole genome comparisons and complex clusters of rapidly evolving disease resistance genes. We are developing informatic tools to characterize these genes, especially the integration of separate software packages that visualize functional, chromosomal, and evolutionary data about gene families. This project relies on new data coming out of parallel biological research actively underway in our lab. The programming aspect of the project is mostly in Perl and Java, so some programming experience in one of these languages would be highly desirable. The outcome of this research will be new bioinformatic tools to analyze genes and genome evolution, as well as deeper insights into the mechanisms of plant disease resistance.