Faculty Projects for the 2008 Summer Bioinformatics Institute
We will be adding new projects as they become available.
John Carlis
Although relatively small for a database management system,
the data sets produced by gene expression microarray experiments
pose analytical and visualization challenges. Microarray data sets
comprise levels of gene expression. Of particular interest to biologists
are genes that are differentially expressed, that is, genes whose expression
levels are significantly altered up or down. However, biologically relevant factors in the form of
annotations are needed to make sense of the data. Gene Ontology (GO) is a
consortium that provides annotations using a consistent terminology for
genes and gene products. GO organizes biologically relevant factors
under three principles: cellular component, biological process, and molecular
function. A gene product can perform one or more molecular functions in
one or more biological processes. In addition, it may be active in one or
more cellular components. Several tools have been developed to facilitate
biologists' analysis of microarray gene expression data sets. However, they
do not adequately take into account the biologically relevant annotations
contained in GO.
The goals of this project are twofold: First, we intend to combine HIV
microarray gene expression experimental data and GO data in a relational
database to form a data set that will facilitate the execution of novel
analyses. Second, we intend to explore and develop novel visualization
techniques which will allow biologists to further analyze these combined
data sets.
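As a rough illustration of the first goal, the sketch below joins expression values to GO annotations in a relational database and counts differentially expressed genes per GO term. It is a minimal sketch only: the table names, column names, gene identifiers, term labels, and the fold-change threshold are all hypothetical placeholders, and the rows are made-up values rather than real measurements or real GO annotations.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database with toy expression and GO annotation tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE expression (gene_id TEXT, condition TEXT, log2_fold_change REAL);
        CREATE TABLE go_annotation (gene_id TEXT, go_aspect TEXT, go_term TEXT);
    """)
    # Placeholder rows: not real measurements and not real GO annotations.
    conn.executemany(
        "INSERT INTO expression VALUES (?, ?, ?)",
        [("geneA", "infected", 2.4), ("geneB", "infected", -1.8),
         ("geneC", "infected", 0.1), ("geneD", "infected", 1.6)],
    )
    conn.executemany(
        "INSERT INTO go_annotation VALUES (?, ?, ?)",
        [("geneA", "biological_process", "term_1"),
         ("geneB", "biological_process", "term_1"),
         ("geneC", "biological_process", "term_2"),
         ("geneD", "molecular_function", "term_3")],
    )
    return conn

def differential_genes_per_term(conn, threshold=1.0):
    """Count genes per GO term whose |log2 fold change| meets the threshold."""
    return conn.execute(
        """
        SELECT g.go_aspect, g.go_term, COUNT(DISTINCT e.gene_id) AS n_genes
        FROM expression AS e
        JOIN go_annotation AS g ON g.gene_id = e.gene_id
        WHERE ABS(e.log2_fold_change) >= ?
        GROUP BY g.go_aspect, g.go_term
        ORDER BY g.go_aspect, g.go_term
        """,
        (threshold,),
    ).fetchall()

if __name__ == "__main__":
    for row in differential_genes_per_term(build_demo_db()):
        print(row)
```

Keeping the expression values and the annotations in separate tables means either source can be reloaded independently, and the per-term summary stays a single join.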
John Crow
Semantic web technologies for
distributed research informatics
A highly visible area of bioinformatics involves the development of software
tools used directly by researchers to explore their data sets and to look up
specialized reference information. For the most part, these tools attempt to
link your data to existing information. But where does the underlying
information come from?
A distributed information model views its world as a community of autonomous
information providers and consumers; here, the software tool a
researcher uses is a consumer of information. Semantic web technologies
are useful in distributed information models. In this project we will
explore the use of semantic web technologies, specifically RDF, OWL,
and RDF query engines, in the creation of information utilities describing
human SNPs. Information providers and consumers will be created, and the
roles of ontologies, metadata, and queries examined. Due to the nature of this
effort, a good background in Java, Ruby, or Perl is required.
Kevin Dorfman
Graphical User Interface for Brownian
Dynamics Simulations of Polymers and Biomolecules
Brownian dynamics is a powerful method for simulating the motion of polymers
and biomolecules (such as DNA) as they move through complicated geometries,
such as a gel. Our group is interested in developing new methods for
separating DNA in very small scale structures, and Brownian dynamics
is one of the tools that we use to theoretically investigate possible
separation techniques. The goal of this project is to develop a graphical
user interface (GUI) that will allow users who are unfamiliar with the
simulation code to still take advantage of the method. The interface will
allow the user to construct the biomolecule and the surrounding environment,
input the force fields governing the motion, and then visualize the dynamical
results.
Required Skills: Familiarity with some structured
programming language.
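For readers new to the Brownian dynamics method described in the GUI project above, the following is a minimal sketch of one common formulation: an overdamped bead-spring chain advanced with a simple Euler update, where each bead moves under the spring forces plus a random thermal kick. It is not the group's simulation code. The function names, the Hookean springs, the two-dimensional free-draining setup, and all parameter values are assumptions chosen for illustration, and the sketch omits obstacles, electric fields, and hydrodynamic interactions.

```python
import math
import random

def spring_forces(positions, k_spring):
    """Hookean spring forces on each bead of a linear chain (list of [x, y])."""
    forces = [[0.0, 0.0] for _ in positions]
    for i in range(len(positions) - 1):
        for d in range(2):
            stretch = positions[i + 1][d] - positions[i][d]
            forces[i][d] += k_spring * stretch      # pulled toward the next bead
            forces[i + 1][d] -= k_spring * stretch  # equal and opposite reaction
    return forces

def brownian_step(positions, dt, rng, k_spring=1.0, friction=1.0, kT=1.0):
    """One Euler step: drift from forces plus Gaussian thermal displacement."""
    diffusion = kT / friction                    # Einstein relation D = kT / friction
    noise_scale = math.sqrt(2.0 * diffusion * dt)
    forces = spring_forces(positions, k_spring)
    return [
        [positions[i][d]
         + forces[i][d] / friction * dt
         + noise_scale * rng.gauss(0.0, 1.0)
         for d in range(2)]
        for i in range(len(positions))
    ]

def simulate_chain(n_beads=5, n_steps=1000, dt=0.01, seed=0):
    """Run a short trajectory starting from a straight chain along the x axis."""
    rng = random.Random(seed)
    positions = [[float(i), 0.0] for i in range(n_beads)]
    for _ in range(n_steps):
        positions = brownian_step(positions, dt, rng)
    return positions

if __name__ == "__main__":
    for bead in simulate_chain():
        print(bead)
```

A GUI of the kind proposed would essentially collect the inputs that appear here as function arguments (chain length, force parameters, time step) and plot the returned bead positions over time.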
Modeling Basement Membrane
The basement membrane forms the scaffolding for tissues in the body,
and failure of the basement membrane is associated with a number of
diseases. We are interested in developing simple computer models of
basement membrane based on known composition and various hypothesized
structures. These models will be used to perform simulated mechanical tests
on model membranes to predict the macroscopic mechanical behavior,
which can then be compared to experimental studies. The intern for
this project will be involved in the coding of Monte Carlo models of
basement membrane, running the code and evaluating the results. The
project is a collaboration between V. Barocas (Biomed Eng), K.
Dorfman (Chem Eng), and Y. Segal (Medicine); it is funded by NIH (R21
GM082823).
Required Skills: Familiarity with some structured
programming language. No prior knowledge of simulation methods is
required.
Implementation of ESPResSo for DNA
Electrophoresis
The simulation package ESPResSo
(http://espressowiki.mpip-mainz.mpg.de/wiki/index.php/Main_Page) allows
one to perform molecular dynamics simulations of bead-spring models of
polymers. Most notably, it permits the use of advanced methods for
implementing hydrodynamic interactions between the polymer and the
fluid and long-range electrostatic interactions. We are interested
in using this package to complement the Brownian dynamics method
discussed above. The goal of this project will be to understand how
this simulation package operates and develop scripts that will allow
us to utilize it for simulating DNA electrophoresis.
Required Skills: Familiarity with some structured
programming language.
Lynda Ellis
Encoding Metabolic Logic
Prediction of microbial metabolism is important
for annotating genome sequences and for understanding the fate of
chemicals in the environment.
A metabolic Pathway Prediction System has been developed that is freely
available on the world wide web (http://umbbd.msi.umn.edu/predict/).
It recognizes the organic functional groups found in a compound and
predicts transformations based on metabolic rules. These rules are based
on reactions catalogued in the University of Minnesota
Biocatalysis/Biodegradation Database (UM-BBD). The rule-based nature of
the Pathway Prediction System makes it transparent, expandable, and
adaptable. Join with us to expand the UM-BBD and its predictive system;
learn metabolic logic, and user interface design. Requires knowledge of
college-level organic chemistry and computer programming (Java and/or
Perl).
Yiannis Kaznessis
Design of genetic regulatory
networks
We want to learn to command cells to make specific proteins. These
proteins can be catalysts, synthesizing specialty chemicals like
pharmaceuticals, or sensor proteins for biological weapons like
anthrax, or therapeutic proteins like insulin. To do this we need
to understand dynamic gene regulation (when and how DNA gives protein)
and to design gene networks (genes influencing the expression of other
genes) that perform the tasks at hand, in response to our signals. We
have written code that simulates gene networks and can be used to
design novel gene circuits, such as an oscillator or a digital clock.
The student's challenge would be to construct interesting
designs of genetic circuits that perform specific tasks. Two recent
examples of successful designs have been a switch and an oscillator.
Future applications of genetic circuits might include biosensors,
targeted drug delivery, molecular machines, and biochemical
factories.
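To give a feel for what simulating gene expression involves, here is a minimal sketch of a stochastic simulation using the Gillespie algorithm for the simplest possible case: one protein produced at a constant rate and degraded at a rate proportional to its copy number. This is an illustrative stand-in, not the code mentioned above, whose method the description does not specify. The function name, the rate constants, and the single-species model are assumptions.

```python
import random

def gillespie_birth_death(k_make, k_degrade, t_end, seed=0, n0=0):
    """Simulate protein copy number under constant production and
    first-order degradation; returns (times, counts)."""
    rng = random.Random(seed)
    t, n = 0.0, n0
    times, counts = [t], [n]
    while True:
        a_make = k_make               # propensity of the production reaction
        a_degrade = k_degrade * n     # propensity of the degradation reaction
        a_total = a_make + a_degrade
        if a_total <= 0.0:
            break                     # nothing can happen any more
        t += rng.expovariate(a_total) # exponentially distributed waiting time
        if t > t_end:
            break
        # Choose which reaction fires, with probability proportional to propensity.
        if rng.random() * a_total < a_make:
            n += 1
        else:
            n -= 1
        times.append(t)
        counts.append(n)
    return times, counts

if __name__ == "__main__":
    times, counts = gillespie_birth_death(k_make=10.0, k_degrade=1.0, t_end=20.0)
    print("events:", len(times) - 1, "final copy number:", counts[-1])
```

A gene network adds more species and reactions (for example, a repressor that lowers another gene's production propensity), but the loop structure stays the same: compute propensities, draw a waiting time, pick a reaction, update counts.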
Design of Antimicrobial Peptides
Antimicrobial peptides (AMPs), molecules produced by the immune systems of
animals and plants, are being considered as potential novel antibiotic
candidates to combat emerging drug-resistant bacterial strains. The
peptides are known to kill bacterial cells by direct membrane attack.
Most known AMPs are also toxic, killing mammalian cells, again by
direct membrane attack. The mechanism of action of these peptides is
not yet clear. We work towards designing and implementing computational
solutions to fill the void. We use molecular dynamics simulations of
peptides in mammalian and bacterial model membranes to determine the
structural characteristics responsible for activity and toxicity. We
use this knowledge in designing new peptides that retain their
antimicrobial activity but are not toxic.
Develop Wikigene
Using information stored in SQL, the goal of this project is to develop
Wikigene as a community-wide effort to catalogue the thermodynamic and
kinetic constants of gene regulatory interactions. Working knowledge of
SQL, HTML, and Java and an understanding of wikis are necessary.
Synthetic Biology
The University of Minnesota Bioinformatics Summer Institute will
participate in the Summer 2008 International Genetically Engineered
Machine Competition (www.igem2008.com). iGEM is an undergraduate
Synthetic Biology competition. Student teams are given a kit of
biological parts at the beginning of the summer. Working at their own
schools over the summer, they use these parts and new parts of their
own design to build biological systems and operate them in living
cells. During the first weekend of November, they present their work
at the iGEM Competition Jamboree at MIT and have a chance to win
prizes. They add their new parts to the Registry of Standard
Biological Parts for the students in the next year's competition.
Vipin Kumar
Data mining is useful for discovering interesting patterns
in large data sets. These patterns can be formulated as rules, clusters, or
sets of items that frequently occur together. Although we have applied data
mining to various areas, recently we have been engaged in a number of
projects that involve molecular biology, medicine, or both. For instance,
we have used the patterns discovered by data mining to identify groups of
genes that are similarly expressed under a specific set of conditions and
to discover connections between genetic variation and disease.
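As a small illustration of one pattern type named above, sets of items that frequently occur together, the sketch below counts how often each pair of items co-occurs across a collection of records and keeps the pairs that meet a minimum support. It uses brute-force enumeration rather than an optimized algorithm such as Apriori, and the function name and example gene labels are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, size=2):
    """Return {itemset: count} for itemsets of the given size that appear
    in at least min_support transactions."""
    counts = Counter()
    for transaction in transactions:
        # Sort so that each itemset has one canonical tuple form.
        for itemset in combinations(sorted(set(transaction)), size):
            counts[itemset] += 1
    return {itemset: c for itemset, c in counts.items() if c >= min_support}

if __name__ == "__main__":
    # Each record lists genes observed as "highly expressed" in one sample
    # (placeholder labels, not real data).
    samples = [
        {"g1", "g2", "g3"},
        {"g1", "g2"},
        {"g2", "g3"},
        {"g1", "g2", "g4"},
    ]
    print(frequent_itemsets(samples, min_support=3))
```

Enumerating every combination becomes expensive as records grow, which is one reason practical frequent-pattern miners prune candidates instead of counting them all.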
Three potential projects are listed below. All projects require some
knowledge of programming (e.g., Perl, Python, C/C++, Java, or MATLAB).
Students should contact us for more details before making a final selection
of a particular project, but should feel free to simply choose to work in
Prof. Kumar's group if they are flexible with respect to project
assignment.
Data Mining for Connecting SNPs and Disease
One of the important potential benefits of the genetic revolution is the
possibility of personalized medicine, i.e., using detailed genomic
information about a person for the detection, treatment, or prevention
of disease. The recent availability of individual genomic information,
typically in the form of Single Nucleotide Polymorphisms (SNPs), offers
one route for making this possibility a reality. In particular, the
increasing availability of SNP data has created opportunities for
discovering important connections between disease and genomic factors.
Although there has been some success in finding such connections with
currently available techniques, these approaches have a number of
limitations and are most useful for finding connections involving only
one or two SNPs. This project will investigate the use of data mining
techniques to find more general patterns that capture connections between
SNPs and disease, including patterns that may involve a relatively large
number of SNPs and patterns that show variation from patient to patient,
either because of missing data or natural variation.
Connecting Cognitive Performance with Brain and Genetic
Characteristics
The data available for this project comes from a study of normal adolescent
brain development in 200 healthy subjects, ages 9-23. The data produced by
this study is very rich, consisting of MRI brain data, the results of
neurocognitive tests, and Single Nucleotide Polymorphism (SNP) data. The
MRI data contains information about various locations in the brain of a
subject, but has been processed to generate volumes, thickness, and surface
areas of specific neuroanatomical regions that will be the focus of the
proposed analyses. The neurocognitive data consists of scores and times
from a neurocognitive battery of tests that measure the performance of
an individual on general intellectual ability, motor speed and coordination,
recognition, attention and short-term memory, and executive function. The
SNP data records, for each individual, the presence or absence of a set of
SNPs that have been predetermined to have potential significance in
adolescent brain development. However, the complexity and multi-modal
nature of this data pose significant challenges for data analysis and
data mining techniques. The goal of this project is to apply a variety
of data mining and data analysis techniques to this data to find interesting
relationships among the three sets of data that could serve as the basis for
hypotheses for additional investigation.
Vipin Kumar and Judith Berman
Discovery of Transcription Modules from Gene Expression
Data
The analysis of large microarray gene expression datasets holds out the
promise of improvements in, for example, identification of the relationships
among groups of genes within a specific genome, prediction of the functions
of anonymous genes, construction of functional networks from these
relationships, and differential analysis across genomes of related, but
distinct organisms. Much of this task can be formulated as a problem of
finding patterns in gene expression data. Using data mining, we plan to
attack this problem in a systematic manner, thus ensuring the correctness
and completeness of the results. This project will focus on a type of
pattern known as transcription modules, which are mathematical models of
transcription factors that can be represented as a submatrix of genes
and conditions in the gene expression matrix. The resulting modules can
be organized in a hierarchy that provides insight into the key functional
components of the organism. Hierarchies of transcription modules from two
different organisms can also be compared to provide additional biological
insight. This project will focus on a variety of tasks related to better
generating, using, and comparing hierarchies of transcription modules.
The primary source of data will be gene expression data for S. cerevisiae
and C. albicans. Although the project will focus on these two species of yeast,
the computational techniques will have direct application for gene expression
data from any organism.
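To make the idea of a module as a submatrix concrete, the sketch below pulls the rows (genes) and columns (conditions) belonging to a candidate module out of a small expression matrix and summarizes it with a mean. It is only an illustration of the representation, not of how modules are discovered. The function names, gene and condition labels, and numeric values are all hypothetical.

```python
def extract_module(matrix, gene_names, condition_names,
                   module_genes, module_conditions):
    """Return the submatrix of `matrix` restricted to the module's genes
    (rows) and conditions (columns), in the order given."""
    rows = [gene_names.index(g) for g in module_genes]
    cols = [condition_names.index(c) for c in module_conditions]
    return [[matrix[r][c] for c in cols] for r in rows]

def mean_expression(submatrix):
    """Average of all entries in a non-empty submatrix."""
    values = [v for row in submatrix for v in row]
    return sum(values) / len(values)

if __name__ == "__main__":
    genes = ["g1", "g2", "g3"]
    conditions = ["c1", "c2", "c3"]
    expression = [            # placeholder values, rows follow `genes`
        [1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0],
        [7.0, 8.0, 9.0],
    ]
    module = extract_module(expression, genes, conditions,
                            ["g1", "g3"], ["c2", "c3"])
    print(module, mean_expression(module))
```

Because a module is just a pair of index sets, comparing or nesting modules reduces to set operations on their gene and condition lists.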
Nathan Springer
Understanding the mechanisms of
intra-specific regulatory variation
The goal of this research project is to understand the
molecular basis of phenotypic variation within a species. In other
words, why do different strains or breeds of a species, e.g., a poodle
and a pit bull, exhibit different phenotypes? This variation can be
either quantitative (affecting the amount of a gene product) or
qualitative (affecting the nature of the gene product). My lab uses
allele-specific expression assays to study the prevalence and
mechanisms of quantitative variation. We are interested in
understanding how novel gene expression states arise and how they are
maintained. My lab studies the mechanisms that lead to
quantitative variation. We are gathering data on the relative
expression of two alleles in a heterozygote for a set of 500 genes.
The project would involve the construction of a database for data
handling and analysis. In addition, the student would use
bioinformatics tools to characterize the genes that are being assayed
and would have opportunities for lab work.
Nevin Young
Assembling a genome sequence and
discovering novel genes
The foundation for most bioinformatics research is genome sequence.
Efficiently assembling a coherent genome sequence out of thousands of
short "reads" remains a challenging informatics problem. Sequencing
projects for many complex organisms are now underway, including a
sequencing project at the University of Minnesota for the model plant
Medicago truncatula. In this project, we direct and manage sequencing data coming
from several international sequencing centers and use that data to
synthesize a reference genome sequence. Using the assembled genome
sequence, our partners and we carry out gene annotation (the
discovery of genes and gene features along the sequence) and
present the sequence through a rich web interface. Of particular
interest is the discovery of genes that have not been previously described.
Potentially, these genes hold the key to novel biological functions.
In addition to laboratory-based experiments to explore function, we take
advantage of diverse and powerful software to reveal the properties of
these novel genes.
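The assembly problem mentioned above can be illustrated with a toy greedy approach: repeatedly find the two sequences with the longest suffix-to-prefix overlap and merge them. This is a minimal sketch for intuition only. Real assemblers handle sequencing errors, repeats, reverse complements, and far larger data volumes, none of which are addressed here, and the function names, minimum overlap, and example reads are assumptions.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`,
    or 0 if no such overlap has at least min_len characters."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def greedy_assemble(reads, min_len=3):
    """Merge reads pairwise by largest overlap until no overlaps remain."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    olen = overlap(a, b, min_len)
                    if olen > best_len:
                        best_len, best_i, best_j = olen, i, j
        if best_len == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]

if __name__ == "__main__":
    print(greedy_assemble(["ATGGCG", "GCGTAC", "TACGGA"]))
```

Greedy merging can make wrong joins when a genome contains repeated regions, which is part of why assembly remains a hard informatics problem.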
Comparative Genomics of Plants and
Disease Resistance Genes
An important problem in genomics and bioinformatics is tracing how
the genomes of related species correspond to one another and how
they have rearranged and diversified. This challenge requires an
integrated approach to the analysis of heterogeneous genomic data.
Our lab is working on the integration of biological and bioinformatic
data in order to compare the evolution of multiple plant genomes,
focusing on whole genome comparisons and complex clusters of
rapidly-evolving disease resistance genes. We are developing
informatic tools to characterize these genes, especially the
integration of separate software packages that visualize functional,
chromosomal, and evolutionary data about gene families. This project
relies on new data coming out of parallel biological research actively
underway in our lab. The programming aspect of the project is mostly
in Perl and Java, so some programming experience in one of these
languages would be highly desirable. The outcome of this research
will be new bioinformatic tools to analyze genes and genome evolution,
as well as deeper insights into the mechanisms of plant disease
resistance.