Abstract:
Genomics is rich with computational problems where algorithms and
statistical methods can have a big impact on data analysis and biological
discovery. Here, I will present three such problems.
1. Gene Finding. Given a sequenced genome, the first task is to find the
genes. This core bioinformatics problem is still largely open. The set of
human genes, for example, has not been finalized. Here, I will present
CONTRAST, a gene finder based on a conditional random field (CRF) and support
vector machine (SVM) approach. It is the first tool to show a significant
improvement in human gene finding by using multiple sequence alignments as
informants.
2. Network Alignment. Protein association networks summarize our knowledge
of which proteins work together in modules and networks to accomplish
complex biological processes. Many global protein interaction networks have
been predicted for organisms ranging from bacteria to human. Here, I will
present Graemlin, a system for comparing networks across organisms and
finding conserved modules, that is, subgraphs of conserved proteins and their
associations.
3. Ancestral Population Inference. Projects like HapMap provide whole-genome
genotypes for diverse populations. Given a genotyped individual, we can use
such datasets to predict the allele-specific population source of the
individual's chromosomes. I will present HAPAA, a tool for accomplishing
this task. Then, I will show that ancestry inference can accurately recover
the source populations of admixture events that occurred as far back as 20
generations ago, covering much of the modern history of population movements.