Technical Report PHD-2010-06

Title: Computational Methods for Metagenomic Analysis
Authors: Itai Sharon
Supervisors: Oded Beja and Ron Pinter
Abstract: Metagenomics is a new field in which genetic material is extracted directly from the environment and subsequently analyzed by a variety of biological and computational methods. Using metagenomics, it is possible to study the vast majority of microbes on earth, of which more than 99% cannot be cultured in the laboratory. Metagenomic data usually consist of many short (100-1,000 bp) DNA sequences, potentially originating from all organisms in the examined environment. Several computational challenges arise as a result. Some of them are known from genomics (e.g. DNA assembly, gene prediction and functional analysis), while others are unique to metagenomics (e.g. sequence binning, in which we try to assign sequences to taxonomic bins). Many metagenomic projects have been carried out in recent years, and they have broadened our understanding of biological processes in ways that were previously impossible. Ongoing and new projects, such as the Global Ocean Sampling (GOS) expedition, promise that the flow of discoveries will increase in the coming years.

In my PhD I chose to focus on two aspects of metagenomic analysis: (i) the statistics of functional analysis of metagenomes, and (ii) the study of genes and gene organizations from metagenomic data. The viewpoint of the first part is global: given a metagenome, we are interested in studying functional characteristics of the organisms living in the examined environment, which may hint at the conditions most important in that environment. Based on the Lander-Waterman model for whole-genome shotgun sequencing projects, we were able to provide a statistical model that accurately estimates the expected number of sequences containing some part of a gene in a metagenome. The model was tested on both simulated and real data, and was shown to provide estimates that are in line with the real values. The statistics of pathways are also discussed: in this case a different model was required, one that takes into account the possibility of genes that participate in more than one pathway.
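
As a rough illustration of the kind of estimate involved, the sketch below computes the expected number of reads that overlap a gene under simple Lander-Waterman-style assumptions. It is not the model developed in the thesis. It assumes a single linear genome, reads of one fixed length whose start positions are uniformly distributed, and a gene located away from the genome ends. The function name, the parameter names and the min_overlap threshold are all hypothetical choices made for this example.

```python
def expected_gene_reads(num_reads, read_len, gene_len, genome_len, min_overlap=1):
    """Expected number of reads sharing at least `min_overlap` bases with a gene.

    Illustrative assumptions (not taken from the thesis): one linear genome,
    fixed read length, uniformly distributed read starts, and a gene that
    lies far enough from the genome ends that edge effects can be ignored.
    """
    if read_len > genome_len or gene_len > genome_len:
        raise ValueError("read and gene must fit inside the genome")
    if not 1 <= min_overlap <= min(read_len, gene_len):
        raise ValueError("min_overlap must be between 1 and min(read_len, gene_len)")

    # Number of positions at which a read can start on the genome.
    start_positions = genome_len - read_len + 1

    # A read starting at r overlaps a gene starting at s by at least
    # min_overlap bases when s + min_overlap - read_len <= r <= s + gene_len - min_overlap.
    covering_starts = gene_len + read_len - 2 * min_overlap + 1
    covering_starts = min(covering_starts, start_positions)

    # Each read hits the gene independently with the same probability,
    # so the expectation is simply num_reads times that probability.
    return num_reads * covering_starts / start_positions


if __name__ == "__main__":
    # Hypothetical numbers: 10,000 reads of 500 bp, a 1,000 bp gene,
    # a 2 Mbp genome, and a required overlap of 50 bp.
    print(expected_gene_reads(10_000, 500, 1_000, 2_000_000, min_overlap=50))
```

A metagenome mixes many genomes at different abundances, so a realistic estimate would need to weight each genome by its share of the sequenced DNA. This sketch leaves that out.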

The second part of this work takes a "local" view: rather than looking at microbial communities in general, we are interested in answering specific questions about specific genes or systems. This part begins with a description of our discovery of Photosystem-I (PSI) gene cassettes on viral genomes. Using metagenomic data from the Global Ocean Sampling (GOS) expedition and the Northern Line Islands, we were able to show that a gene cassette of eight PSI genes, potentially sufficient for coding all the proteins necessary for a fully functional PSI, is present on DNA sequences of viral origin. In this work we used several computational tools that I developed, some of them novel to this work and others also used in other works. I will also describe a generalization of this work in which we were able to discover microbial genes on viral genomes in general, using existing and novel methods and strategies.

Copyright: The above paper is copyright by the Technion, the author(s), or others. Please contact the author(s) for more information.

Remark: Any link to this technical report should be to this page, rather than to the URLs of the PDF files directly. The latter URLs may change without notice.

Computer science department, Technion