14 Apr 2008
New high-throughput sequencing technologies such as Roche 454 and Solexa make it realistic to expect that the genomes of all bacterial pathogens will be sequenced. A high level of automation and a significant reduction in cost allow these technologies to be used routinely for the diagnosis and monitoring of pathogens in the field. New ‘laboratory-on-a-chip’ (LOC) technologies are expected soon and are likely to revolutionize the study and monitoring of environmental microflora. Progress in the development of these technologies challenges bioinformaticians to provide more powerful approaches for large-scale, simultaneous analysis of multiple short DNA reads in order to identify and monitor species of interest. We focused on developing computer-based algorithms to address the problems of clustering and identifying environmental sequences generated by modern high-throughput sequencers. We developed an algorithm for self-organizing hierarchical clustering of multiple DNA reads originating from different bacterial species. The oligonucleotide compositional bias of the environmental sequences was used as a genomic signature to cluster and identify the DNA fragments. The program scales to large datasets (up to 10,000 reads). It processed 3,500 reads in 40 min, and the total analysis time depended almost linearly on the number of sequences analyzed. The sequences were clustered in accordance with the phylogeny of the microorganisms from which they derived. In parallel, a database of oligonucleotide signatures of 8 to 14 bp, calculated for all sequenced genomes, was developed. The apparent redundancy of signature oligos is important for the identification of short metagenomic reads. The discovery of unique oligos and of patterns of infrequent oligos allows the development of a tool to search for the most appropriate DNA probes for diagnostic chips.
Oleg Reva
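
The idea of using oligonucleotide composition as a genomic signature can be illustrated with a minimal Python sketch. This is not the program described in the abstract: the word length (k = 4 by default), the Euclidean distance, and the function names kmer_signature, signature_distance and nearest_reference are all assumptions chosen for illustration. The sketch only shows how a read might be turned into a vector of relative k-mer frequencies and compared with reference sequences.

```python
from collections import Counter
from itertools import product
import math


def kmer_signature(seq, k=4):
    """Relative frequency of every A/C/G/T word of length k in seq.

    Words containing other symbols (for example N) are ignored.
    The vector follows the lexicographic order of the words.
    """
    seq = seq.upper()
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[w] for w in words)
    if total == 0:
        # Sequence too short or without valid words: return a zero vector.
        return [0.0] * len(words)
    return [counts[w] / total for w in words]


def signature_distance(a, b):
    """Euclidean distance between two signature vectors (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_reference(read, references, k=4):
    """Name of the reference whose signature is closest to that of the read.

    references is a dict mapping a name to a DNA sequence.
    """
    read_sig = kmer_signature(read, k)
    return min(
        references,
        key=lambda name: signature_distance(
            read_sig, kmer_signature(references[name], k)
        ),
    )
```

A call such as nearest_reference(read, {"genome_a": seq_a, "genome_b": seq_b}) returns the name of the closest reference. Hierarchical clustering of many reads could be built on the same pairwise distances, although short reads give noisy frequency estimates at larger k.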