Taxonomic Classification of Bacteria Using Common Substrings

Mehmet Can, Osman Gürsoy


For the taxonomic classification of microbes, 16S ribosomal RNA (rRNA) gene sequences are widely used in environmental microbiology as reliable markers. Massive sequencing of 16S rRNA gene amplicons that span the full length of the gene remains difficult because of the limitations of current sequencing techniques; nevertheless, millions of rRNA gene sequences have been uploaded to the Greengenes, RDP, and SILVA databases. In this research, a new similarity measure for full-length genes, LCSS, is first defined. Under this measure, sequences reported for the same bacterial species show an average similarity of around 53% in the Greengenes and SILVA databases, whereas genes reported for different bacterial species show an average similarity of only around 15%. At the genus level, the corresponding figures are 63% and 20% across the three databases Greengenes, RDP, and SILVA. Hence, species-specific and genus-specific sequences constitute useful targets for diagnostic assays and other scientific investigations. The built-in function LongestCommonSubsequence of the computer algebra package Mathematica is applied repeatedly to create an in silico pipeline for the taxonomic classification of newly uploaded full-length sequences. Our results suggest that LongestCommonSubsequence similarity can be used for the taxonomic classification of unknown bacteria through their full 16S rRNA gene sequences.
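The abstract does not state the formula behind the LCSS measure, so the following Python sketch is only an illustration under stated assumptions, not the authors' pipeline. It relies on the fact that Mathematica's LongestCommonSubsequence returns the longest contiguous run shared by two sequences, which matches the "common substrings" of the title. The scoring rule (length of the longest shared run divided by the length of the shorter sequence), the function names, and the nearest-taxon rule in `classify` are all hypothetical.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return one longest contiguous substring shared by a and b."""
    best_len = 0
    best_end = 0  # exclusive end index of the best match inside a
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return a[best_end - best_len:best_end]


def lcs_similarity(a: str, b: str) -> float:
    """Hypothetical score: longest shared run / length of the shorter sequence."""
    if not a or not b:
        return 0.0
    return len(longest_common_substring(a, b)) / min(len(a), len(b))


def classify(query: str, references: dict) -> str:
    """Assumed rule: pick the taxon with the highest mean similarity to the query.

    `references` maps a taxon name to a list of reference sequences.
    """
    def mean_score(seqs):
        if not seqs:
            return 0.0
        return sum(lcs_similarity(query, s) for s in seqs) / len(seqs)

    return max(references, key=lambda taxon: mean_score(references[taxon]))
```

With toy references such as `{"genusA": ["ACGTACGT"], "genusB": ["TTTTGGGG"]}`, the query `"ACGTACGA"` is assigned to `genusA`, because it shares a seven-base run with that reference and only a single base with the other. Real 16S sequences are roughly 1,500 bases long, so this quadratic-time loop serves only to show the idea.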


Keywords: 16S rRNA gene; LongestCommonSubsequence; Taxonomic classification


Copyright (c) 2019 Mehmet Can, Osman Gürsoy

ISSN 2233-1859

Digital Object Identifier DOI: 10.21533/scjournal

This work is licensed under a Creative Commons Attribution 4.0 International License