Outlier Detection in BLAST Hits

Authors Nidhi Shah, Stephen F. Altschul, Mihai Pop

Cite As

Nidhi Shah, Stephen F. Altschul, and Mihai Pop. Outlier Detection in BLAST Hits. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 23:1-23:11, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)


Abstract

An important task in metagenomic analysis is the assignment of taxonomic labels to the sequences in a sample. The most widely used methods for taxonomy assignment compare each sequence in the sample to a database of known sequences, and many of them use the best BLAST hit(s) to assign the taxonomic label. However, the best BLAST hit does not always correspond to the best taxonomic match. An alternative is to use phylogenetic methods, which take into account alignments and a model of evolution in order to define the taxonomic origin of sequences more accurately. Similarity-search-based methods typically run faster than phylogenetic methods and work well when the organisms in the sample are well represented in the database. Phylogenetic methods, on the other hand, can identify new organisms in a sample but are computationally expensive. We propose a two-step approach to metagenomic taxon identification: first, use a rapid method that accurately classifies sequences against a reference database (a filtering step); then, use a more complex phylogenetic method for the sequences left unclassified by the first step. In this work, we explore whether and when using the top BLAST hit(s) yields a correct taxonomic label. We develop a method to detect outliers among BLAST hits in order to separate the phylogenetically most closely related matches from matches to sequences from more distantly related organisms. We use modified BILD (Bayesian Integral Log Odds) scores, a multiple-alignment scoring function, to define the outliers within a subset of top BLAST hits and to assign taxonomic labels. We compare the accuracy of our method to that of the RDP classifier and show that our method yields fewer misclassifications while properly handling sequences from organisms that are not present in the database.

Finally, we evaluate our method as a pre-processing step before a more expensive phylogenetic analysis (in our case, TIPP) on real 16S rRNA datasets. Our experiments demonstrate the potential of our method as a filtering step before phylogenetic methods are applied.
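The control flow of the two-step approach described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' method: the paper scores outliers with modified BILD scores over a multiple alignment, whereas this sketch swaps in a much simpler largest-gap heuristic on BLAST bit scores. The names `split_outliers`, `common_lineage`, `classify`, the `min_gap` threshold, and the hit tuple layout are all hypothetical.

```python
from typing import List, Optional, Tuple

# Hypothetical hit record: (subject id, BLAST bit score, lineage from root to leaf).
Hit = Tuple[str, float, Tuple[str, ...]]


def split_outliers(hits: List[Hit], min_gap: float = 20.0) -> List[Hit]:
    """Return the hits above the largest drop in bit score.

    Stand-in for the paper's BILD-based outlier test (assumption): if no drop
    between consecutive ranked hits reaches min_gap, no outlier group is
    reported and an empty list is returned.
    """
    ranked = sorted(hits, key=lambda h: h[1], reverse=True)
    best_gap, cut = 0.0, 0
    for i in range(len(ranked) - 1):
        gap = ranked[i][1] - ranked[i + 1][1]
        if gap > best_gap:
            best_gap, cut = gap, i + 1
    if best_gap < min_gap:
        return []
    return ranked[:cut]


def common_lineage(hits: List[Hit]) -> Tuple[str, ...]:
    """Longest lineage prefix shared by every hit in the group."""
    prefix = []
    for ranks in zip(*(h[2] for h in hits)):
        if len(set(ranks)) != 1:
            break
        prefix.append(ranks[0])
    return tuple(prefix)


def classify(hits: List[Hit], min_gap: float = 20.0) -> Optional[Tuple[str, ...]]:
    """Step 1 of the two-step scheme: label the query from its outlier hits.

    Returns None when no outlier group is found, meaning the query should be
    passed on to the slower phylogenetic step (TIPP in the paper).
    """
    outliers = split_outliers(hits, min_gap)
    if not outliers:
        return None
    return common_lineage(outliers)


if __name__ == "__main__":
    example = [
        ("s1", 500.0, ("Bacteria", "Firmicutes", "Bacillus")),
        ("s2", 498.0, ("Bacteria", "Firmicutes", "Bacillus")),
        ("s3", 430.0, ("Bacteria", "Proteobacteria", "Escherichia")),
    ]
    print(classify(example))
```

In this sketch, a query whose top hits stand clearly apart from the rest is labeled with the lineage those hits share, and any other query is returned as `None` so that it can be routed to the phylogenetic method.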
Keywords

  • Taxonomy classification
  • Metagenomics
  • Sequence alignment
  • Outlier detection

