Byte-Aligned Pattern Matching in Encoded Genomic Sequences

Authors Petr Procházka, Jan Holub



PDF
Thumbnail PDF

File

LIPIcs.WABI.2017.20.pdf
  • Filesize: 0.63 MB
  • 13 pages

Document Identifiers

Author Details

Petr Procházka
Jan Holub

Cite As Get BibTex

Petr Procházka and Jan Holub. Byte-Aligned Pattern Matching in Encoded Genomic Sequences. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 20:1-20:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017) https://doi.org/10.4230/LIPIcs.WABI.2017.20

Abstract

In this article, we propose a novel pattern matching algorithm, called BAPM, that performs searching in the encoded genomic sequences. The algorithm works at the level of single bytes and it achieves sublinear performance on average. The preprocessing phase of the algorithm is linear with respect to the size of the searched pattern m. A simple O(m)-space data structure is used to store all factors (with a defined length) of the searched pattern. These factors are later searched during the searching phase which ensures sublinear time on average. Our algorithm significantly overcomes the state-of-the-art pattern matching algorithms in the locate time on middle and long patterns. Furthermore, it is able to cooperate very easily with the block q-gram inverted index. The block q-gram inverted index together with our pattern matching algorithm achieve superior results in terms of locate time to the current index data structures for less frequent patterns. We present experimental results using real genomic data. These results prove efficiency of our algorithm.

Subject Classification

Keywords
  • genomic sequences
  • pattern matching
  • q-gram inverted index

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Ricardo Baeza-Yates and Gaston H. Gonnet. A new approach to text searching. Commun. ACM, 35(10):74-82, October 1992. Google Scholar
  2. Ricardo A. Baeza-yates. Text retrieval: Theory and practice. In In 12th IFIP World Computer Congress, volume I, pages 465-476. Elsevier Science, 1992. Google Scholar
  3. Robert S. Boyer and J. Strother Moore. A fast string searching algorithm. Commun. ACM, 20(10):762-772, October 1977. Google Scholar
  4. Maxime Crochemore and Wojciech Rytter. Text Algorithms. Oxford University Press, Inc., New York, NY, USA, 1994. Google Scholar
  5. B. Dömölki. An algorithm for syntactical analysis. Computational Linguistics, 3:29-46, 1964. Hungarian Academy of Science, Budapest. Google Scholar
  6. Simone Faro and M. Oğuzhan Külekci. Fast and flexible packed string matching. J. of Discrete Algorithms, 28(C):61-72, September 2014. Google Scholar
  7. Simone Faro and Thierry Lecroq. A fast suffix automata based algorithm for exact online string matching. In Nelma Moreira and Rogério Reis, editors, Implementation and Application of Automata: 17th International Conference, CIAA 2012, Porto, Portugal, July 17-20, 2012. Proceedings, pages 149-158. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012. Google Scholar
  8. P. Ferragina and G. Manzini. Opportunistic data structures with applications. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science, FOCS'00, pages 390-398, Washington, DC, USA, 2000. IEEE Computer Society. Google Scholar
  9. Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. In Proc. of the Thirty-second Annual ACM Symposium on Theory of Computing, pages 397-406, New York, NY, USA, 2000. Google Scholar
  10. R. Nigel Horspool. Practical fast searching in strings. Software: Practice and Experience, 10(6):501-506, 1980. Google Scholar
  11. Donald E. Knuth, James H. Morris, and Vaughan R. Pratt. Fast Pattern Matching in Strings. SIAM Journal on Computing, 6(2):323-350, March 1977. Google Scholar
  12. Sebastian Kreft and Gonzalo Navarro. LZ77-like compression with fast random access. In Proceedings of the 2010 Data Compression Conference, DCC'10, pages 239-248, Washington, DC, USA, 2010. IEEE Computer Society. Google Scholar
  13. Udi Manber. A text compression scheme that allows fast searching directly in the compressed file. ACM Trans. Inf. Syst., 15(2):124-136, April 1997. Google Scholar
  14. Udi Manber and Gene Myers. Suffix arrays: A new method for on-line string searches. In Proceedings of the First Annual ACM-SIAM Symposium on Disc. Algorithms, SODA'90, pages 319-327, Philadelphia, USA, 1990. Society for Industrial and Applied Mathematics. Google Scholar
  15. Udi Manber and Sun Wu. Glimpse: A tool to search through entire file systems. In Proceedings of the USENIX Winter 1994 Technical Conference on USENIX Winter 1994 Technical Conference, WTEC'94, page 4, Berkeley, CA, USA, 1994. USENIX Association. Google Scholar
  16. Gonzalo Navarro, Edleno Silva de Moura, Marden S. Neubert, Nivio Ziviani, and Ricardo A. Baeza-Yates. Adding compression to block addressing inverted indexes. Inf. Retr., 3(1):49-77, 2000. URL: http://dx.doi.org/10.1023/A:1009934302807.
  17. Gonzalo Navarro and Mathieu Raffinot. A bit-parallel approach to suffix automata: Fast extended string matching. In Proceedings of the 9th Annual Symposium on Combinatorial Pattern Matching, CPM'98, pages 14-33, London, UK, UK, 1998. Springer-Verlag. Google Scholar
  18. Simon J. Puglisi, W. F. Smyth, and Andrew Turpin. Inverted files versus suffix arrays for locating patterns in primary memory. In Fabio Crestani, Paolo Ferragina, and Mark Sanderson, editors, String Processing and Information Retrieval: 13th International Conference, SPIRE 2006, Glasgow, UK, October 11-13, 2006. Proceedings, pages 122-133. Springer Berlin Heidelberg, Berlin, Heidelberg, 2006. Google Scholar
  19. J. Sirén, N. Välimäki, V. Mäkinen, and G. Navarro. Run-length compressed indexes are superior for highly repetitive sequence collections. In Proc. of the 15th International Symposium on String Processing and Information Retrieval, SPIRE'08, pages 164-175, Berlin, Heidelberg, 2009. Springer-Verlag. Google Scholar
  20. Daniel M. Sunday. A very fast substring search algorithm. Commun. ACM, 33(8):132-142, August 1990. Google Scholar
  21. J. Tarhio, J. Holub, and E. Giaquinta. Technology beats algorithms (in exact string matching). CoRR, abs/1612.01506, 2016. URL: http://arxiv.org/abs/1612.01506.
  22. Branislav Ďurian, Jan Holub, Hannu Peltola, and Jorma Tarhio. Tuning bndm with q-grams. In Proceedings of the Meeting on Algorithm Engineering &Expermiments, pages 29-37, Philadelphia, USA, 2009. Society for Industrial and Applied Mathematics. Google Scholar
  23. P. Weiner. Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory (swat 1973), pages 1-11, Oct 1973. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail