On-Line Pattern Matching on Similar Texts

Grossi, Roberto; Iliopoulos, Costas S.; Liu, Chang; Pisanti, Nadia; Pissis, Solon P.; Retha, Ahmad; Rosone, Giovanna; Vayani, Fatima; Versari, Luca

doi:10.4230/LIPIcs.CPM.2017.9

Abstract

Pattern matching on a set of similar texts has recently received much attention, mainly due to its application in cataloguing human genetic variation. In particular, many different algorithms have been proposed for the off-line version of this problem; that is, constructing a compressed index for a set of similar texts in order to answer pattern matching queries efficiently. However, the on-line, more fundamental, version of this problem remains a rather undeveloped topic. Solutions to the on-line version can be beneficial for a number of reasons; for instance, efficient on-line solutions can be used in combination with partial indexes as practical trade-offs. Here we attempt to close this gap by proposing two efficient algorithms for this problem. Notably, for short patterns, one of the algorithms requires time linear in the size of the texts' representation. Furthermore, experimental results confirm our theoretical findings in practical terms.

