String Inference from Longest-Common-Prefix Array

Authors Juha Kärkkäinen, Marcin Piatkowski, Simon J. Puglisi



PDF
Thumbnail PDF

File

LIPIcs.ICALP.2017.62.pdf
  • Filesize: 0.53 MB
  • 14 pages

Document Identifiers

Author Details

Juha Kärkkäinen
Marcin Piatkowski
Simon J. Puglisi

Cite As Get BibTex

Juha Kärkkäinen, Marcin Piatkowski, and Simon J. Puglisi. String Inference from Longest-Common-Prefix Array. In 44th International Colloquium on Automata, Languages, and Programming (ICALP 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 80, pp. 62:1-62:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017) https://doi.org/10.4230/LIPIcs.ICALP.2017.62

Abstract

The suffix array, perhaps the most important data structure in modern string processing, is often augmented with the longest common prefix (LCP) array which stores the lengths of the LCPs for lexicographically adjacent suffixes of a string. Together the two arrays are roughly equivalent to the suffix tree with the LCP array representing the tree shape.

In order to better understand the combinatorics of LCP arrays, we consider the problem of inferring a string from an LCP array, i.e., determining whether a given array of integers is a valid LCP array, and if it is, reconstructing some string or all strings with that LCP array. There are recent studies of inferring a string from a suffix tree shape but using significantly more information (in the form of suffix links) than is available in the LCP array.

We provide two main results. (1) We describe two algorithms for inferring strings from an LCP array when we allow a generalized form of LCP array defined for a multiset of cyclic strings: a linear time algorithm for binary alphabet and a general algorithm with polynomial time complexity for a constant alphabet size. (2) We prove that determining whether a given integer array is a valid LCP array is NP-complete when we require more restricted forms of LCP array defined for a single cyclic or non-cyclic string or a multiset of non-cyclic strings. The result holds whether or not the alphabet is restricted to be binary. In combination, the two results show that the generalized form of LCP array for a multiset of cyclic strings is fundamentally different from the other more restricted forms.

Subject Classification

Keywords
  • LCP array
  • string inference
  • BWT
  • suffix array
  • suffix tree
  • NP-hardness

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Mohamed Ibrahim Abouelhoda, Stefan Kurtz, and Enno Ohlebusch. Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms, 2(1):53-86, 2004. URL: http://dx.doi.org/10.1016/S1570-8667(03)00065-0.
  2. Amihood Amir. Personal communication, String Masters in Rouen, France, 3-5 February, 2014. Google Scholar
  3. Alberto Apostolico. The myriad virtues of subword trees. In Combinatorial Algorithms on Words, NATO Advanced Science Institutes Series F12, pages 85-96. Springer-Verlag, 1985. URL: http://dx.doi.org/10.1007/978-3-642-82456-2_6.
  4. Alberto Apostolico, Maxime Crochemore, Martin Farach-Colton, Zvi Galil, and S. Muthukrishnan. 40 years of suffix trees. Communications of the ACM, 59(4):66-73, 2016. URL: http://dx.doi.org/10.1145/2810036.
  5. Hideo Bannai, Shunsuke Inenaga, Ayumi Shinohara, and Masayuki Takeda. Inferring strings from graphs and arrays. In Proceedings of Mathematical Foundations of Computer Science 2003, volume 2747 of Lecture Notes in Computer Science, pages 208-217. Springer, 2003. URL: http://dx.doi.org/10.1007/978-3-540-45138-9_15.
  6. Bastien Cazaux and Eric Rivals. Reverse engineering of compact suffix trees and links: A novel algorithm. Journal of Discrete Algorithms, 28:9-22, 2014. URL: http://dx.doi.org/10.1016/j.jda.2014.07.002.
  7. Julien Clément, Maxime Crochemore, and Giuseppina Rindone. Reverse engineering prefix tables. In Proceedings of 26th International Symposium on Theoretical Aspects of Computer Science, STACS 2009, volume 3 of Leibniz International Proceedings in Informatics, pages 289-300, 2009. URL: http://dx.doi.org/10.4230/LIPIcs.STACS.2009.1825.
  8. Maxime Crochemore and Lucian Ilie. Computing longest previous factor in linear time and applications. Information Processing Letters, 106(2):75-80, 2008. URL: http://dx.doi.org/10.1016/j.ipl.2007.10.006.
  9. Maxime Crochemore, Costas S. Iliopoulos, Solon P. Pissis, and German Tischler. Cover array string reconstruction. In Proceeding of 21st Annual Symposium on Combinatorial Pattern Matching, CPM 2010, volume 6129 of Lecture Notes in Computer Science, pages 251-259. Springer, 2010. URL: http://dx.doi.org/10.1007/978-3-642-13509-5_23.
  10. Jean-Pierre Duval, Thierry Lecroq, and Arnaud Lefebvre. Border array on bounded alphabet. Journal of Automata, Languages and Combinatorics, 10(1):51-60, 2005. Google Scholar
  11. Jean-Pierre Duval, Thierry Lecroq, and Arnaud Lefebvre. Efficient validation and construction of border arrays and validation of string matching automata. RAIRO Theoretical Informatics and Applications, 43(2):281-297, 2009. URL: http://dx.doi.org/10.1051/ita:2008030.
  12. Frantisĕk Franĕk, Shudi Gao, Weilin Lu, Patrick J. Ryan, William F. Smyth, Yu Sun, and Lu Yang. Verifying a border array in linear time. Journal on Combinatorial Mathematics and Combinatorial Computing, 42:223-236, 2002. URL: http://dx.doi.org/10.1.1.32.5012.
  13. Pawel Gawrychowski, Artur Jez, and Lukasz Jez. Validating the Knuth-Morris-Pratt failure function, fast and online. Theory of Computing Systems, 54(2):337-372, 2014. URL: http://dx.doi.org/10.1007/s00224-013-9522-8.
  14. Ira M. Gessel and Christophe Reutenauer. Counting permutations with given cycle structure and descent set. Journal of Combinatorial Theory, Series A, 64(2):189-215, 1993. URL: http://dx.doi.org/10.1016/0097-3165(93)90095-P.
  15. Dan Gusfield. Algorithms on Strings, Trees, and Sequences : Computer Science and Computational Biology. Cambridge University Press, Cambridge, United Kingdom, 1997. URL: http://dx.doi.org/10.1017/CBO9780511574931.
  16. Jing He, Hongyu Liang, and Guang Yang. Reversing longest previous factor tables is hard. In Proceedings of 12th International Symposium on Algorithms and Data Structures, WADS 2011, volume 6844 of Lecture Notes in Computer Science, pages 488-499. Springer, 2011. URL: http://dx.doi.org/10.1007/978-3-642-22300-6_41.
  17. Peter M. Higgins. Burrows-Wheeler transformations and de Bruijn words. Theoretical Computer Science, 457:128-136, 2012. URL: http://dx.doi.org/10.1016/j.tcs.2012.07.019.
  18. Tomohiro I, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Verifying and enumerating parameterized border arrays. Theoretical Computer Science, 412(50):6959-6981, 2011. URL: http://dx.doi.org/10.1016/j.tcs.2011.09.008.
  19. Tomohiro I, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Inferring strings from suffix trees and links on a binary alphabet. Discrete Applied Mathematics, 163:316-325, 2014. URL: http://dx.doi.org/10.1016/j.dam.2013.02.033.
  20. Juha Kärkkäinen, Dominik Kempa, and Marcin Pia̧tkowski. Tighter bounds for the sum of irreducible LCP values. Theoretical Computer Science, 656:265-278, 2015. URL: http://dx.doi.org/10.1016/j.tcs.2015.12.009.
  21. Juha Kärkkäinen, Marcin Piatkowski, and Simon J. Puglisi. String inference from the LCP array. CoRR, abs/1606.04573, 2016. URL: http://arxiv.org/abs/1606.04573.
  22. Gregory Kucherov, Lilla Tóthmérész, and Stéphane Vialette. On the combinatorics of suffix arrays. Information Processing Letters, 113(22-24):915-920, 2013. URL: http://dx.doi.org/10.1016/j.ipl.2013.09.009.
  23. Udi Manber and Gene W. Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 22(5):935-948, 1993. URL: http://dx.doi.org/10.1137/0222058.
  24. Sabrina Mantaci, Antonio Restivo, Giovanna Rosone, and Marinella Sciortino. An extension of the Burrows-Wheeler transform. Theoretical Computer Science, 387(3):298-312, 2007. URL: http://dx.doi.org/10.1016/j.tcs.2007.07.014.
  25. Yuto Nakashima, Takashi Okabe, Tomohiro I, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Inferring strings from Lyndon factorization. In Proceedings of Mathematical Foundations of Computer Science 2014, Part II, volume 8635 of Lecture Notes in Computer Science, pages 565-576. Springer, 2014. URL: http://dx.doi.org/10.1007/978-3-662-44465-8_48.
  26. Enno Ohlebusch. Bioinformatics Algorithms: Sequence Analysis, Genome Rearrangements, and Phylogenetic Reconstruction. Oldenbusch Verlag, 2013. Google Scholar
  27. Nicolas Philippe. Caractérisation et énumération des arbres compacts des suffixes. Master’s thesis, Université de Rouen, 2007. Google Scholar
  28. Klaus-Bernd Schürmann and Jens Stoye. Counting suffix arrays and strings. Theoretical Computer Science, 395(2-3):220-234, 2008. URL: http://dx.doi.org/10.1016/j.tcs.2008.01.011.
  29. Imre Simon. Piecewise testable events. In Proceedings of 2nd GI Conference on Automata Theory and Formal Languages, volume 33 of Lecture Notes in Computer Science, pages 214-222. Springer, 1975. URL: http://dx.doi.org/10.1007/3-540-07407-4_23.
  30. Bill Smyth. Computing Patterns in Strings. Pearson Addison-Wesley, Essex, England, 2003. Google Scholar
  31. Tatiana A. Starikovskaya and Hjalte Wedel Vildhøj. A suffix tree or not a suffix tree? Journal of Discrete Algorithms, 32:14-23, 2015. URL: http://dx.doi.org/10.1016/j.jda.2015.01.005.
  32. Peter Weiner. Linear pattern matching algorithms. In Proceedings of the 14th Annual Symposium on Switching and Automata Theory 1973, pages 1-11. IEEE Computer Society, 1973. URL: http://dx.doi.org/10.1109/SWAT.1973.13.
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail