On Strings Having the Same Length- k Substrings

Authors Giulia Bernardini , Alessio Conte, Esteban Gabory, Roberto Grossi, Grigorios Loukides , Solon P. Pissis , Giulia Punzi, Michelle Sweering



PDF
Thumbnail PDF

File

LIPIcs.CPM.2022.16.pdf
  • Filesize: 0.9 MB
  • 17 pages

Document Identifiers

Author Details

Giulia Bernardini
  • University of Trieste, Italy
  • CWI, Amsterdam, The Netherlands
Alessio Conte
  • University of Pisa, Italy
Esteban Gabory
  • CWI, Amsterdam, The Netherlands
Roberto Grossi
  • University of Pisa, Italy
Grigorios Loukides
  • King’s College London, UK
Solon P. Pissis
  • CWI, Amsterdam, The Netherlands
  • Vrije Universiteit, Amsterdam, The Netherlands
Giulia Punzi
  • University of Pisa, Italy
Michelle Sweering
  • CWI, Amsterdam, The Netherlands

Cite As Get BibTex

Giulia Bernardini, Alessio Conte, Esteban Gabory, Roberto Grossi, Grigorios Loukides, Solon P. Pissis, Giulia Punzi, and Michelle Sweering. On Strings Having the Same Length- k Substrings. In 33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 223, pp. 16:1-16:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022) https://doi.org/10.4230/LIPIcs.CPM.2022.16

Abstract

Let Substr_k(X) denote the set of length-k substrings of a given string X for a given integer k > 0. We study the following basic string problem, called z-Shortest 𝒮_k-Equivalent Strings: Given a set 𝒮_k of n length-k strings and an integer z > 0, list z shortest distinct strings T₁,…,T_z such that Substr_k(T_i) = 𝒮_k, for all i ∈ [1,z]. The z-Shortest 𝒮_k-Equivalent Strings problem arises naturally as an encoding problem in many real-world applications; e.g., in data privacy, in data compression, and in bioinformatics. The 1-Shortest 𝒮_k-Equivalent Strings, referred to as Shortest 𝒮_k-Equivalent String, asks for a shortest string X such that Substr_k(X) = 𝒮_k. 
Our main contributions are summarized below:  
- Given a directed graph G(V,E), the Directed Chinese Postman (DCP) problem asks for a shortest closed walk that visits every edge of G at least once. DCP can be solved in 𝒪̃(|E||V|) time using an algorithm for min-cost flow. We show, via a non-trivial reduction, that if Shortest 𝒮_k-Equivalent String over a binary alphabet has a near-linear-time solution then so does DCP. 
- We show that the length of a shortest string output by Shortest 𝒮_k-Equivalent String is in 𝒪(k+n²). We generalize this bound by showing that the total length of z shortest strings is in 𝒪(zk+zn²+z²n). We derive these upper bounds by showing (asymptotically tight) bounds on the total length of z shortest Eulerian walks in general directed graphs. 
- We present an algorithm for solving z-Shortest 𝒮_k-Equivalent Strings in 𝒪(nk+n²log²n+zn²log n+|output|) time. If z = 1, the time becomes 𝒪(nk+n²log²n) by the fact that the size of the input is Θ(nk) and the size of the output is 𝒪(k+n²).

Subject Classification

ACM Subject Classification
  • Theory of computation → Pattern matching
Keywords
  • string algorithms
  • combinatorics on words
  • de Bruijn graph
  • Chinese Postman

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Laura Barker, Pamela Fleischmann, Katharina Harwardt, Florin Manea, and Dirk Nowotka. Scattered factor-universality of words. In Developments in Language Theory - 24th International Conference, volume 12086 of Lecture Notes in Computer Science, pages 14-28. Springer, 2020. URL: https://doi.org/10.1007/978-3-030-48516-0_2.
  2. Giulia Bernardini, Huiping Chen, Gabriele Fici, Grigorios Loukides, and Solon P. Pissis. Reverse-safe data structures for text indexing. In Proceedings of the Symposium on Algorithm Engineering and Experiments, ALENEX, pages 199-213, 2020. URL: https://doi.org/10.1137/1.9781611976007.16.
  3. Giulia Bernardini, Huiping Chen, Gabriele Fici, Grigorios Loukides, and Solon P. Pissis. Reverse-safe text indexing. ACM J. Exp. Algorithmics, 26, 2021. URL: https://doi.org/10.1145/3461698.
  4. Bastien Cazaux, Thierry Lecroq, and Eric Rivals. Linking indexing data structures to de Bruijn graphs: Construction and update. J. Comput. Syst. Sci., 104:165-183, 2019. URL: https://doi.org/10.1016/j.jcss.2016.06.008.
  5. Alessio Conte, Roberto Grossi, Grigorios Loukides, Nadia Pisanti, Solon P. Pissis, and Giulia Punzi. Beyond the BEST theorem: Fast assessment of Eulerian trails. In Fundamentals of Computation Theory - 23rd International Symposium,, volume 12867 of Lecture Notes in Computer Science, pages 162-175. Springer, 2021. URL: https://doi.org/10.1007/978-3-030-86593-1_11.
  6. Maxime Crochemore, Christophe Hancart, and Thierry Lecroq. Algorithms on strings. Cambridge University Press, 2007. Google Scholar
  7. Maxime Crochemore, Borivoj Melichar, and Zdenek Tronícek. Directed acyclic subsequence graph - overview. J. Discrete Algorithms, 1(3-4):255-280, 2003. URL: https://doi.org/10.1016/S1570-8667(03)00029-7.
  8. Jack R. Edmonds and Ellis L. Johnson. Matching, Euler tours and the Chinese Postman. Math. Program., 5(1):88-124, 1973. URL: https://doi.org/10.1007/BF01580113.
  9. Leonhard Euler. Solutio problematis ad geometriam situs pertinentis. Euler Archive - All Works, 53:15, 1741. Google Scholar
  10. Lukas Fleischer and Manfred Kufleitner. Testing Simon’s congruence. In 43rd International Symposium on Mathematical Foundations of Computer Science, volume 117 of LIPIcs, pages 62:1-62:13. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2018. URL: https://doi.org/10.4230/LIPIcs.MFCS.2018.62.
  11. Emmanuelle Garel. Minimal separators of two words. In Combinatorial Pattern Matching, 4th Annual Symposium, volume 684 of Lecture Notes in Computer Science, pages 35-53. Springer, 1993. URL: https://doi.org/10.1007/BFb0029795.
  12. Pawel Gawrychowski, Maria Kosche, Tore Koß, Florin Manea, and Stefan Siemer. Efficiently testing Simon’s congruence. In 38th International Symposium on Theoretical Aspects of Computer Science,, volume 187 of LIPIcs, pages 34:1-34:18. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021. URL: https://doi.org/10.4230/LIPIcs.STACS.2021.34.
  13. Andrew V. Goldberg and Robert Endre Tarjan. Solving minimum-cost flow problems by successive approximation. In Alfred V. Aho, editor, Proceedings of the 19th Annual ACM Symposium on Theory of Computing, pages 7-18. ACM, 1987. URL: https://doi.org/10.1145/28395.28397.
  14. Horst W. Hamacher. A note on K best network flows. Ann. Oper. Res., 57(1):65-72, 1995. URL: https://doi.org/10.1007/BF02099691.
  15. Jean-Jacques Hébrard. An algorithm for distinguishing efficiently bit-strings by their subsequences. Theor. Comput. Sci., 82(1):35-49, 1991. URL: https://doi.org/10.1016/0304-3975(91)90170-7.
  16. Joan P. Hutchinson. On words with prescribed overlapping subsequences. Utilitas Mathematica, 7:241-250, 1975. Google Scholar
  17. Joan P. Hutchinson and Herbert S. Wilf. On Eulerian circuits and words with prescribed adjacency patterns. J. Comb. Theory, Ser. A, 18(1):80-87, 1975. URL: https://doi.org/10.1016/0097-3165(75)90068-0.
  18. Prateek Karandikar, Manfred Kufleitner, and Philippe Schnoebelen. On the index of Simon’s congruence for piecewise testability. Inf. Process. Lett., 115(4):515-519, 2015. URL: https://doi.org/10.1016/j.ipl.2014.11.008.
  19. Prateek Karandikar and Philippe Schnoebelen. The height of piecewise-testable languages with applications in logical complexity. In 25th EACSL Annual Conference on Computer Science Logic, volume 62 of LIPIcs, pages 37:1-37:22. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2016. URL: https://doi.org/10.4230/LIPIcs.CSL.2016.37.
  20. Prateek Karandikar and Philippe Schnoebelen. The height of piecewise-testable languages and the complexity of the logic of subwords. Log. Methods Comput. Sci., 15(2), 2019. URL: https://doi.org/10.23638/LMCS-15(2:6)2019.
  21. Juhani Karhumäki, Svetlana Puzynina, Michaël Rao, and Markus A. Whiteland. On cardinalities of k-abelian equivalence classes. Theor. Comput. Sci., 658:190-204, 2017. URL: https://doi.org/10.1016/j.tcs.2016.06.010.
  22. Carl Kingsford, Michael C. Schatz, and Mihai Pop. Assembly complexity of prokaryotic genomes using short reads. BMC Bioinform., 11:21, 2010. URL: https://doi.org/10.1186/1471-2105-11-21.
  23. David Könen, Daniel R. Schmidt, and Christiane Spisla. Finding all minimum cost flows and a faster algorithm for the K best flow problem. CoRR, abs/2105.10225, 2021. URL: http://arxiv.org/abs/2105.10225.
  24. Kazuhiro Kurita and Kunihiro Wasa. Constant amortized time enumeration of Eulerian trails. CoRR, abs/2101.10473, 2021. URL: http://arxiv.org/abs/2101.10473.
  25. M. Lothaire. Combinatorics on Words. Cambridge Mathematical Library. Cambridge University Press, 2nd edition, 1997. URL: https://doi.org/10.1017/CBO9780511566097.
  26. Alessio Orlandi and Rossano Venturini. Space-efficient substring occurrence estimation. Algorithmica, 74(1):65-90, 2016. URL: https://doi.org/10.1007/s00453-014-9936-y.
  27. James B. Orlin. A faster strongly polynomial minimum cost flow algorithm. Oper. Res., 41(2):338-350, 1993. URL: https://doi.org/10.1287/opre.41.2.338.
  28. James B. Orlin. A polynomial time primal network simplex algorithm for minimum cost flows. Math. Program., 77:109-129, 1997. URL: https://doi.org/10.1007/BF02614365.
  29. Pavel A. Pevzner, Haixu Tang, and Michael S. Waterman. An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci, 98(17):9748-9753, 2001. URL: https://doi.org/10.1073/pnas.171285098.
  30. Antonio Sedeño-Noda and Juan José Espino-Martín. On the K best integer network flows. Comput. Oper. Res., 40(2):616-626, 2013. URL: https://doi.org/10.1016/j.cor.2012.08.014.
  31. Imre Simon. Piecewise testable events. In Automata Theory and Formal Languages, 2nd GI Conference, volume 33 of Lecture Notes in Computer Science, pages 214-222. Springer, 1975. URL: https://doi.org/10.1007/3-540-07407-4_23.
  32. Robert Endre Tarjan. Dynamic trees as search trees via Euler tours, applied to the network simplex algorithm. Math. Program., 77:169-177, 1997. URL: https://doi.org/10.1007/BF02614369.
  33. Zdenek Tronícek. Common subsequence automaton. In Implementation and Application of Automata, 7th International Conference, volume 2608 of Lecture Notes in Computer Science, pages 270-275. Springer, 2002. URL: https://doi.org/10.1007/3-540-44977-9_28.
  34. Tatyana van Aardenne-Ehrenfest and Nicolaas G. de Bruijn. Circuits and trees in oriented linear graphs. In Ira Gessel and Gian-Carlo Rota, editors, Classic Papers in Combinatorics, pages 149-163. Birkhäuser Boston, Boston, MA, 1987. URL: https://doi.org/10.1007/978-0-8176-4842-8_12.
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail