Lyndon Arrays in Sublinear Time

Authors Hideo Bannai , Jonas Ellert



PDF
Thumbnail PDF

File

LIPIcs.ESA.2023.14.pdf
  • Filesize: 0.87 MB
  • 16 pages

Document Identifiers

Author Details

Hideo Bannai
  • M&D Data Science Center, Tokyo Medical and Dental University, Japan
Jonas Ellert
  • Technical University of Dortmund, Germany

Cite AsGet BibTex

Hideo Bannai and Jonas Ellert. Lyndon Arrays in Sublinear Time. In 31st Annual European Symposium on Algorithms (ESA 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 274, pp. 14:1-14:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)
https://doi.org/10.4230/LIPIcs.ESA.2023.14

Abstract

A Lyndon word is a string that is lexicographically smaller than all of its non-trivial suffixes. For example, airbus is a Lyndon word, but amtrak is not a Lyndon word due to its suffix ak. The Lyndon array stores the length of the longest Lyndon prefix of each suffix of a string. For a length-n string over a general ordered alphabet, the array can be computed in O(n) time (Bille et al., ICALP 2020). However, on a word-RAM of word-width w ≥ log₂ n, linear time is not optimal if the string is over integer alphabet {0, … , σ} with σ ≪ n. In this case, the string can be stored in O(n log σ) bits (or O(n / log_σ n) words) of memory, and reading it takes only O(n / log_σ n) time. We show that O(n / log_σ n) time and words of space suffice to compute the succinct 2n-bit version of the Lyndon array. The time is optimal for w = O(log n). The algorithm uses precomputed lookup tables to perform significant parts of the computation in constant time. This is possible due to properties of periodic substrings, which we carefully analyze to achieve the desired result. We envision that the algorithm has applications in the computation of runs (maximal periodic substrings), where the Lyndon array plays a central role in both theoretically and practically fast algorithms.

Subject Classification

ACM Subject Classification
  • Theory of computation → Design and analysis of algorithms
Keywords
  • Lyndon forest
  • Lyndon table
  • Lyndon array
  • sublinear time algorithms
  • word RAM algorithms
  • word packing
  • tabulation
  • lookup tables
  • periodicity

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Maxim Babenko, Pawel Gawrychowski, Tomasz Kociumaka, and Tatiana Starikovskaya. Wavelet trees meet suffix trees. In Proceedings of the 26th Annual Symposium on Discrete Algorithms (SODA 2015), pages 572-591, San Diego, CA, USA, January 2015. URL: https://doi.org/10.1137/1.9781611973730.39.
  2. Golnaz Badkobeh, Maxime Crochemore, Jonas Ellert, and Cyril Nicaud. Back-to-front online Lyndon forest construction. In Proceedings of the 33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022), pages 13:1-13:23, Prague, Czech Republic, June 2022. URL: https://doi.org/10.4230/LIPIcs.CPM.2022.13.
  3. Uwe Baier. Linear-time Suffix Sorting - A New Approach for Suffix Array Construction. In Proceedings of the 27th Annual Symposium on Combinatorial Pattern Matching (CPM 2016), pages 23:1-23:12, Tel Aviv, Israel, June 2016. URL: https://doi.org/10.4230/LIPIcs.CPM.2016.23.
  4. Hideo Bannai, Tomohiro I, Shunsuke Inenaga, Yuto Nakashima, Masayuki Takeda, and Kazuya Tsuruta. The "runs" theorem. SIAM Journal on Computing, 46(5):1501-1514, 2017. URL: https://doi.org/10.1137/15M1011032.
  5. Nico Bertram, Jonas Ellert, and Johannes Fischer. Lyndon words accelerate suffix sorting. In Proceedings of the 29th Annual European Symposium on Algorithms (ESA 2021), pages 15:1-15:13, Lisbon, Portugal (Virtual Conference), September 2021. URL: https://doi.org/10.4230/LIPIcs.ESA.2021.15.
  6. Philip Bille, Jonas Ellert, Johannes Fischer, Inge Li Gørtz, Florian Kurpicz, J. Ian Munro, and Eva Rotenberg. Space efficient construction of Lyndon arrays in linear time. In Proceedings of the 47th International Colloquium on Automata, Languages, and Programming (ICALP 2020), pages 14:1-14:18, Saarbrücken, Germany (Virtual Conference), July 2020. URL: https://doi.org/10.4230/LIPIcs.ICALP.2020.14.
  7. P. Bonizzoni, M. Costantini, C. De Felice, A. Petescia, Y. Pirola, M. Previtali, R. Rizzi, J. Stoye, R. Zaccagnino, and R. Zizza. Numeric lyndon-based feature embedding of sequencing reads for machine learning approaches. Information Sciences, 607:458-476, 2022. URL: https://doi.org/10.1016/j.ins.2022.06.005.
  8. Paola Bonizzoni, Alessia Petescia, Yuri Pirola, Raffaella Rizzi, Rocco Zaccagnino, and Rosalba Zizza. Kfinger: Capturing overlaps between long reads by using lyndon fingerprints. In Proceedings of the International Work-Conference on Bioinformatics and Biomedical Engineering (IWBBIO 2022), pages 436-449, Gran Canaria, Spain, June 2022. URL: https://doi.org/10.1007/978-3-031-07802-6_37.
  9. K. T. Chen, R. H. Fox, and R. C. Lyndon. Free differential calculus, iv. the quotient groups of the lower central series. Annals of Mathematics, 68(1):81-95, 1958. URL: https://doi.org/10.2307/1970044.
  10. Maxime Crochemore and Lucian Ilie. Maximal repetitions in strings. Journal of Computer and System Sciences, 74(5):796-807, 2008. URL: https://doi.org/10.1016/j.jcss.2007.09.003.
  11. Maxime Crochemore, Lucian Ilie, and Liviu Tinta. The “runs” conjecture. Theoretical Computer Science, 412(27):2931-2941, 2011. URL: https://doi.org/10.1016/j.tcs.2010.06.019.
  12. Maxime Crochemore, Costas S. Iliopoulos, Tomasz Kociumaka, Ritu Kundu, Solon P. Pissis, Jakub Radoszewski, Wojciech Rytter, and Tomasz Walen. Near-optimal computation of runs over general alphabet via non-crossing LCE queries. In Proceedings of the 23rd International Symposium on String Processing and Information Retrieval (SPIRE 2016), pages 22-34, Beppu, Japan, October 2016. URL: https://doi.org/10.1007/978-3-319-46049-9_3.
  13. Maxime Crochemore and Luís M. S. Russo. Cartesian and Lyndon trees. Theoretical Computer Science, 806:1-9, 2020. URL: https://doi.org/10.1016/j.tcs.2018.08.011.
  14. Yevgeniy Dodis, Mihai Pătraşcu, and Mikkel Thorup. Changing base without losing space. In Proceedings of the 42nd Symposium on Theory of Computing (STOC 2010), pages 593-602, Cambridge, MA, USA, June 2010. URL: https://doi.org/10.1145/1806689.1806771.
  15. Jean-Pierre Duval. Factorizing words over an ordered alphabet. Journal of Algorithms, 4(4):363-381, 1983. URL: https://doi.org/10.1016/0196-6774(83)90017-2.
  16. Jonas Ellert. Lyndon Arrays Simplified. In Proceedings of the 30th Annual European Symposium on Algorithms (ESA 2022), pages 48:1-48:14, Potsdam, Germany, September 2022. URL: https://doi.org/10.4230/LIPIcs.ESA.2022.48.
  17. Jonas Ellert and Johannes Fischer. Linear time runs over general ordered alphabets. In Proceedings of the 48th International Colloquium on Automata, Languages, and Programming (ICALP 2021), pages 63:1-63:16, Glasgow, Scotland (Virtual Conference), July 2021. URL: https://doi.org/10.4230/LIPIcs.ICALP.2021.63.
  18. N. J. Fine and H. S. Wilf. Uniqueness theorems for periodic functions. Proceedings of the American Mathematical Society, 16(1):109-114, 1965. URL: https://doi.org/10.2307/2034009.
  19. Johannes Fischer. Optimal succinctness for range minimum queries. In Proceedings of the 9th Latin American Symposium on Theoretical Informatics (LATIN 2010), pages 158-169, Oaxaca, Mexico, April 2010. URL: https://doi.org/10.1007/978-3-642-12200-2_16.
  20. Johannes Fischer and Volker Heun. Theoretical and practical improvements on the RMQ-problem, with applications to LCA and LCE. In Proceedings of the 17th Annual Symposium on Combinatorial Pattern Matching (CPM 2006), pages 36-48, Barcelona, Spain, July 2006. URL: https://doi.org/10.1007/11780441_5.
  21. Frantisek Franek, A. S. M. Sohidull Islam, Mohammad Sohel Rahman, and William F. Smyth. Algorithms to compute the Lyndon array. In Proceedings of the Prague Stringology Conference (PSC 2016), pages 172-184, Prague, Czech Republic, August 2016. URL: http://www.stringology.org/event/2016/p15.html.
  22. Pawel Gawrychowski, Tomasz Kociumaka, Wojciech Rytter, and Tomasz Walen. Faster longest common extension queries in strings over general alphabets. In Proceedings of the 27th Annual Symposium on Combinatorial Pattern Matching (CPM 2016), pages 5:1-5:13, Tel Aviv, Israel, June 2016. URL: https://doi.org/10.4230/LIPIcs.CPM.2016.5.
  23. Torben Hagerup. Sorting and searching on the word ram. In Proceedings of the 15th Annual Symposium on Theoretical Aspects of Computer Science (STACS 1998), pages 366-398, Paris, France, February 1998. URL: https://doi.org/10.1007/BFb0028575.
  24. Christophe Hohlweg and Christophe Reutenauer. Lyndon words, permutations and trees. Theoretical Computer Science, 307(1):173-178, 2003. URL: https://doi.org/10.1016/S0304-3975(03)00099-9.
  25. Yusaku Kaneta. Fast wavelet tree construction in practice. In Proceedings of the 25th International Symposium on String Processing and Information Retrieval (SPIRE 2018), pages 218-232, Lima, Peru, October 2018. URL: https://doi.org/10.1007/978-3-030-00479-8_18.
  26. Dominik Kempa and Tomasz Kociumaka. String synchronizing sets: Sublinear-time BWT construction and optimal LCE data structure. In Proceedings of the 51st Annual Symposium on Theory of Computing (STOC 2019), pages 756-767, Phoenix, AZ, USA, June 2019. URL: https://doi.org/10.1145/3313276.3316368.
  27. Dominik Kempa and Tomasz Kociumaka. Breaking the o(n)-barrier in the construction of compressed suffix arrays and suffix trees. In Proceedings of the 34th Annual Symposium on Discrete Algorithms (SODA), pages 5122-5202, Florence, Italy, January 2023. URL: https://doi.org/10.1137/1.9781611977554.ch187.
  28. Takuya Kida, Tetsuya Matsumoto, Yusuke Shibata, Masayuki Takeda, Ayumi Shinohara, and Setsuo Arikawa. Collage system: a unifying framework for compressed pattern matching. Theoretical Computer Science, 298(1):253-272, 2003. URL: https://doi.org/10.1016/S0304-3975(02)00426-7.
  29. Roman M. Kolpakov and Gregory Kucherov. Finding maximal repetitions in a word in linear time. In Proceedings of the 40th Annual Symposium on Foundations of Computer Science (FOCS 1999), pages 596-604, New York, NY, USA, October 1999. URL: https://doi.org/10.1109/SFFCS.1999.814634.
  30. Dmitry Kosolobov. Computing runs on a general alphabet. Information Processing Letters, 116(3):241-244, 2016. URL: https://doi.org/10.1016/j.ipl.2015.11.016.
  31. Felipe A. Louza, Sabrina Mantaci, Giovanni Manzini, Marinella Sciortino, and Guilherme P. Telles. Inducing the Lyndon array. In Proceedings of the 26th International Symposium on String Processing and Information Retrieval (SPIRE 2019), pages 138-151, Segovia, Spain, October 2019. URL: https://doi.org/10.1007/978-3-030-32686-9_10.
  32. R. C. Lyndon. On Burnside problem I. Transactions of the American Mathematical Society, 77(2):202-215, 1954. URL: https://doi.org/10.2307/1990868.
  33. Udi Manber and Gene Myers. Suffix arrays: A new method for on-line string searches. In Proceedings of the 1st Annual Symposium on Discrete Algorithms (SODA 1990), pages 319-327, San Francisco, CA, USA, January 1990. URL: http://dl.acm.org/citation.cfm?id=320176.320218.
  34. Masamichi Miyazaki, Ayumi Shinohara, and Masayuki Takeda. An improved pattern matching algorithm for strings in terms of straight-line programs. In Proceedings of the 8th Annual Symposium on Combinatorial Pattern Matching (CPM 1997), pages 1-11, Aarhus, Denmark, June 1997. URL: https://doi.org/10.1007/3-540-63220-4_45.
  35. J. Ian Munro, Yakov Nekrich, and Jeffrey S. Vitter. Fast construction of wavelet trees. In Proceedings of the 21st International Symposium on String Processing and Information Retrieval (SPIRE 2014), Ouro Preto, Brazil, October 2014. URL: https://doi.org/10.1007/978-3-319-11918-2_10.
  36. J. Ian Munro and Venkatesh Raman. Succinct representation of balanced parentheses and static trees. SIAM Journal on Computing, 31(3):762-776, 2001. URL: https://doi.org/10.1137/S0097539799364092.
  37. Gonzalo Navarro and Kunihiko Sadakane. Fully functional static and dynamic succinct trees. ACM Transactions on Algorithms, 10(3), May 2014. URL: https://doi.org/10.1145/2601073.
  38. Jannik Olbrich, Enno Ohlebusch, and Thomas Büchler. On the optimisation of the GSACA suffix array construction algorithm. In Proceedings of the 29th International Symposium on String Processing and Information Retrieval (SPIRE 2022), pages 99-113, Concepción, Chile, November 2022. URL: https://doi.org/10.1007/978-3-031-20643-6_8.
  39. Simon J. Puglisi, Jamie Simpson, and W.F. Smyth. How many runs can a string contain? Theoretical Computer Science, 401(1):165-171, 2008. URL: https://doi.org/10.1016/j.tcs.2008.04.020.
  40. Wojciech Rytter. The number of runs in a string: Improved analysis of the linear upper bound. In Proceedings of the 24th Annual Symposium on Theoretical Aspects of Computer Science (STACS 2006), pages 184-195, Marseille, France, February 2006. URL: https://doi.org/10.1007/11672142_14.
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail