Space-Efficient Online Computation of String Net Occurrences

Mieno, Takuya; Inenaga, Shunsuke

doi:10.4230/LIPIcs.CPM.2025.23

Abstract

A substring u of a string T is said to be a repeat if u occurs at least twice in T. An occurrence [i..j] of a repeat u in T is said to be a net occurrence if each of the substrings aub = T[i-1..j+1], au = T[i-1..j], and ub = T[i..j+1] occurs exactly once in T. The occurrence [i-1..j+1] of aub is said to be an extended net occurrence of u. Let T be an input string of length n over an alphabet of size σ, and let ENO(T) denote the set of extended net occurrences of repeats in T. Guo et al. [SPIRE 2024] presented an online algorithm which can report ENO(T[1..i]) in T[1..i] in O(nσ²) time, for each prefix T[1..i] of T. Very recently, Inenaga [arXiv 2024] gave a faster online algorithm that can report ENO(T[1..i]) in optimal O(#ENO(T[1..i])) time for each prefix T[1..i] of T, where #S denotes the cardinality of a set S. Both of the aforementioned data structures can be maintained in O(n log σ) time and occupy O(n) space, where the O(n)-space requirement comes from the suffix tree data structure. In particular, Inenaga’s recent algorithm is based on Weiner’s right-to-left online suffix tree construction. In this paper, we show that one can modify Ukkonen’s left-to-right online suffix tree construction algorithm in O(n) space, so that ENO(T[1..i]) can be reported in optimal O(#ENO(T[1..i])) time for each prefix T[1..i] of T. This is an improvement over Guo et al.’s method that is also based on Ukkonen’s algorithm. Further, this leads us to the two following space-efficient alternatives:  
- A sliding-window algorithm of O(d) working space that can report ENO(T[i-d+1..i]) in optimal O(#ENO(T[i-d+1..i])) time for each sliding window T[i-d+1..i] of size d in T. 
- A CDAWG-based online algorithm of O(𝖾) working space that can report ENO(T[1..i]) in optimal O(#ENO(T[1..i])) time for each prefix T[1..i] of T, where 𝖾 < 2n is the number of edges in the CDAWG for T.

All of our proposed data structures can be maintained in O(n log σ) time for the input online string T. We also discuss that the extended net occurrences of repeats in T can be fully characterized in terms of the minimal unique substrings (MUSs) in T.
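To make the definition of an extended net occurrence concrete, the following is a minimal brute-force sketch. It is not the paper's algorithm: it checks the definition directly and runs in polynomial time rather than in optimal O(#ENO(T)) reporting time. The function names `count_occurrences` and `extended_net_occurrences` are hypothetical, and the sketch assumes that both neighbouring characters a = T[i-1] and b = T[j+1] must lie inside T (no sentinel characters are added at the ends).

```python
from typing import List, Tuple


def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, allowing overlaps."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def extended_net_occurrences(text: str) -> List[Tuple[int, int]]:
    """Return ENO(text) as 1-based inclusive pairs (i-1, j+1).

    Assumption: an occurrence [i..j] only qualifies if T[i-1] and
    T[j+1] both exist, so 2 <= i <= j <= n-1.
    """
    n = len(text)
    result = []
    for i in range(2, n):          # 1-based start of u
        for j in range(i, n):      # 1-based end of u
            u = text[i - 1:j]          # T[i..j]
            au = text[i - 2:j]         # T[i-1..j]
            ub = text[i - 1:j + 1]     # T[i..j+1]
            aub = text[i - 2:j + 1]    # T[i-1..j+1]
            is_repeat = count_occurrences(text, u) >= 2
            extensions_unique = all(
                count_occurrences(text, s) == 1 for s in (au, ub, aub)
            )
            if is_repeat and extensions_unique:
                result.append((i - 1, j + 1))
    return result


print(extended_net_occurrences("xabyabz"))  # [(1, 4), (4, 7)]
print(extended_net_occurrences("abcabd"))   # [(3, 6)]
```

In "xabyabz" the repeat "ab" has two net occurrences, extended to "xaby" and "yabz". In "abcabd" only the second occurrence of "ab" qualifies under the stated assumption, because the first has no left neighbour. Applying the same function to a slice such as `text[i - d:i]` gives a reference answer for the sliding-window variant, which can be useful for checking a faster implementation.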

Cite As

Takuya Mieno and Shunsuke Inenaga. Space-Efficient Online Computation of String Net Occurrences. In 36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 331, pp. 23:1-23:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025) https://doi.org/10.4230/LIPIcs.CPM.2025.23

Author Details

Takuya Mieno

Department of Computer and Network Engineering, University of Electro-Communications, Chofu, Japan

Shunsuke Inenaga

Department of Informatics, Kyushu University, Fukuoka, Japan

Funding

Mieno, Takuya: JSPS KAKENHI Grant Number JP24K20734.
Inenaga, Shunsuke: JSPS KAKENHI Grant Numbers JP23K24808, JP23K18466, JP20H05964.

References

Djamal Belazzougui, Fabio Cunial, Travis Gagie, Nicola Prezza, and Mathieu Raffinot. Composite repetition-aware data structures. In CPM 2015, volume 9133 of Lecture Notes in Computer Science, pages 26-39. Springer, 2015. URL: https://doi.org/10.1007/978-3-319-19929-0_3.
Anselm Blumer, Janet Blumer, David Haussler, Ross M. McConnell, and Andrzej Ehrenfeucht. Complete inverted files for efficient text retrieval and analysis. J. ACM, 34(3):578-595, 1987. URL: https://doi.org/10.1145/28869.28873.
Dany Breslauer and Giuseppe F. Italiano. On suffix extensions in suffix trees. Theor. Comput. Sci., 457:27-34, 2012. URL: https://doi.org/10.1016/J.TCS.2012.07.018.
Michael Burrows and David J. Wheeler. A block-sorting lossless data compression algorithm. Technical report, DIGITAL System Research Center, 1994.
Peaker Guo, Patrick Eades, Anthony Wirth, and Justin Zobel. Exploiting new properties of string net frequency for efficient computation. In CPM 2024, pages 16:1-16:16, 2024. URL: https://doi.org/10.4230/LIPICS.CPM.2024.16.
Peaker Guo, Seeun William Umboh, Anthony Wirth, and Justin Zobel. Online computation of string net frequency. In SPIRE 2024, volume 14899 of Lecture Notes in Computer Science, pages 159-173. Springer, 2024. URL: https://doi.org/10.1007/978-3-031-72200-4_12.
Lucian Ilie and William F. Smyth. Minimum unique substrings and maximum repeats. Fundam. Informaticae, 110(1-4):183-195, 2011. URL: https://doi.org/10.3233/FI-2011-536.
Shunsuke Inenaga. Faster and simpler online computation of string net frequency. CoRR, abs/2410.06837, 2024. URL: https://doi.org/10.48550/arXiv.2410.06837.
Shunsuke Inenaga, Hiromasa Hoshino, Ayumi Shinohara, Masayuki Takeda, Setsuo Arikawa, Giancarlo Mauri, and Giulio Pavesi. On-line construction of compact directed acyclic word graphs. Discret. Appl. Math., 146(2):156-179, 2005. URL: https://doi.org/10.1016/J.DAM.2004.04.012.
Shunsuke Inenaga, Takuya Mieno, Hiroki Arimura, Mitsuru Funakoshi, and Yuta Fujishige. Computing minimal absent words and extended bispecial factors with CDAWG space. In IWOCA 2024, volume 14764 of Lecture Notes in Computer Science, pages 327-340. Springer, 2024. URL: https://doi.org/10.1007/978-3-031-63021-7_25.
N. Jesper Larsson. Extended application of suffix trees to data compression. In DCC 1996, pages 190-199. IEEE Computer Society, 1996. URL: https://doi.org/10.1109/DCC.1996.488324.
Laurentius Leonard, Shunsuke Inenaga, Hideo Bannai, and Takuya Mieno. Sliding suffix trees simplified. CoRR, abs/2307.01412, 2023. URL: https://doi.org/10.48550/arXiv.2307.01412.
Yih-Jeng Lin and Ming-Shing Yu. Extracting Chinese frequent strings without dictionary from a Chinese corpus, its applications. J. Inf. Sci. Eng., 17(5):805-824, 2001. URL: http://www.iis.sinica.edu.tw/page/jise/2001/200109_07.html.
Yih-Jeng Lin and Ming-Shing Yu. The properties and further applications of Chinese frequent strings. In International Journal of Computational Linguistics & Chinese Language Processing, Volume 9, Number 1, February 2004: Special Issue on Selected Papers from ROCLING XV, pages 113-128, 2004.
Udi Manber and Eugene W. Myers. Suffix arrays: A new method for on-line string searches. SIAM J. Comput., 22(5):935-948, 1993. URL: https://doi.org/10.1137/0222058.
Takuya Mieno, Yuta Fujishige, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Computing minimal unique substrings for a sliding window. Algorithmica, 84(3):670-693, 2022. URL: https://doi.org/10.1007/S00453-021-00864-1.
Takaaki Nishimoto and Yasuo Tabei. R-enum: Enumeration of characteristic substrings in BWT-runs bounded space. In CPM 2021, volume 191 of LIPIcs, pages 21:1-21:21, 2021. URL: https://doi.org/10.4230/LIPICS.CPM.2021.21.
Enno Ohlebusch, Thomas Büchler, and Jannik Olbrich. Faster computation of Chinese frequent strings and their net frequencies. In SPIRE 2024, volume 14899 of Lecture Notes in Computer Science, pages 249-256. Springer, 2024. URL: https://doi.org/10.1007/978-3-031-72200-4_19.
Jakub Radoszewski and Wojciech Rytter. On the structure of compacted subword graphs of Thue-Morse words and their applications. J. Discrete Algorithms, 11:15-24, 2012. URL: https://doi.org/10.1016/J.JDA.2011.01.001.
Wojciech Rytter. The structure of subword graphs and suffix trees of Fibonacci words. Theor. Comput. Sci., 363(2):211-223, 2006. URL: https://doi.org/10.1016/J.TCS.2006.07.025.
Martin Senft. Suffix tree for a sliding window: An overview. In WDS 2005, volume 5, pages 41-46, 2005.
Esko Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249-260, 1995. URL: https://doi.org/10.1007/BF01206331.
Peter Weiner. Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory, Iowa City, Iowa, USA, October 15-17, 1973, pages 1-11. IEEE Computer Society, 1973. URL: https://doi.org/10.1109/SWAT.1973.13.
