Exploiting New Properties of String Net Frequency for Efficient Computation

Guo, Peaker; Eades, Patrick; Wirth, Anthony; Zobel, Justin

doi:10.4230/LIPIcs.CPM.2024.16

Abstract

Knowing which strings in a massive text are significant (that is, which strings are common and distinct from other strings) is valuable for several applications, including text compression and tokenization. Frequency by itself is a poor indicator of significance, because the commonest strings are the shortest strings. A compelling alternative is net frequency, which has the property that strings with positive net frequency are of maximal length. However, net frequency remains relatively unexplored, and there is no prior art showing how to compute it efficiently. We first introduce a characteristic of net frequency that simplifies the original definition. With this, we study strings with positive net frequency in Fibonacci words. We then use our characteristic to solve two key problems related to net frequency. First, single-nf: how to compute the net frequency of a given string of length m in an input text of length n over an alphabet of size σ. Second, all-nf: given an input text of length n, how to report every string of positive net frequency (and its net frequency). Our methods leverage suffix arrays, components of the Burrows-Wheeler transform, and a solution to the coloured range listing problem. We show that, for both problems, our data structure has O(n) construction cost; with this structure, we solve single-nf in O(m + σ) time and all-nf in O(n) time. Experimentally, we find our method to be around 100 times faster than reasonable baselines for single-nf. For all-nf, our results show that, even with prior knowledge of the set of strings with positive net frequency, simply confirming that their net frequency is positive takes longer than with our purpose-designed method. All in all, we show that net frequency is a cogent method for identifying significant strings. We show how to calculate net frequency efficiently, and how to report efficiently the set of plausibly significant strings.
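The remark that the commonest strings are the shortest follows from a simple fact: every occurrence of a string contains an occurrence of each of its prefixes, so extending a string can never increase its frequency. The minimal Python sketch below illustrates only this point about plain frequency. The helper `frequency` and the sample text are hypothetical choices made for illustration; the sketch uses a naive scan and is not the paper's method, nor a net-frequency computation.

```python
def frequency(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, overlaps included (naive scan)."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    return sum(
        1
        for start in range(len(text) - len(pattern) + 1)
        if text.startswith(pattern, start)
    )


text = "abracadabra"  # illustrative input, not taken from the paper

# Frequencies of successively longer prefixes of "abrac".
prefix_counts = [(text[:k], frequency(text, text[:k])) for k in range(1, 6)]
for prefix, count in prefix_counts:
    print(prefix, count)
```

On this input the counts never increase as the prefix grows, so ranking by raw frequency alone would always favour the single character "a" over the longer repeated string "abra".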

