Documents authored by Guo, Peaker


Document
Net Occurrences in Fibonacci and Thue-Morse Words

Authors: Peaker Guo and Kaisei Kishi

Published in: LIPIcs, Volume 331, 36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025)


Abstract
A net occurrence of a repeated string in a text is an occurrence with unique left and right extensions, and the net frequency of the string is the number of its net occurrences in the text. Originally introduced for applications in Natural Language Processing, net frequency has recently gained attention for its algorithmic aspects. Guo et al. [CPM 2024] and Ohlebusch et al. [SPIRE 2024] focus on its computation in the offline setting, while Guo et al. [SPIRE 2024], Inenaga [arXiv 2024], and Mieno and Inenaga [CPM 2025] tackle the online counterpart. Mieno and Inenaga also characterize net occurrences in terms of the minimal unique substrings of the text. Additionally, Guo et al. [CPM 2024] initiate the study of net occurrences in Fibonacci words to establish a lower bound on the asymptotic running time of algorithms. Although there has been notable progress in algorithmic developments and some initial combinatorial insights, the combinatorial aspects of net occurrences have yet to be thoroughly examined. In this work, we make two key contributions. First, we confirm the conjecture that each Fibonacci word contains exactly three net occurrences. Second, we show that each Thue-Morse word contains exactly nine net occurrences. To achieve these results, we introduce the notion of overlapping net occurrence cover, which narrows down the candidate net occurrences in any text. Furthermore, we provide a precise characterization of occurrences of Fibonacci and Thue-Morse words of smaller order, offering structural insights that may have independent interest and potential applications in algorithm analysis and combinatorial properties of these words.
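The definition at the start of the abstract can be checked by brute force on small inputs. The following is a minimal sketch and not the authors' method: it assumes the text is padded with two sentinel characters (`#` and `$`, assumed not to occur in the text) so that extensions at the text boundaries count as unique, and it assumes the convention F_1 = b, F_2 = a, F_k = F_{k-1} F_{k-2} for Fibonacci words. The names `count_occurrences`, `net_occurrences`, and `fibonacci_word` are hypothetical helpers introduced for illustration.

```python
def count_occurrences(text, pattern):
    """Count possibly overlapping occurrences of pattern in text."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def net_occurrences(text):
    """Return all net occurrences as 0-based inclusive (start, end) pairs.

    An occurrence is reported when the string is repeated in the text
    while both its left extension and its right extension are unique.
    Assumption: '#' and '$' do not occur in text; they act as sentinels.
    """
    t = "#" + text + "$"
    n = len(text)
    result = []
    for i in range(1, n + 1):          # start position inside t
        for j in range(i, n + 1):      # end position inside t (inclusive)
            if count_occurrences(t, t[i:j + 1]) < 2:
                break                  # unique here, so longer strings are unique too
            left_unique = count_occurrences(t, t[i - 1:j + 1]) == 1
            right_unique = count_occurrences(t, t[i:j + 2]) == 1
            if left_unique and right_unique:
                result.append((i - 1, j - 1))
    return result


def fibonacci_word(k):
    """Fibonacci word of order k, assuming F_1 = 'b' and F_2 = 'a'."""
    prev, cur = "b", "a"
    for _ in range(k - 2):
        prev, cur = cur, cur + prev
    return cur if k >= 2 else prev


if __name__ == "__main__":
    # Count net occurrences for a few small orders.
    for k in range(3, 12):
        print(k, len(net_occurrences(fibonacci_word(k))))
```

The running time is polynomial but far from linear, so this is only suitable for experimenting with short words, for example to observe how the count behaves as the order grows.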

Cite as

Peaker Guo and Kaisei Kishi. Net Occurrences in Fibonacci and Thue-Morse Words. In 36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 331, pp. 16:1-16:22, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)


@InProceedings{guo_et_al:LIPIcs.CPM.2025.16,
  author =	{Guo, Peaker and Kishi, Kaisei},
  title =	{{Net Occurrences in Fibonacci and Thue-Morse Words}},
  booktitle =	{36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025)},
  pages =	{16:1--16:22},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-369-0},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{331},
  editor =	{Bonizzoni, Paola and M\"{a}kinen, Veli},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2025.16},
  URN =		{urn:nbn:de:0030-drops-231107},
  doi =		{10.4230/LIPIcs.CPM.2025.16},
  annote =	{Keywords: Fibonacci words, Thue-Morse words, net occurrence, net frequency, factorization}
}

Document
Exploiting New Properties of String Net Frequency for Efficient Computation

Authors: Peaker Guo, Patrick Eades, Anthony Wirth, and Justin Zobel

Published in: LIPIcs, Volume 296, 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024)


Abstract
Knowing which strings in a massive text are significant (that is, which strings are common and distinct from other strings) is valuable for several applications, including text compression and tokenization. Frequency in itself is not helpful for significance, because the commonest strings are the shortest strings. A compelling alternative is net frequency, which has the property that strings with positive net frequency are of maximal length. However, net frequency remains relatively unexplored, and there is no prior art showing how to compute it efficiently. We first introduce a characteristic of net frequency that simplifies the original definition. With this, we study strings with positive net frequency in Fibonacci words. We then use our characteristic to solve two key problems related to net frequency. First, single-nf: how to compute the net frequency of a given string of length m in an input text of length n over an alphabet of size σ. Second, all-nf: given a length-n input text, how to report every string of positive net frequency (and its net frequency). Our methods leverage suffix arrays, components of the Burrows-Wheeler transform, and a solution to the coloured range listing problem. We show that, for both problems, our data structure has O(n) construction cost: with this structure, we solve single-nf in O(m + σ) time and all-nf in O(n) time. Experimentally, we find our method to be around 100 times faster than reasonable baselines for single-nf. For all-nf, our results show that, even with prior knowledge of the set of strings with positive net frequency, simply confirming that their net frequency is positive takes longer than with our purpose-designed method. All in all, we show that net frequency is a cogent method for identifying significant strings. We show how to calculate net frequency efficiently, and how to report efficiently the set of plausibly significant strings.
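To make the two problem statements concrete, here is a naive baseline sketch for single-nf and all-nf. It is not the suffix-array method of the paper and does not achieve the stated O(m + σ) or O(n) bounds: it simply tabulates the frequency of every substring and then counts occurrences that are repeated but have unique left and right extensions. The sentinels `#` and `$` are assumed not to occur in the text, and the function names `substring_counts`, `single_nf`, and `all_nf` are hypothetical.

```python
from collections import Counter


def substring_counts(text):
    """Pad text with sentinels and count every non-empty substring."""
    t = "#" + text + "$"  # assumption: '#' and '$' are not in text
    counts = Counter()
    for i in range(len(t)):
        for j in range(i + 1, len(t) + 1):
            counts[t[i:j]] += 1
    return t, counts


def single_nf(text, s):
    """Net frequency of s: occurrences with unique left and right extensions."""
    t, counts = substring_counts(text)
    if counts[s] < 2:
        return 0  # only repeated strings can have positive net frequency
    m = len(s)
    nf = 0
    for i in range(1, len(t) - m):  # start positions that stay inside text
        if (t[i:i + m] == s
                and counts[t[i - 1:i + m]] == 1
                and counts[t[i:i + m + 1]] == 1):
            nf += 1
    return nf


def all_nf(text):
    """Map every string of positive net frequency to its net frequency."""
    t, counts = substring_counts(text)
    n = len(text)
    result = Counter()
    for i in range(1, n + 1):
        for j in range(i + 1, n + 2):  # t[i:j] lies inside text
            s = t[i:j]
            if counts[s] < 2:
                break  # longer strings starting here are unique as well
            if counts[t[i - 1:j]] == 1 and counts[t[i:j + 1]] == 1:
                result[s] += 1
    return dict(result)
```

For example, `all_nf("abab")` reports only `"ab"`, with net frequency 2, whereas `"a"` and `"b"` are repeated but each occurrence can be extended to another repeated string. The substring table takes quadratic space in the number of entries, so this baseline is only meant for short texts.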

Cite as

Peaker Guo, Patrick Eades, Anthony Wirth, and Justin Zobel. Exploiting New Properties of String Net Frequency for Efficient Computation. In 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 296, pp. 16:1-16:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)


@InProceedings{guo_et_al:LIPIcs.CPM.2024.16,
  author =	{Guo, Peaker and Eades, Patrick and Wirth, Anthony and Zobel, Justin},
  title =	{{Exploiting New Properties of String Net Frequency for Efficient Computation}},
  booktitle =	{35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024)},
  pages =	{16:1--16:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-326-3},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{296},
  editor =	{Inenaga, Shunsuke and Puglisi, Simon J.},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2024.16},
  URN =		{urn:nbn:de:0030-drops-201265},
  doi =		{10.4230/LIPIcs.CPM.2024.16},
  annote =	{Keywords: Fibonacci words, suffix arrays, Burrows-Wheeler transform, LCP arrays, irreducible LCP values, coloured range listing}
}