Documents authored by Guo, Peaker


Document
Exploiting New Properties of String Net Frequency for Efficient Computation

Authors: Peaker Guo, Patrick Eades, Anthony Wirth, and Justin Zobel

Published in: LIPIcs, Volume 296, 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024)


Abstract
Knowing which strings in a massive text are significant (that is, which strings are common and distinct from other strings) is valuable for several applications, including text compression and tokenization. Frequency by itself is not a helpful measure of significance, because the commonest strings are the shortest strings. A compelling alternative is net frequency, which has the property that strings with positive net frequency are of maximal length. However, net frequency remains relatively unexplored, and there is no prior art showing how to compute it efficiently. We first introduce a characteristic of net frequency that simplifies the original definition. With this, we study strings with positive net frequency in Fibonacci words. We then use our characteristic to solve two key problems related to net frequency. First, single-nf: how to compute the net frequency of a given string of length m in an input text of length n over an alphabet of size σ. Second, all-nf: given a length-n input text, how to report every string of positive net frequency (and its net frequency). Our methods leverage suffix arrays, components of the Burrows-Wheeler transform, and a solution to the coloured range listing problem. We show that, for both problems, our data structure has O(n) construction cost: with this structure, we solve single-nf in O(m + σ) time and all-nf in O(n) time. Experimentally, we find our method to be around 100 times faster than reasonable baselines for single-nf. For all-nf, our results show that, even with prior knowledge of the set of strings with positive net frequency, simply confirming that their net frequency is positive takes longer than with our purpose-designed method. All in all, we show that net frequency is a cogent method for identifying significant strings. We show how to calculate net frequency efficiently, and how to report efficiently the set of plausibly significant strings.
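
To make the single-nf problem concrete, here is a minimal brute-force sketch in Python. The abstract does not state the definition of net frequency, so the sketch rests on an assumption: that the net frequency of a repeated string equals the number of its occurrences whose one-symbol left extension and one-symbol right extension each occur exactly once in the text, and that a string occurring fewer than twice has net frequency 0. The function names and the two sentinel characters are hypothetical and are not taken from the paper. This is not the authors' method; it uses no suffix array or Burrows-Wheeler component and is far slower than the O(m + σ) query time reported above.

```python
def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, including overlapping ones."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def naive_net_frequency(text: str, s: str) -> int:
    """Brute-force net frequency under the ASSUMED characterization:
    count the occurrences of s whose left and right one-symbol
    extensions are both unique in the text."""
    # Assumption: the two sentinels do not appear in the text; they give
    # every occurrence a left and a right neighbouring symbol.
    padded = "\x00" + text + "\x01"
    # Assumption: empty or non-repeated strings have net frequency 0.
    if not s or count_occurrences(padded, s) < 2:
        return 0
    m = len(s)
    net = 0
    start = padded.find(s)
    while start != -1:
        left_ext = padded[start - 1:start + m]   # one symbol to the left
        right_ext = padded[start:start + m + 1]  # one symbol to the right
        if (count_occurrences(padded, left_ext) == 1
                and count_occurrences(padded, right_ext) == 1):
            net += 1
        start = padded.find(s, start + 1)
    return net


print(naive_net_frequency("abcabd", "ab"))
print(naive_net_frequency("abcabd", "a"))
```

In the text "abcabd", the string "a" occurs twice but is always followed by "b", so neither occurrence counts, whereas both occurrences of "ab" have unique extensions on each side. This matches the intuition in the abstract that strings with positive net frequency are of maximal length.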

Cite as

Peaker Guo, Patrick Eades, Anthony Wirth, and Justin Zobel. Exploiting New Properties of String Net Frequency for Efficient Computation. In 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 296, pp. 16:1-16:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)


@InProceedings{guo_et_al:LIPIcs.CPM.2024.16,
  author =	{Guo, Peaker and Eades, Patrick and Wirth, Anthony and Zobel, Justin},
  title =	{{Exploiting New Properties of String Net Frequency for Efficient Computation}},
  booktitle =	{35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024)},
  pages =	{16:1--16:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-326-3},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{296},
  editor =	{Inenaga, Shunsuke and Puglisi, Simon J.},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2024.16},
  URN =		{urn:nbn:de:0030-drops-201265},
  doi =		{10.4230/LIPIcs.CPM.2024.16},
  annote =	{Keywords: Fibonacci words, suffix arrays, Burrows-Wheeler transform, LCP arrays, irreducible LCP values, coloured range listing}
}