Search Results

Documents authored by Zobel, Justin


Document
Exploiting New Properties of String Net Frequency for Efficient Computation

Authors: Peaker Guo, Patrick Eades, Anthony Wirth, and Justin Zobel

Published in: LIPIcs, Volume 296, 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024)


Abstract
Knowing which strings in a massive text are significant - that is, which strings are common and distinct from other strings - is valuable for several applications, including text compression and tokenization. Frequency in itself is not helpful for significance, because the commonest strings are the shortest strings. A compelling alternative is net frequency, which has the property that strings with positive net frequency are of maximal length. However, net frequency remains relatively unexplored, and there is no prior art showing how to compute it efficiently. We first introduce a characteristic of net frequency that simplifies the original definition. With this, we study strings with positive net frequency in Fibonacci words. We then use our characteristic and solve two key problems related to net frequency. First, single-nf, how to compute the net frequency of a given string of length m, in an input text of length n over an alphabet size σ. Second, all-nf, given length-n input text, how to report every string of positive net frequency (and its net frequency). Our methods leverage suffix arrays, components of the Burrows-Wheeler transform, and solution to the coloured range listing problem. We show that, for both problems, our data structure has O(n) construction cost: with this structure, we solve single-nf in O(m + σ) time and all-nf in O(n) time. Experimentally, we find our method to be around 100 times faster than reasonable baselines for single-nf. For all-nf, our results show that, even with prior knowledge of the set of strings with positive net frequency, simply confirming that their net frequency is positive takes longer than with our purpose-designed method. All in all, we show that net frequency is a cogent method for identifying significant strings. We show how to calculate net frequency efficiently, and how to report efficiently the set of plausibly significant strings.

Cite as

Peaker Guo, Patrick Eades, Anthony Wirth, and Justin Zobel. Exploiting New Properties of String Net Frequency for Efficient Computation. In 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 296, pp. 16:1-16:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)


Copy BibTex To Clipboard

@InProceedings{guo_et_al:LIPIcs.CPM.2024.16,
  author =	{Guo, Peaker and Eades, Patrick and Wirth, Anthony and Zobel, Justin},
  title =	{{Exploiting New Properties of String Net Frequency for Efficient Computation}},
  booktitle =	{35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024)},
  pages =	{16:1--16:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-326-3},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{296},
  editor =	{Inenaga, Shunsuke and Puglisi, Simon J.},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2024.16},
  URN =		{urn:nbn:de:0030-drops-201265},
  doi =		{10.4230/LIPIcs.CPM.2024.16},
  annote =	{Keywords: Fibonacci words, suffix arrays, Burrows-Wheeler transform, LCP arrays, irreducible LCP values, coloured range listing}
}
Document
Recency Queries with Succinct Representation

Authors: William L. Holland, Anthony Wirth, and Justin Zobel

Published in: LIPIcs, Volume 181, 31st International Symposium on Algorithms and Computation (ISAAC 2020)


Abstract
In the context of the sliding-window set membership problem, and caching policies that require knowledge of item recency, we formalize the problem of Recency on a stream. Informally, the query asks, "when was the last time I saw item x?" Existing structures, such as hash tables, can support a recency query by augmenting item occurrences with timestamps. To support recency queries on a window of W items, this might require Θ(W log W) bits. We propose a succinct data structure for Recency. By combining sliding-window dictionaries in a hierarchical structure, and careful design of the underlying hash tables, we achieve a data structure that returns a 1+ε approximation to the recency of every item in O(log(ε W)) time, in only (1+o(1))(1+ε)(ℬ+Wlog(ε^(-1))) bits. Here, ℬ is the information-theoretic lower bound on the number of bits for a set of size W, in a universe of cardinality N.

Cite as

William L. Holland, Anthony Wirth, and Justin Zobel. Recency Queries with Succinct Representation. In 31st International Symposium on Algorithms and Computation (ISAAC 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 181, pp. 49:1-49:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)


Copy BibTex To Clipboard

@InProceedings{holland_et_al:LIPIcs.ISAAC.2020.49,
  author =	{Holland, William L. and Wirth, Anthony and Zobel, Justin},
  title =	{{Recency Queries with Succinct Representation}},
  booktitle =	{31st International Symposium on Algorithms and Computation (ISAAC 2020)},
  pages =	{49:1--49:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-173-3},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{181},
  editor =	{Cao, Yixin and Cheng, Siu-Wing and Li, Minming},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ISAAC.2020.49},
  URN =		{urn:nbn:de:0030-drops-133931},
  doi =		{10.4230/LIPIcs.ISAAC.2020.49},
  annote =	{Keywords: Succinct Data Structures, Data Streams, Sliding Dictionary}
}
Document
From Evaluating to Forecasting Performance: How to Turn Information Retrieval, Natural Language Processing and Recommender Systems into Predictive Sciences (Dagstuhl Perspectives Workshop 17442)

Authors: Nicola Ferro, Norbert Fuhr, Gregory Grefenstette, Joseph A. Konstan, Pablo Castells, Elizabeth M. Daly, Thierry Declerck, Michael D. Ekstrand, Werner Geyer, Julio Gonzalo, Tsvi Kuflik, Krister Lindén, Bernardo Magnini, Jian-Yun Nie, Raffaele Perego, Bracha Shapira, Ian Soboroff, Nava Tintarev, Karin Verspoor, Martijn C. Willemsen, and Justin Zobel

Published in: Dagstuhl Manifestos, Volume 7, Issue 1 (2018)


Abstract
We describe the state-of-the-art in performance modeling and prediction for Information Retrieval (IR), Natural Language Processing (NLP) and Recommender Systems (RecSys) along with its shortcomings and strengths. We present a framework for further research, identifying five major problem areas: understanding measures, performance analysis, making underlying assumptions explicit, identifying application features determining performance, and the development of prediction models describing the relationship between assumptions, features and resulting performance.

Cite as

Nicola Ferro, Norbert Fuhr, Gregory Grefenstette, Joseph A. Konstan, Pablo Castells, Elizabeth M. Daly, Thierry Declerck, Michael D. Ekstrand, Werner Geyer, Julio Gonzalo, Tsvi Kuflik, Krister Lindén, Bernardo Magnini, Jian-Yun Nie, Raffaele Perego, Bracha Shapira, Ian Soboroff, Nava Tintarev, Karin Verspoor, Martijn C. Willemsen, and Justin Zobel. From Evaluating to Forecasting Performance: How to Turn Information Retrieval, Natural Language Processing and Recommender Systems into Predictive Sciences (Dagstuhl Perspectives Workshop 17442). In Dagstuhl Manifestos, Volume 7, Issue 1, pp. 96-139, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)


Copy BibTex To Clipboard

@Article{ferro_et_al:DagMan.7.1.96,
  author =	{Ferro, Nicola and Fuhr, Norbert and Grefenstette, Gregory and Konstan, Joseph A. and Castells, Pablo and Daly, Elizabeth M. and Declerck, Thierry and Ekstrand, Michael D. and Geyer, Werner and Gonzalo, Julio and Kuflik, Tsvi and Lind\'{e}n, Krister and Magnini, Bernardo and Nie, Jian-Yun and Perego, Raffaele and Shapira, Bracha and Soboroff, Ian and Tintarev, Nava and Verspoor, Karin and Willemsen, Martijn C. and Zobel, Justin},
  title =	{{From Evaluating to Forecasting Performance: How to Turn Information Retrieval, Natural Language Processing and Recommender Systems into Predictive Sciences (Dagstuhl Perspectives Workshop 17442)}},
  pages =	{96--139},
  journal =	{Dagstuhl Manifestos},
  ISSN =	{2193-2433},
  year =	{2018},
  volume =	{7},
  number =	{1},
  editor =	{Ferro, Nicola and Fuhr, Norbert and Grefenstette, Gregory and Konstan, Joseph A. and Castells, Pablo and Daly, Elizabeth M. and Declerck, Thierry and Ekstrand, Michael D. and Geyer, Werner and Gonzalo, Julio and Kuflik, Tsvi and Lind\'{e}n, Krister and Magnini, Bernardo and Nie, Jian-Yun and Perego, Raffaele and Shapira, Bracha and Soboroff, Ian and Tintarev, Nava and Verspoor, Karin and Willemsen, Martijn C. and Zobel, Justin},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/DagMan.7.1.96},
  URN =		{urn:nbn:de:0030-drops-98987},
  doi =		{10.4230/DagMan.7.1.96},
  annote =	{Keywords: Information Systems, Formal models, Evaluation, Simulation, User Interaction}
}
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail