4 Search Results for "Ferragina, Paolo"


Document
Learned Monotone Minimal Perfect Hashing

Authors: Paolo Ferragina, Hans-Peter Lehmann, Peter Sanders, and Giorgio Vinciguerra

Published in: LIPIcs, Volume 274, 31st Annual European Symposium on Algorithms (ESA 2023)


Abstract
A Monotone Minimal Perfect Hash Function (MMPHF) constructed on a set S of keys is a function that maps each key in S to its rank. On keys not in S, the function returns an arbitrary value. Applications range from databases, search engines, data encryption, to pattern-matching algorithms. In this paper, we describe LeMonHash, a new technique for constructing MMPHFs for integers. The core idea of LeMonHash is surprisingly simple and effective: we learn a monotone mapping from keys to their rank via an error-bounded piecewise linear model (the PGM-index), and then we solve the collisions that might arise among keys mapping to the same rank estimate by associating small integers with them in a retrieval data structure (BuRR). On synthetic random datasets, LeMonHash needs 34% less space than the next larger competitor, while achieving about 16 times faster queries. On real-world datasets, the space usage is very close to or much better than the best competitors, while achieving up to 19 times faster queries than the next larger competitor. As far as the construction of LeMonHash is concerned, we get an improvement by a factor of up to 2, compared to the competitor with the next best space usage. We also investigate the case of keys being variable-length strings, introducing the so-called LeMonHash-VL: it needs space within 13% of the best competitors while achieving up to 3 times faster queries than the next larger competitor.

Cite as

Paolo Ferragina, Hans-Peter Lehmann, Peter Sanders, and Giorgio Vinciguerra. Learned Monotone Minimal Perfect Hashing. In 31st Annual European Symposium on Algorithms (ESA 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 274, pp. 46:1-46:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)


Copy BibTex To Clipboard

@InProceedings{ferragina_et_al:LIPIcs.ESA.2023.46,
  author =	{Ferragina, Paolo and Lehmann, Hans-Peter and Sanders, Peter and Vinciguerra, Giorgio},
  title =	{{Learned Monotone Minimal Perfect Hashing}},
  booktitle =	{31st Annual European Symposium on Algorithms (ESA 2023)},
  pages =	{46:1--46:17},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-295-2},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{274},
  editor =	{G{\o}rtz, Inge Li and Farach-Colton, Martin and Puglisi, Simon J. and Herman, Grzegorz},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2023.46},
  URN =		{urn:nbn:de:0030-drops-186990},
  doi =		{10.4230/LIPIcs.ESA.2023.46},
  annote =	{Keywords: compressed data structure, monotone minimal perfect hashing, retrieval}
}
Document
Invited Talk
Compact Text Indexing for Advanced Pattern Matching Problems: Parameterized, Order-Isomorphic, 2D, etc. (Invited Talk)

Authors: Sharma V. Thankachan

Published in: LIPIcs, Volume 223, 33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022)


Abstract
In the past two decades, we have witnessed the design of various compact data structures for pattern matching over an indexed text [Navarro, 2016]. Popular indexes like the FM-index [Paolo Ferragina and Giovanni Manzini, 2005], compressed suffix arrays/trees [Roberto Grossi and Jeffrey Scott Vitter, 2005; Kunihiko Sadakane, 2007], the recent r-index [Travis Gagie et al., 2020; Takaaki Nishimoto and Yasuo Tabei, 2021], etc., capture the key functionalities of classic suffix arrays/trees [Udi Manber and Eugene W. Myers, 1993; Peter Weiner, 1973] in compact space. Mostly, they rely on the Burrows-Wheeler Transform (BWT) and its associated operations [Burrows and Wheeler, 1994]. However, compactly encoding some advanced suffix tree (ST) variants, like parameterized ST [Brenda S. Baker, 1993; S. Rao Kosaraju, 1995; Juan Mendivelso et al., 2020], order-isomorphic/preserving ST [Maxime Crochemore et al., 2016], two-dimensional ST [Raffaele Giancarlo, 1995; Dong Kyue Kim et al., 1998], etc. [Sung Gwan Park et al., 2019; Tetsuo Shibuya, 2000]- collectively known as suffix trees with missing suffix links [Richard Cole and Ramesh Hariharan, 2003], has been challenging. The previous techniques are not easily extendable because these variants do not hold some structural properties of the standard ST that enable compression. However, some limited progress has been made in these directions recently [Arnab Ganguly et al., 2017; Travis Gagie et al., 2017; Gianni Decaroli et al., 2017; Dhrumil Patel and Rahul Shah, 2021; Arnab Ganguly et al., 2021; Sung{-}Hwan Kim and Hwan{-}Gue Cho, 2021; Sung{-}Hwan Kim and Hwan{-}Gue Cho, 2021; Arnab Ganguly et al., 2017; Arnab Ganguly et al., 2022; Arnab Ganguly et al., 2021]. This talk will briefly survey them and highlight some interesting open problems.

Cite as

Sharma V. Thankachan. Compact Text Indexing for Advanced Pattern Matching Problems: Parameterized, Order-Isomorphic, 2D, etc. (Invited Talk). In 33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 223, pp. 3:1-3:3, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)


Copy BibTex To Clipboard

@InProceedings{thankachan:LIPIcs.CPM.2022.3,
  author =	{Thankachan, Sharma V.},
  title =	{{Compact Text Indexing for Advanced Pattern Matching Problems: Parameterized, Order-Isomorphic, 2D, etc.}},
  booktitle =	{33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022)},
  pages =	{3:1--3:3},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-234-1},
  ISSN =	{1868-8969},
  year =	{2022},
  volume =	{223},
  editor =	{Bannai, Hideo and Holub, Jan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2022.3},
  URN =		{urn:nbn:de:0030-drops-161300},
  doi =		{10.4230/LIPIcs.CPM.2022.3},
  annote =	{Keywords: Text Indexing, Suffix Trees, String Matching}
}
Document
Inverse Suffix Array Queries for 2-Dimensional Pattern Matching in Near-Compact Space

Authors: Dhrumil Patel and Rahul Shah

Published in: LIPIcs, Volume 212, 32nd International Symposium on Algorithms and Computation (ISAAC 2021)


Abstract
In a 2-dimensional (2D) pattern matching problem, the text is arranged as a matrix 𝖬[1..n, 1..n] and consists of N = n × n symbols drawn from alphabet set Σ of size σ. The query consists of a m × m square matrix 𝖯[1..m, 1..m] drawn from the same alphabet set Σ and the task is to find all the locations in 𝖬 where 𝖯 appears as a (contiguous) submatrix. The patterns can be of any size, but as long as they are square in shape data structures like suffix trees and suffix array exist [Raffaele Giancarlo, 1995; Dong Kyue Kim et al., 1998] for the task of efficient pattern matching. These are essentially 2D counterparts of classic suffix trees and arrays known for traditional 1-dimensional (1D) pattern matching. They work based on linearization of 2D suffixes which would preserve the prefix match property (i.e., every pattern match is a prefix of some suffix). The main limitation of the suffix trees and the suffix arrays (in 1D) was their space utilization of O(N log N) bits, where N is the size of the text. This was suboptimal compared to Nlog σ bits of space, which is information theoretic optimal for the text. With the advent of the field of succinct/compressed data structures, it was possible to develop compressed variants of suffix trees and array based on Burrows-Wheeler Tansform and LF-mapping (or Φ function) [Roberto Grossi and Jeffrey Scott Vitter, 2005; Paolo Ferragina and Giovanni Manzini, 2005; Kunihiko Sadakane, 2007]. These data structures indeed achieve O(N log σ) bits of space or better. This gives rise to the question: analogous to 1D case, can we design a succinct or compressed index for 2D pattern matching? Can there be a 2D compressed suffix tree? Are there analogues of Burrows-Wheeler Transform or LF-mapping? The problem has been acknowledged for over a decade now and there have been a few attempts at applying Φ function [Ankur Gupta, 2004] and achieving entropy based compression [Veli Mäkinen and Gonzalo Navarro, 2008]. However, achieving the complexity breakthrough akin to 1D case has yet to be found. In this paper, we still do not know how to answer suffix array queries in O(N log σ) bits of space - which would have led to efficient pattern matching. However, for the first time, we show an interesting result that it is indeed possible to compute inverse suffix array (ISA) queries in near compact space in O(polylog n) time. Our 2D succinct text index design is based on two 1D compressed suffix trees and it takes O(N log log N + N logσ) bits of space which is much smaller than its naive design that takes O(N log N) bits. Although the main problem is still evasive, this index gives a hope on the existence of a full 2D succinct index with all functionalities similar to that of 1D case.

Cite as

Dhrumil Patel and Rahul Shah. Inverse Suffix Array Queries for 2-Dimensional Pattern Matching in Near-Compact Space. In 32nd International Symposium on Algorithms and Computation (ISAAC 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 212, pp. 60:1-60:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


Copy BibTex To Clipboard

@InProceedings{patel_et_al:LIPIcs.ISAAC.2021.60,
  author =	{Patel, Dhrumil and Shah, Rahul},
  title =	{{Inverse Suffix Array Queries for 2-Dimensional Pattern Matching in Near-Compact Space}},
  booktitle =	{32nd International Symposium on Algorithms and Computation (ISAAC 2021)},
  pages =	{60:1--60:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-214-3},
  ISSN =	{1868-8969},
  year =	{2021},
  volume =	{212},
  editor =	{Ahn, Hee-Kap and Sadakane, Kunihiko},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.ISAAC.2021.60},
  URN =		{urn:nbn:de:0030-drops-154932},
  doi =		{10.4230/LIPIcs.ISAAC.2021.60},
  annote =	{Keywords: Pattern Matching, Succinct Data Structures}
}
Document
Repetition- and Linearity-Aware Rank/Select Dictionaries

Authors: Paolo Ferragina, Giovanni Manzini, and Giorgio Vinciguerra

Published in: LIPIcs, Volume 212, 32nd International Symposium on Algorithms and Computation (ISAAC 2021)


Abstract
We revisit the fundamental problem of compressing an integer dictionary that supports efficient rank and select operations by exploiting two kinds of regularities arising in real data: repetitiveness and approximate linearity. Our first contribution is a Lempel-Ziv parsing properly enriched to also capture approximate linearity in the data and still be compressed to the kth order entropy. Our second contribution is a variant of the block tree structure whose space complexity takes advantage of both repetitiveness and approximate linearity, and results highly competitive in time too. Our third and final contribution is an implementation and experimentation of this last data structure, which achieves new space-time trade-offs compared to known data structures that exploit only one of the two regularities.

Cite as

Paolo Ferragina, Giovanni Manzini, and Giorgio Vinciguerra. Repetition- and Linearity-Aware Rank/Select Dictionaries. In 32nd International Symposium on Algorithms and Computation (ISAAC 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 212, pp. 64:1-64:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


Copy BibTex To Clipboard

@InProceedings{ferragina_et_al:LIPIcs.ISAAC.2021.64,
  author =	{Ferragina, Paolo and Manzini, Giovanni and Vinciguerra, Giorgio},
  title =	{{Repetition- and Linearity-Aware Rank/Select Dictionaries}},
  booktitle =	{32nd International Symposium on Algorithms and Computation (ISAAC 2021)},
  pages =	{64:1--64:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-214-3},
  ISSN =	{1868-8969},
  year =	{2021},
  volume =	{212},
  editor =	{Ahn, Hee-Kap and Sadakane, Kunihiko},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.ISAAC.2021.64},
  URN =		{urn:nbn:de:0030-drops-154974},
  doi =		{10.4230/LIPIcs.ISAAC.2021.64},
  annote =	{Keywords: Data compression, Compressed data structures, Entropy}
}
  • Refine by Author
  • 2 Ferragina, Paolo
  • 2 Vinciguerra, Giorgio
  • 1 Lehmann, Hans-Peter
  • 1 Manzini, Giovanni
  • 1 Patel, Dhrumil
  • Show More...

  • Refine by Classification
  • 2 Theory of computation → Data compression
  • 2 Theory of computation → Pattern matching
  • 1 Information systems → Point lookups
  • 1 Theory of computation → Data structures design and analysis

  • Refine by Keyword
  • 1 Compressed data structures
  • 1 Data compression
  • 1 Entropy
  • 1 Pattern Matching
  • 1 String Matching
  • Show More...

  • Refine by Type
  • 4 document

  • Refine by Publication Year
  • 2 2021
  • 1 2022
  • 1 2023

Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail