2 Search Results for "Cobas, Dustin"

Document

DOI: 10.4230/LIPIcs.CPM.2021.13

A Fast and Small Subsampled R-Index

Authors: Dustin Cobas, Travis Gagie, and Gonzalo Navarro

Published in: LIPIcs, Volume 191, 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)

Abstract

The r-index (Gagie et al., JACM 2020) represented a breakthrough in compressed indexing of repetitive text collections, outperforming its alternatives by orders of magnitude. Its space usage, 𝒪(r) where r is the number of runs in the Burrows-Wheeler Transform of the text, is however larger than Lempel-Ziv and grammar-based indexes, and makes it uninteresting in various real-life scenarios of milder repetitiveness. In this paper we introduce the sr-index, a variant that limits a large fraction of the space to 𝒪(min(r,n/s)) for a text of length n and a given parameter s, at the expense of multiplying by s the time per occurrence reported. The sr-index is obtained by carefully subsampling the text positions indexed by the r-index, in a way that we prove is still able to support pattern matching with guaranteed performance. Our experiments demonstrate that the sr-index sharply outperforms virtually every other compressed index on repetitive texts, both in time and space, even matching the performance of the r-index while using 1.5-3.0 times less space. Only some Lempel-Ziv-based indexes achieve better compression than the sr-index, using about half the space, but they are an order of magnitude slower.

Cite as

Dustin Cobas, Travis Gagie, and Gonzalo Navarro. A Fast and Small Subsampled R-Index. In 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 191, pp. 13:1-13:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Copy BibTex To Clipboard

@InProceedings{cobas_et_al:LIPIcs.CPM.2021.13,
  author =	{Cobas, Dustin and Gagie, Travis and Navarro, Gonzalo},
  title =	{{A Fast and Small Subsampled R-Index}},
  booktitle =	{32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)},
  pages =	{13:1--13:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-186-3},
  ISSN =	{1868-8969},
  year =	{2021},
  volume =	{191},
  editor =	{Gawrychowski, Pawe{\l} and Starikovskaya, Tatiana},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2021.13},
  URN =		{urn:nbn:de:0030-drops-139647},
  doi =		{10.4230/LIPIcs.CPM.2021.13},
  annote =	{Keywords: Pattern matching, r-index, compressed text indexing, repetitive text collections}
}

Document

DOI: 10.4230/LIPIcs.SEA.2021.12

Document Retrieval Hacks

Authors: Simon J. Puglisi and Bella Zhukova

Published in: LIPIcs, Volume 190, 19th International Symposium on Experimental Algorithms (SEA 2021)

Abstract

Given a collection of strings, document listing refers to the problem of finding all the strings (or documents) where a given query string (or pattern) appears. Index data structures that support efficient document listing for string collections have been the focus of intense research in the last decade, with dozens of papers published describing exotic and elegant compressed data structures. The problem is now quite well understood in theory and many of the solutions have been implemented and evaluated experimentally. A particular recent focus has been on highly repetitive document collections, which have become prevalent in many areas (such as version control systems and genomics - to name just two very different sources). The aim of this paper is to describe simple and efficient document listing algorithms that can be used in combination with more sophisticated techniques, or as baselines against which the performance of new document listing indexes can be measured. Our approaches are based on simple combinations of scanning and hashing, which we show to combine very well with dictionary compression to achieve small space usage. Our experiments show these methods to be often much faster and less space consuming than the best specialized indexes for the problem.

Cite as

Simon J. Puglisi and Bella Zhukova. Document Retrieval Hacks. In 19th International Symposium on Experimental Algorithms (SEA 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 190, pp. 12:1-12:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Copy BibTex To Clipboard

@InProceedings{puglisi_et_al:LIPIcs.SEA.2021.12,
  author =	{Puglisi, Simon J. and Zhukova, Bella},
  title =	{{Document Retrieval Hacks}},
  booktitle =	{19th International Symposium on Experimental Algorithms (SEA 2021)},
  pages =	{12:1--12:12},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-185-6},
  ISSN =	{1868-8969},
  year =	{2021},
  volume =	{190},
  editor =	{Coudert, David and Natale, Emanuele},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2021.12},
  URN =		{urn:nbn:de:0030-drops-137848},
  doi =		{10.4230/LIPIcs.SEA.2021.12},
  annote =	{Keywords: String Processing, Pattern matching, Document listing, Document retrieval, Succinct data structures, Repetitive text collections}
}

Refine by Author
1 Cobas, Dustin
1 Gagie, Travis
1 Navarro, Gonzalo
1 Puglisi, Simon J.
1 Zhukova, Bella

Refine by Classification
1 Information systems → Data compression
1 Theory of computation → Pattern matching

Refine by Keyword
2 Pattern matching
1 Document listing
1 Document retrieval
1 Repetitive text collections
1 String Processing
Show More...

Refine by Type
2 document

Refine by Publication Year
2 2021

2 Search Results for "Cobas, Dustin"

A Fast and Small Subsampled R-Index

Abstract

Cite as

Document Retrieval Hacks

Abstract

Cite as

Thanks for your feedback!

Could not send message