License: Creative Commons Attribution 3.0 Unported license (CC-BY 3.0)
When quoting this document, please refer to the following
DOI: 10.4230/LIPIcs.SEA.2020.3
URN: urn:nbn:de:0030-drops-120772
URL: https://drops.dagstuhl.de/opus/volltexte/2020/12077/


Prezza, Nicola

Indexing Compressed Text: A Tale of Time and Space (Invited Talk)


Abstract

Text indexing is a classical algorithmic problem that has been studied for over four decades. The earliest optimal-time solution to the problem, the suffix tree [Weiner, 1973], requires up to two orders of magnitude more space than the text itself. In the year 2000, two breakthrough works [Grossi and Vitter, 2000; Ferragina and Manzini, 2000] showed that this space overhead is not necessary: both the index and the text can be stored in space proportional to the text’s entropy. These contributions had an enormous impact in bioinformatics: nowadays, the two most widely used DNA aligners employ compressed indexes [Li and Durbin, 2009; Langmead et al., 2009]. In recent years, it has become apparent that entropy has reached its limits: modern datasets (for example, collections of thousands of human genomes) are extremely large but very repetitive and, by its very definition, entropy cannot compress repetitive texts [Kreft and Navarro, 2013]. To overcome this problem, a new generation of indexes based on dictionary compressors (for example, LZ77 and run-length BWT) emerged [Kreft and Navarro, 2013; Gagie et al., 2020; Claude and Navarro, 2012], together with generalizations of the indexing problem to labeled graphs [Ferragina et al., 2009; Sirén et al., 2014; Gagie et al., 2017]. This talk is a short and friendly survey of the landmarks of this fascinating path, which took us from suffix trees to the most modern compressed indexes on labeled graphs.

BibTeX - Entry

@InProceedings{prezza:LIPIcs:2020:12077,
  author =	{Nicola Prezza},
  title =	{{Indexing Compressed Text: A Tale of Time and Space (Invited Talk)}},
  booktitle =	{18th International Symposium on Experimental Algorithms (SEA 2020)},
  pages =	{3:1--3:2},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-148-1},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{160},
  editor =	{Simone Faro and Domenico Cantone},
  publisher =	{Schloss Dagstuhl--Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/opus/volltexte/2020/12077},
  URN =		{urn:nbn:de:0030-drops-120772},
  doi =		{10.4230/LIPIcs.SEA.2020.3},
  annote =	{Keywords: Compressed Text Indexing}
}

Keywords: Compressed Text Indexing
Collection: 18th International Symposium on Experimental Algorithms (SEA 2020)
Issue Date: 2020
Date of publication: 12.06.2020

