Indexing Compressed Text: A Tale of Time and Space (Invited Talk)

Author Nicola Prezza



PDF
Thumbnail PDF

File

LIPIcs.SEA.2020.3.pdf
  • Filesize: 210 kB
  • 2 pages

Document Identifiers

Author Details

Nicola Prezza
  • LUISS Guido Carli, Rome, Italy

Cite AsGet BibTex

Nicola Prezza. Indexing Compressed Text: A Tale of Time and Space (Invited Talk). In 18th International Symposium on Experimental Algorithms (SEA 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 160, pp. 3:1-3:2, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)
https://doi.org/10.4230/LIPIcs.SEA.2020.3

Abstract

Text indexing is a classical algorithmic problem that has been studied for over four decades. The earliest optimal-time solution to the problem, the suffix tree [Weiner, 1973], dates back to 1973 and requires up to two orders of magnitude more space than the text to be stored. In the year 2000, two breakthrough works [Grossi and Vitter, 2000; Ferragina and Manzini, 2000] showed that this space overhead is not necessary: both the index and the text can be stored in a space proportional to the text’s entropy. These contributions had an enormous impact in bioinformatics: nowadays, the two most widely-used DNA aligners employ compressed indexes [Li and Durbin, 2009; Langmead et al., 2009]. In recent years, it became apparent that entropy had reached its limits: modern datasets (for example, collections of thousands of human genomes) are extremely large but very repetitive and, by its very definition, entropy cannot compress repetitive texts [S. Kreft and G. Navarro, 2013]. To overcome this problem, a new generation of indexes based on dictionary compressors (for example, LZ77 and run-length BWT) emerged [S. Kreft and G. Navarro, 2013; Gagie et al., 2020; F. Claude and G. Navarro, 2012], together with generalizations of the indexing problem to labeled graphs [Ferragina et al., 2009; Sirén et al., 2014; Travis Gagie et al., 2017]. This talk is a short and friendly survey of the landmarks of this fascinating path that took us from suffix trees to the most modern compressed indexes on labeled graphs.

Subject Classification

ACM Subject Classification
  • Theory of computation → Data compression
  • Theory of computation → Sorting and searching
  • Theory of computation → Pattern matching
Keywords
  • Compressed Text Indexing

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. F. Claude and G. Navarro. Improved grammar-based compressed indexes. In Proc. 19th International Symposium on String Processing and Information Retrieval (SPIRE), LNCS 7608, pages 180-192, 2012. Google Scholar
  2. Paolo Ferragina, Fabrizio Luccio, Giovanni Manzini, and S. Muthukrishnan. Compressing and indexing labeled trees, with applications. J. ACM, 57(1), November 2009. URL: https://doi.org/10.1145/1613676.1613680.
  3. Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In 41st Annual Symposium on Foundations of Computer Science, 2000., pages 390-398. IEEE, 2000. Google Scholar
  4. Travis Gagie, Giovanni Manzini, and Jouni Sirén. Wheeler graphs: A framework for BWT-based data structures. Theoretical Computer Science, 698:67-78, 2017. Algorithms, Strings and Theoretical Approaches in the Big Data Era (In Honor of the 60th Birthday of Professor Raffaele Giancarlo). URL: https://doi.org/10.1016/j.tcs.2017.06.016.
  5. Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully functional suffix trees and optimal text searching in bwt-runs bounded space. J. ACM, 67(1), January 2020. URL: https://doi.org/10.1145/3375890.
  6. Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching (extended abstract). In Proceedings of the Thirty-Second Annual ACM Symposium on Theory of Computing, STOC ’00, page 397–406, New York, NY, USA, 2000. Association for Computing Machinery. URL: https://doi.org/10.1145/335305.335351.
  7. S. Kreft and G. Navarro. On compressing and indexing repetitive sequences. Theoretical Computer Science, 483:115-133, 2013. Google Scholar
  8. Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L Salzberg. Ultrafast and memory-efficient alignment of short dna sequences to the human genome. Genome biology, 10(3):R25, 2009. Google Scholar
  9. Heng Li and Richard Durbin. Fast and accurate short read alignment with burrows-wheeler transform. bioinformatics, 25(14):1754-1760, 2009. Google Scholar
  10. Jouni Sirén, Niko Välimäki, and Veli Mäkinen. Indexing graphs for path queries with applications in genome research. IEEE/ACM Trans. Comput. Biol. Bioinformatics, 11(2):375–388, March 2014. URL: https://doi.org/10.1109/TCBB.2013.2297101.
  11. Peter Weiner. Linear pattern matching algorithms. In Switching and Automata Theory, 1973. SWAT'08. IEEE Conference Record of 14th Annual Symposium on, pages 1-11. IEEE, 1973. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail