FM-Index Reveals the Reverse Suffix Array

Authors Arnab Ganguly, Daniel Gibney, Sahar Hooshmand, M. Oğuzhan Külekci, Sharma V. Thankachan



PDF
Thumbnail PDF

File

LIPIcs.CPM.2020.13.pdf
  • Filesize: 0.85 MB
  • 14 pages

Document Identifiers

Author Details

Arnab Ganguly
  • Department of Computer Science, University of Wisconsin - Whitewater, WI, USA
Daniel Gibney
  • Department of Computer Science, University of Central Florida, Orlando, FL, USA
Sahar Hooshmand
  • Department of Computer Science, University of Central Florida, Orlando, FL, USA
M. Oğuzhan Külekci
  • Informatics Institute, Istanbul Technical University, Turkey
Sharma V. Thankachan
  • Department of Computer Science, University of Central Florida, Orlando, FL, USA

Cite AsGet BibTex

Arnab Ganguly, Daniel Gibney, Sahar Hooshmand, M. Oğuzhan Külekci, and Sharma V. Thankachan. FM-Index Reveals the Reverse Suffix Array. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 13:1-13:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)
https://doi.org/10.4230/LIPIcs.CPM.2020.13

Abstract

Given a text T[1,n] over an alphabet Σ of size σ, the suffix array of T stores the lexicographic order of the suffixes of T. The suffix array needs Θ(nlog n) bits of space compared to the n log σ bits needed to store T itself. A major breakthrough [FM - Index, FOCS'00] in the last two decades has been encoding the suffix array in near-optimal number of bits (≈ log σ bits per character). One can decode a suffix array value using the FM-Index in log^{O(1)} n time. We study an extension of the problem in which we have to also decode the suffix array values of the reverse text. This problem has numerous applications such as in approximate pattern matching [Lam et al., BIBM' 09]. Known approaches maintain the FM - Index of both the forward and the reverse text which drives up the space occupancy to 2nlog σ bits (plus lower order terms). This brings in the natural question of whether we can decode the suffix array values of both the forward and the reverse text, but by using nlog σ bits (plus lower order terms). We answer this question positively, and show that given the FM - Index of the forward text, we can decode the suffix array value of the reverse text in near logarithmic average time. Additionally, our experimental results are competitive when compared to the standard approach of maintaining the FM - Index for both the forward and the reverse text. We believe that applications that require both the forward and reverse text will benefit from our approach.

Subject Classification

ACM Subject Classification
  • Theory of computation → Data structures design and analysis
Keywords
  • Data Structures
  • Suffix Trees
  • String Algorithms
  • Compression
  • Burrows - Wheeler transform
  • FM-Index

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Srinivas Aluru. Handbook of Computational Molecular Biology. Chapman & Hall/CRC, 2005. Google Scholar
  2. Amihood Amir, Dmitry Keselman, Gad M. Landau, Moshe Lewenstein, Noa Lewenstein, and Michael Rodeh. Text indexing and dictionary matching with one error. Journal of Algorithms, 37(2):309-325, 2000. URL: https://doi.org/10.1006/jagm.2000.1104.
  3. Diego Arroyuelo, Gonzalo Navarro, and Kunihiko Sadakane. Stronger Lempel-Ziv based compressed text indexing. Algorithmica, 62(1-2):54-101, 2012. URL: https://doi.org/10.1007/s00453-010-9443-8.
  4. Djamal Belazzougui, Fabio Cunial, Juha Kärkkäinen, and Veli Mäkinen. Versatile succinct representations of the bidirectional burrows-wheeler transform. In Algorithms - ESA 2013 - 21st Annual European Symposium, pages 133-144, 2013. URL: https://doi.org/10.1007/978-3-642-40450-4_12.
  5. Djamal Belazzougui, Travis Gagie, Simon Gog, Giovanni Manzini, and Jouni Sirén. Relative FM-indexes. In String Processing and Information Retrieval - 21st International Symposium, SPIRE 2014, Ouro Preto, Brazil, October 20-22, 2014. Proceedings, pages 52-64, 2014. URL: https://doi.org/10.1007/978-3-319-11918-2_6.
  6. Alexander Bowe, Taku Onodera, Kunihiko Sadakane, and Tetsuo Shibuya. Succinct de Bruijn graphs. In Algorithms in Bioinformatics - 12th International Workshop, WABI 2012, Ljubljana, Slovenia, September 10-12, 2012. Proceedings, pages 225-235, 2012. URL: https://doi.org/10.1007/978-3-642-33122-0_18.
  7. M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Technical report, Digital Equipment Corporation (now part of Hewlett-Packard, Palo Alto, CA), 1994. Google Scholar
  8. Luc Devroye, Wojciech Szpankowski, and Bonita Rais. A note on the height of suffix trees. SIAM J. Comput., 21(1):48-53, 1992. URL: https://doi.org/10.1137/0221005.
  9. Huy Hoang Do, Jesper Jansson, Kunihiko Sadakane, and Wing-Kin Sung. Fast relative lempel-ziv self-index for similar sequences. Theor. Comput. Sci., 532:14-30, 2014. URL: https://doi.org/10.1016/j.tcs.2013.07.024.
  10. Martin Farach. Optimal suffix tree construction with large alphabets. In 38th Annual Symposium on Foundations of Computer Science, FOCS '97, Miami Beach, Florida, USA, October 19-22, 1997, pages 137-143, 1997. URL: https://doi.org/10.1109/SFCS.1997.646102.
  11. Paolo Ferragina, Fabrizio Luccio, Giovanni Manzini, and S. Muthukrishnan. Compressing and indexing labeled trees, with applications. Journal of the ACM, 57(1), 2009. An extended abstract appeared in FOCS 2005 under the title "Structuring labeled trees for optimal succinctness, and beyond". URL: https://doi.org/10.1145/1613676.1613680.
  12. Paolo Ferragina and Giovanni Manzini. Indexing compressed text. Journal of the ACM, 52(4):552-581, 2005. An extended abstract appeared in FOCS 2000 under the title "Opportunistic Data Structures with Applications". URL: https://doi.org/10.1145/1082036.1082039.
  13. Paolo Ferragina, Jouni Sirén, and Rossano Venturini. Distribution-aware compressed full-text indexes. Algorithmica, 67(4):529-546, 2013. Google Scholar
  14. Travis Gagie, Simon J. Puglisi, and Andrew Turpin. Range quantile queries: Another virtue of wavelet trees. In Proceedings of the 16th International Symposium on String Processing and Information Retrieval, SPIRE '09, pages 1-6, Berlin, Heidelberg, 2009. Springer-Verlag. URL: https://doi.org/10.1007/978-3-642-03784-9_1.
  15. Simon Gog, Timo Beller, Alistair Moffat, and Matthias Petri. From theory to practice: Plug and play with succinct data structures. In 13th International Symposium on Experimental Algorithms, (SEA 2014), pages 326-337, 2014. URL: https://doi.org/10.1007/978-3-319-07959-2_28.
  16. Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. High-order entropy-compressed text indexes. In Proceedings of the Fourteenth Annual Symposium on Discrete Algorithms ACM-SIAM, January 12-14, 2003, Baltimore, Maryland, USA., pages 841-850, 2003. Google Scholar
  17. Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput., 35(2):378-407, 2005. An extended abstract appeared in STOC 2000. URL: https://doi.org/10.1137/S0097539702402354.
  18. Dan Gusfield. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology. Cambridge University Press, 1997. Google Scholar
  19. Guy Joseph Jacobson. Succinct Static Data Structures. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, USA, 1988. AAI8918056. Google Scholar
  20. Tak Wah Lam, Ruiqiang Li, Alan Tam, Simon C. K. Wong, Edward Wu, and Siu-Ming Yiu. High throughput short read alignment via bi-directional BWT. In 2009 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2009, Washington, DC, USA, November 1-4, 2009, Proceedings, pages 31-36, 2009. URL: https://doi.org/10.1109/BIBM.2009.42.
  21. Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L Salzberg. Ultrafast and memory-efficient alignment of short dna sequences to the human genome. Genome Biology, 10(3):R25, 2009. Google Scholar
  22. Heng Li and Richard Durbin. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14):1754-1760, 2009. Google Scholar
  23. Heng Li and Nils Homer. A survey of sequence alignment algorithms for next-generation sequencing. Briefings in Bioinformatics, 11(5):473-483, 2010. Google Scholar
  24. Ruiqiang Li, Chang Yu, Yingrui Li, Tak-Wah Lam, Siu-Ming Yiu, Karsten Kristiansen, and Jun Wang. Soap2: An improved ultrafast tool for short read alignment. Bioinformatics, 25(15):1966-1967, 2009. Google Scholar
  25. Edward M. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM, 23(2):262-272, 1976. URL: https://doi.org/10.1145/321941.321946.
  26. Martin D. Muggli, Alexander Bowe, Noelle R. Noyes, Paul S. Morley, Keith E. Belk, Robert Raymond, Travis Gagie, Simon J. Puglisi, and Christina Boucher. Succinct colored de Bruijn graphs. Bioinformatics, 33(20):3181-3187, 2017. URL: https://doi.org/10.1093/bioinformatics/btx067.
  27. Gonzalo Navarro and Veli Mäkinen. Compressed full-text indexes. ACM Computing Surveys, 39(1), 2007. URL: https://doi.org/10.1145/1216370.1216372.
  28. Enno Ohlebusch, Timo Beller, and Mohamed I Abouelhoda. Computing the Burrows-Wheeler transform of a string and its reverse in parallel. Journal of Discrete Algorithms, 25:21-33, 2014. Google Scholar
  29. Alessio Orlandi and Rossano Venturini. Space-efficient substring occurrence estimation. In Proceedings of the 30th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2011, June 12-16, 2011, Athens, Greece, pages 95-106, 2011. URL: https://doi.org/10.1145/1989284.1989300.
  30. Thomas Schnattinger, Enno Ohlebusch, and Simon Gog. Bidirectional search in a string with wavelet trees and bidirectional matching statistics. Information and Computation, 213:13-22, 2012. URL: https://doi.org/10.1016/j.ic.2011.03.007.
  31. Wing-Kin Sung. Algorithms in Bioinformatics: A Practical Introduction. Chapman & Hall/CRC, 1st edition, 2009. Google Scholar
  32. Esko Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249-260, 1995. URL: https://doi.org/10.1007/BF01206331.
  33. Peter Weiner. Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory, Iowa City, Iowa, USA, October 15-17, 1973, pages 1-11, 1973. URL: https://doi.org/10.1109/SWAT.1973.13.
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail