FM-Index Reveals the Reverse Suffix Array

Ganguly, Arnab; Gibney, Daniel; Hooshmand, Sahar; Külekci, M. Oğuzhan; Thankachan, Sharma V.

doi:10.4230/LIPIcs.CPM.2020.13

Abstract

Given a text T[1,n] over an alphabet Σ of size σ, the suffix array of T stores the lexicographic order of the suffixes of T. The suffix array needs Θ(n log n) bits of space, compared to the n log σ bits needed to store T itself. A major breakthrough [FM-Index, FOCS'00] in the last two decades has been encoding the suffix array in a near-optimal number of bits (≈ log σ bits per character). One can decode a suffix array value using the FM-Index in log^{O(1)} n time.
We study an extension of the problem in which we also have to decode the suffix array values of the reverse text. This problem has numerous applications, such as in approximate pattern matching [Lam et al., BIBM'09]. Known approaches maintain the FM-Index of both the forward and the reverse text, which drives up the space occupancy to 2n log σ bits (plus lower-order terms). This raises the natural question of whether we can decode the suffix array values of both the forward and the reverse text while using only n log σ bits (plus lower-order terms). We answer this question positively, and show that given the FM-Index of the forward text, we can decode the suffix array value of the reverse text in near-logarithmic average time. Additionally, our experimental results are competitive when compared to the standard approach of maintaining the FM-Index for both the forward and the reverse text. We believe that applications that require both the forward and the reverse text will benefit from our approach.
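
To make the objects in the abstract concrete, the following is a minimal Python sketch of the standard baseline it refers to: one FM-Index-style structure for the forward text and a second one for the reverse text. It is not the paper's algorithm for decoding the reverse suffix array from the forward index. The function names, the `sample_rate` parameter, the `$` terminator, the 0-based positions, and the naive construction and rank counting are all assumptions chosen for brevity.

```python
def suffix_array(text):
    """Starting positions (0-based) of the suffixes of text, in lexicographic order."""
    # Naive sort of all suffixes; adequate for illustration only.
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_sa(text, sa):
    """Burrows-Wheeler transform: the character preceding each sorted suffix."""
    # For position 0, text[-1] wraps around to the terminator.
    return "".join(text[i - 1] for i in sa)


def build_index(text, sample_rate=3):
    """Return (bwt, samples), where samples maps a row to its suffix array value.

    Only rows whose suffix array value is a multiple of sample_rate are kept.
    Assumes text ends with a unique terminator smaller than every other character.
    """
    sa = suffix_array(text)
    samples = {row: pos for row, pos in enumerate(sa) if pos % sample_rate == 0}
    return bwt_from_sa(text, sa), samples


def lf(bwt, row):
    """LF-mapping: the row of the suffix that starts one position earlier in the text."""
    c = bwt[row]
    smaller = sum(1 for x in bwt if x < c)  # characters smaller than c
    rank = bwt[:row].count(c)               # occurrences of c before this row
    return smaller + rank


def decode_sa(bwt, samples, row):
    """Decode one suffix array value by walking LF until a sampled row is reached."""
    steps = 0
    while row not in samples:
        row = lf(bwt, row)
        steps += 1
    return samples[row] + steps


if __name__ == "__main__":
    forward = "banana$"
    reverse = forward[:-1][::-1] + "$"  # reverse the text, keep the terminator last
    for name, text in (("forward", forward), ("reverse", reverse)):
        bwt, samples = build_index(text)
        decoded = [decode_sa(bwt, samples, row) for row in range(len(text))]
        print(name, bwt, decoded)
```

The sketch keeps two separate indexes, which is exactly the doubling of space that the abstract aims to avoid; a real implementation would also replace the linear-time counting in `lf` with a rank structure such as a wavelet tree.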

References

Srinivas Aluru. Handbook of Computational Molecular Biology. Chapman & Hall/CRC, 2005.
Amihood Amir, Dmitry Keselman, Gad M. Landau, Moshe Lewenstein, Noa Lewenstein, and Michael Rodeh. Text indexing and dictionary matching with one error. Journal of Algorithms, 37(2):309-325, 2000. URL: https://doi.org/10.1006/jagm.2000.1104.
Diego Arroyuelo, Gonzalo Navarro, and Kunihiko Sadakane. Stronger Lempel-Ziv based compressed text indexing. Algorithmica, 62(1-2):54-101, 2012. URL: https://doi.org/10.1007/s00453-010-9443-8.
Djamal Belazzougui, Fabio Cunial, Juha Kärkkäinen, and Veli Mäkinen. Versatile succinct representations of the bidirectional Burrows-Wheeler transform. In Algorithms - ESA 2013 - 21st Annual European Symposium, pages 133-144, 2013. URL: https://doi.org/10.1007/978-3-642-40450-4_12.
Djamal Belazzougui, Travis Gagie, Simon Gog, Giovanni Manzini, and Jouni Sirén. Relative FM-indexes. In String Processing and Information Retrieval - 21st International Symposium, SPIRE 2014, Ouro Preto, Brazil, October 20-22, 2014. Proceedings, pages 52-64, 2014. URL: https://doi.org/10.1007/978-3-319-11918-2_6.
Alexander Bowe, Taku Onodera, Kunihiko Sadakane, and Tetsuo Shibuya. Succinct de Bruijn graphs. In Algorithms in Bioinformatics - 12th International Workshop, WABI 2012, Ljubljana, Slovenia, September 10-12, 2012. Proceedings, pages 225-235, 2012. URL: https://doi.org/10.1007/978-3-642-33122-0_18.
M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Technical report, Digital Equipment Corporation (now part of Hewlett-Packard, Palo Alto, CA), 1994.
Luc Devroye, Wojciech Szpankowski, and Bonita Rais. A note on the height of suffix trees. SIAM J. Comput., 21(1):48-53, 1992. URL: https://doi.org/10.1137/0221005.
Huy Hoang Do, Jesper Jansson, Kunihiko Sadakane, and Wing-Kin Sung. Fast relative Lempel-Ziv self-index for similar sequences. Theor. Comput. Sci., 532:14-30, 2014. URL: https://doi.org/10.1016/j.tcs.2013.07.024.
Martin Farach. Optimal suffix tree construction with large alphabets. In 38th Annual Symposium on Foundations of Computer Science, FOCS '97, Miami Beach, Florida, USA, October 19-22, 1997, pages 137-143, 1997. URL: https://doi.org/10.1109/SFCS.1997.646102.
Paolo Ferragina, Fabrizio Luccio, Giovanni Manzini, and S. Muthukrishnan. Compressing and indexing labeled trees, with applications. Journal of the ACM, 57(1), 2009. An extended abstract appeared in FOCS 2005 under the title "Structuring labeled trees for optimal succinctness, and beyond". URL: https://doi.org/10.1145/1613676.1613680.
Paolo Ferragina and Giovanni Manzini. Indexing compressed text. Journal of the ACM, 52(4):552-581, 2005. An extended abstract appeared in FOCS 2000 under the title "Opportunistic Data Structures with Applications". URL: https://doi.org/10.1145/1082036.1082039.
Paolo Ferragina, Jouni Sirén, and Rossano Venturini. Distribution-aware compressed full-text indexes. Algorithmica, 67(4):529-546, 2013.
Travis Gagie, Simon J. Puglisi, and Andrew Turpin. Range quantile queries: Another virtue of wavelet trees. In Proceedings of the 16th International Symposium on String Processing and Information Retrieval, SPIRE '09, pages 1-6, Berlin, Heidelberg, 2009. Springer-Verlag. URL: https://doi.org/10.1007/978-3-642-03784-9_1.
Simon Gog, Timo Beller, Alistair Moffat, and Matthias Petri. From theory to practice: Plug and play with succinct data structures. In 13th International Symposium on Experimental Algorithms, (SEA 2014), pages 326-337, 2014. URL: https://doi.org/10.1007/978-3-319-07959-2_28.
Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. High-order entropy-compressed text indexes. In Proceedings of the Fourteenth Annual Symposium on Discrete Algorithms ACM-SIAM, January 12-14, 2003, Baltimore, Maryland, USA, pages 841-850, 2003.
Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput., 35(2):378-407, 2005. An extended abstract appeared in STOC 2000. URL: https://doi.org/10.1137/S0097539702402354.
Dan Gusfield. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology. Cambridge University Press, 1997.
Guy Joseph Jacobson. Succinct Static Data Structures. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, USA, 1988. AAI8918056.
Tak Wah Lam, Ruiqiang Li, Alan Tam, Simon C. K. Wong, Edward Wu, and Siu-Ming Yiu. High throughput short read alignment via bi-directional BWT. In 2009 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2009, Washington, DC, USA, November 1-4, 2009, Proceedings, pages 31-36, 2009. URL: https://doi.org/10.1109/BIBM.2009.42.
Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L. Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10(3):R25, 2009.
Heng Li and Richard Durbin. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14):1754-1760, 2009.
Heng Li and Nils Homer. A survey of sequence alignment algorithms for next-generation sequencing. Briefings in Bioinformatics, 11(5):473-483, 2010.
Ruiqiang Li, Chang Yu, Yingrui Li, Tak-Wah Lam, Siu-Ming Yiu, Karsten Kristiansen, and Jun Wang. SOAP2: An improved ultrafast tool for short read alignment. Bioinformatics, 25(15):1966-1967, 2009.
Edward M. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM, 23(2):262-272, 1976. URL: https://doi.org/10.1145/321941.321946.
Martin D. Muggli, Alexander Bowe, Noelle R. Noyes, Paul S. Morley, Keith E. Belk, Robert Raymond, Travis Gagie, Simon J. Puglisi, and Christina Boucher. Succinct colored de Bruijn graphs. Bioinformatics, 33(20):3181-3187, 2017. URL: https://doi.org/10.1093/bioinformatics/btx067.
Gonzalo Navarro and Veli Mäkinen. Compressed full-text indexes. ACM Computing Surveys, 39(1), 2007. URL: https://doi.org/10.1145/1216370.1216372.
Enno Ohlebusch, Timo Beller, and Mohamed I Abouelhoda. Computing the Burrows-Wheeler transform of a string and its reverse in parallel. Journal of Discrete Algorithms, 25:21-33, 2014.
Alessio Orlandi and Rossano Venturini. Space-efficient substring occurrence estimation. In Proceedings of the 30th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2011, June 12-16, 2011, Athens, Greece, pages 95-106, 2011. URL: https://doi.org/10.1145/1989284.1989300.
Thomas Schnattinger, Enno Ohlebusch, and Simon Gog. Bidirectional search in a string with wavelet trees and bidirectional matching statistics. Information and Computation, 213:13-22, 2012. URL: https://doi.org/10.1016/j.ic.2011.03.007.
Wing-Kin Sung. Algorithms in Bioinformatics: A Practical Introduction. Chapman & Hall/CRC, 1st edition, 2009.
Esko Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249-260, 1995. URL: https://doi.org/10.1007/BF01206331.
Peter Weiner. Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory, Iowa City, Iowa, USA, October 15-17, 1973, pages 1-11, 1973. URL: https://doi.org/10.1109/SWAT.1973.13.
