On the Complexity of BWT-Runs Minimization via Alphabet Reordering

Authors Jason W. Bentley, Daniel Gibney, Sharma V. Thankachan



File

LIPIcs.ESA.2020.15.pdf
  • Filesize: 0.52 MB
  • 13 pages

Author Details

Jason W. Bentley
  • Department of Mathematics, University of Central Florida, Orlando, FL, USA
Daniel Gibney
  • Department of Computer Science, University of Central Florida, Orlando, FL, USA
Sharma V. Thankachan
  • Department of Computer Science, University of Central Florida, Orlando, FL, USA

Acknowledgements

We would like to thank the reviewers for their valuable feedback and Chandra Chekuri for his helpful correspondence.

Cite As

Jason W. Bentley, Daniel Gibney, and Sharma V. Thankachan. On the Complexity of BWT-Runs Minimization via Alphabet Reordering. In 28th Annual European Symposium on Algorithms (ESA 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 173, pp. 15:1-15:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)
https://doi.org/10.4230/LIPIcs.ESA.2020.15

Abstract

The Burrows-Wheeler Transform (BWT) has been an essential tool in text compression and indexing. First introduced in 1994, it went on to provide the backbone for the first encoding of the classic suffix tree data structure in space close to the entropy-based lower bound. Within the last decade, its role has been further enhanced by the development of compact suffix trees in space proportional to r, the number of runs in the BWT. While r would superficially appear to be only a measure of space complexity, it appears increasingly often in the time complexity of new algorithms as well. This makes obtaining the smallest possible value of r increasingly important. Interestingly, unlike other popular measures of compression, the parameter r is sensitive to the lexicographic ordering given to the text's alphabet. Despite several past attempts to exploit this fact, a provably efficient algorithm for finding, or approximating, an alphabet ordering that minimizes r has remained an open problem for years. We help to explain this lack of progress by presenting the first set of results on the computational complexity of minimizing BWT-runs via alphabet reordering. We prove that the decision version of this problem is NP-complete and cannot be solved in time poly(n)⋅2^{o(σ)} unless the Exponential Time Hypothesis fails, where σ is the size of the alphabet and n is the length of the text. Moreover, we show that the optimization variant is APX-hard. In doing so, we relate two previously disparate topics: the optimal traveling salesperson path of a graph and the number of runs in the BWT of a text. In addition, by relating recent results from the field of dictionary compression, we illustrate that an arbitrary alphabet ordering provides an O(log² n)-approximation. Lastly, we provide an optimal linear-time algorithm for the more restricted problem of finding an optimal ordering on a subset of symbols (each occurring only once) under ordering constraints.
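To make the dependence of r on the alphabet ordering concrete, the following is a minimal sketch and not the authors' method. It builds the BWT by naively sorting rotations, counts runs, and tries all σ! orderings by brute force, so it is only usable for tiny alphabets. The function names (`bwt`, `count_runs`, `best_order`) are hypothetical, and the sketch assumes an end-of-text sentinel `$` that is not part of the text and is always kept as the smallest symbol.

```python
from itertools import permutations


def bwt(text, order):
    """BWT of text + '$' under the alphabet ordering `order` (naive rotation sort).

    Assumption: '$' does not occur in `text` and is always the smallest symbol.
    """
    rank = {c: i + 1 for i, c in enumerate(order)}  # rank 0 is reserved for '$'
    s = [rank[c] for c in text] + [0]
    n = len(s)
    # Sort rotation start positions by comparing the rotated rank sequences.
    starts = sorted(range(n), key=lambda i: s[i:] + s[:i])
    symbols = ["$"] + list(order)
    # The last symbol of the rotation starting at i is s[i - 1].
    return "".join(symbols[s[i - 1]] for i in starts)


def count_runs(s):
    """Number of maximal runs of equal symbols, i.e. the parameter r."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


def best_order(text):
    """Brute-force search over all orderings; exponential in the alphabet size."""
    alphabet = sorted(set(text))
    return min(
        ("".join(p) for p in permutations(alphabet)),
        key=lambda p: count_runs(bwt(text, p)),
    )


if __name__ == "__main__":
    text = "banana"
    natural = bwt(text, "abn")
    print(natural, count_runs(natural))  # annb$aa 5
    order = best_order(text)
    print(order, count_runs(bwt(text, order)))
```

The exhaustive search is a deliberate simplification: the hardness results stated in the abstract suggest that, in general, one should not expect an approach that scales well in σ.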

Subject Classification

ACM Subject Classification
  • Theory of computation → Problems, reductions and completeness
Keywords
  • BWT
  • NP-hardness
  • APX-hardness
