Fast, Parallel, and Cache-Friendly Suffix Array Construction

Authors: Jamshed Khan, Tobias Rubel, Laxman Dhulipala, Erin Molloy, Rob Patro



File

LIPIcs.WABI.2023.16.pdf
  • Filesize: 0.95 MB
  • 21 pages

Author Details

Jamshed Khan
  • University of Maryland, College Park, MD, USA
Tobias Rubel
  • University of Maryland, College Park, MD, USA
Laxman Dhulipala
  • University of Maryland, College Park, MD, USA
Erin Molloy
  • University of Maryland, College Park, MD, USA
Rob Patro
  • University of Maryland, College Park, MD, USA

Cite As

Jamshed Khan, Tobias Rubel, Laxman Dhulipala, Erin Molloy, and Rob Patro. Fast, Parallel, and Cache-Friendly Suffix Array Construction. In 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 273, pp. 16:1-16:21, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2023)
https://doi.org/10.4230/LIPIcs.WABI.2023.16

Abstract

String indexes such as the suffix array (SA) and the closely related longest common prefix (LCP) array are fundamental objects in bioinformatics and have a wide variety of applications. Despite their importance in practice, few scalable parallel algorithms for constructing these are known, and the existing algorithms can be highly non-trivial to implement and parallelize. In this paper we present CaPS-SA, a simple and scalable parallel algorithm for constructing these string indexes inspired by samplesort. Due to its design, CaPS-SA has excellent memory-locality and thus incurs fewer cache misses and achieves strong performance on modern multicore systems with deep cache hierarchies. We show that despite its simple design, CaPS-SA outperforms existing state-of-the-art parallel SA and LCP-array construction algorithms on modern hardware. Finally, motivated by applications in modern aligners where the query strings have bounded lengths, we introduce the notion of a bounded-context SA and show that CaPS-SA can easily be extended to exploit this structure to obtain further speedups.
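To make the objects named in the abstract concrete, here is a minimal sketch of a suffix array, its LCP array, a bounded-context variant, and a generic samplesort-style construction. It is illustrative only and is not the authors' CaPS-SA implementation: every function name, the `context` and `buckets` parameters, the position-based tie-breaking for suffixes that agree on their first `context` characters, and the sequential bucket loop are assumptions made for this example. The construction is naive (it compares whole suffixes), so it is suitable only for small inputs.

```python
import random
from bisect import bisect_right


def suffix_array(text, context=None):
    """Sort suffix start positions lexicographically.

    If `context` is given, only the first `context` characters of each
    suffix are compared (a bounded-context order). Python's sort is stable,
    so tied suffixes stay in position order; that tie-break is an
    assumption of this sketch.
    """
    if context is None:
        return sorted(range(len(text)), key=lambda i: text[i:])
    return sorted(range(len(text)), key=lambda i: text[i:i + context])


def lcp_array(text, sa, context=None):
    """lcp[r] = longest common prefix length of the suffixes at sa[r-1] and
    sa[r], capped at `context` when one is given; lcp[0] is 0."""
    lcp = [0] * len(sa)
    for r in range(1, len(sa)):
        a, b, n = sa[r - 1], sa[r], 0
        while (a + n < len(text) and b + n < len(text)
               and text[a + n] == text[b + n]
               and (context is None or n < context)):
            n += 1
        lcp[r] = n
    return lcp


def samplesort_suffix_array(text, buckets=4, seed=0):
    """Generic samplesort over suffixes (hypothetical helper, sequential).

    1. Sample a few suffixes and sort them to obtain pivots.
    2. Route every suffix to the bucket delimited by its neighboring pivots.
    3. Sort each bucket on its own and concatenate the results.
    The buckets are independent of one another, which is what makes this
    style of algorithm a natural fit for parallel execution.
    """
    n = len(text)
    if n == 0:
        return []
    rng = random.Random(seed)
    sample_size = max(0, min(n, buckets - 1))
    pivots = sorted(text[i:] for i in rng.sample(range(n), sample_size))
    parts = [[] for _ in range(len(pivots) + 1)]
    for i in range(n):
        parts[bisect_right(pivots, text[i:])].append(i)
    sa = []
    for part in parts:
        sa.extend(sorted(part, key=lambda i: text[i:]))
    return sa


if __name__ == "__main__":
    text = "banana"
    sa = suffix_array(text)
    print(sa, lcp_array(text, sa))
    bounded = suffix_array(text, context=2)
    print(bounded, lcp_array(text, bounded, context=2))
    print(samplesort_suffix_array(text) == sa)
```

With `context=2`, the suffixes `ana` and `anana` compare equal because both begin with `an`, so the sort does less work per comparison. This is the intuition behind exploiting bounded query lengths, although the paper's actual algorithm and data layout differ from this sketch.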

Subject Classification

ACM Subject Classification
  • Theory of computation → Sorting and searching
Keywords
  • Suffix Array
  • Longest Common Prefix
  • Data Structures
  • Indexing
  • Parallel Algorithms
