Fast, Parallel, and Cache-Friendly Suffix Array Construction

Khan, Jamshed; Rubel, Tobias; Dhulipala, Laxman; Molloy, Erin; Patro, Rob

doi:10.4230/LIPIcs.WABI.2023.16

Abstract

String indexes such as the suffix array (SA) and the closely related longest common prefix (LCP) array are fundamental objects in bioinformatics and have a wide variety of applications. Despite their importance in practice, few scalable parallel algorithms for constructing these are known, and the existing algorithms can be highly non-trivial to implement and parallelize. In this paper we present CaPS-SA, a simple and scalable parallel algorithm for constructing these string indexes inspired by samplesort. Due to its design, CaPS-SA has excellent memory-locality and thus incurs fewer cache misses and achieves strong performance on modern multicore systems with deep cache hierarchies. We show that despite its simple design, CaPS-SA outperforms existing state-of-the-art parallel SA and LCP-array construction algorithms on modern hardware. Finally, motivated by applications in modern aligners where the query strings have bounded lengths, we introduce the notion of a bounded-context SA and show that CaPS-SA can easily be extended to exploit this structure to obtain further speedups.
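As a point of reference for the objects named in the abstract, the following is a minimal sketch of what a suffix array, an LCP array, and a bounded-context variant contain. It is a naive comparison-based construction for small inputs, not the CaPS-SA algorithm. The function names, the `context` parameter, and the choice to break ties by text position in the bounded case are assumptions made for illustration, not definitions taken from the paper.

```python
from typing import List, Optional


def suffix_array(text: str, context: Optional[int] = None) -> List[int]:
    """Return suffix start positions in lexicographic order.

    If `context` is given, only the first `context` characters of each
    suffix are compared (assumed bounded-context ordering); ties are
    broken by start position, which is an illustrative choice.
    """
    def key(i: int):
        end = len(text) if context is None else i + context
        return (text[i:end], i)

    return sorted(range(len(text)), key=key)


def lcp_array(text: str, sa: List[int], context: Optional[int] = None) -> List[int]:
    """Return LCP lengths between suffixes adjacent in `sa`.

    lcp[0] is 0 by convention; with `context` set, lengths are capped at it.
    """
    n = len(text)
    lcp = [0] * len(sa)
    for r in range(1, len(sa)):
        a, b = sa[r - 1], sa[r]
        limit = min(n - a, n - b)
        if context is not None:
            limit = min(limit, context)
        k = 0
        while k < limit and text[a + k] == text[b + k]:
            k += 1
        lcp[r] = k
    return lcp


if __name__ == "__main__":
    t = "banana"
    sa = suffix_array(t)
    print(sa, lcp_array(t, sa))
    sa2 = suffix_array(t, context=2)
    print(sa2, lcp_array(t, sa2, context=2))
```

With a bounded context, suffixes that agree on their first `context` characters are treated as equal, so less comparison work is needed per pair. The abstract attributes its further speedups to exploiting this structure when query lengths are bounded.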

