Fast and Lightweight Distributed Suffix Array Construction

Haag, Manuel; Kurpicz, Florian; Sanders, Peter; Schimek, Matthias

doi:10.4230/LIPIcs.ESA.2025.47

Abstract

The suffix array contains the lexicographical order of all suffixes of a text. It is one of the most well-studied text indices with applications in bioinformatics, compression, and pattern matching. The main bottleneck of distributed-memory suffix array construction algorithms is their memory requirements. Even careful implementations require 30×-60× the input size as working memory. We present a scalable and lightweight distributed-memory adaptation of the difference cover (DCX) suffix array construction algorithm. Our approach relies on novel bucketing and random chunk redistribution techniques which reduce our memory requirement to 20×-26× the input size for medium-sized inputs and to 14×-15× for large-sized inputs. Regarding running time, we achieve speedups of up to 5× over current state-of-the-art distributed suffix array construction algorithms.
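The two notions the abstract builds on, the suffix array and the difference cover behind DCX, can be illustrated with a small sequential sketch. This is not the distributed algorithm presented in the paper: `suffix_array` is a naive construction by direct suffix comparison, and `is_difference_cover` only checks the defining property of a difference cover. Both function names and the example inputs are assumptions chosen for illustration.

```python
def suffix_array(text: str) -> list[int]:
    """Naive suffix array: starting positions of all suffixes of `text`,
    sorted by the lexicographic order of the suffixes.

    Illustration only; comparing whole suffixes is far slower than
    DCX-style construction and is not the method of the paper.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def is_difference_cover(cover: set[int], v: int) -> bool:
    """Check whether `cover` is a difference cover modulo `v`, i.e. whether
    every residue 0..v-1 can be written as (a - b) mod v with a, b in `cover`.
    """
    differences = {(a - b) % v for a in cover for b in cover}
    return differences == set(range(v))


if __name__ == "__main__":
    # Suffixes of "banana" in sorted order: a, ana, anana, banana, na, nana
    print(suffix_array("banana"))             # [5, 3, 1, 0, 4, 2]
    # {1, 2} mod 3 and {0, 1, 3} mod 7 are small difference covers
    print(is_difference_cover({1, 2}, 3))     # True
    print(is_difference_cover({0, 1, 3}, 7))  # True
```

In difference cover algorithms, only the suffixes starting at positions whose residue modulo v lies in the cover are sorted recursively. The covering property guarantees that any two suffixes reach a pair of such sample positions after the same small shift, so their order can be decided from a bounded number of characters plus the sample ranks.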
