Engineering External Memory LCP Array Construction: Parallel, In-Place and Large Alphabet

Authors Juha Kärkkäinen, Dominik Kempa



PDF
Thumbnail PDF

File

LIPIcs.SEA.2017.17.pdf
  • Filesize: 0.59 MB
  • 14 pages

Document Identifiers

Author Details

Juha Kärkkäinen
Dominik Kempa

Cite AsGet BibTex

Juha Kärkkäinen and Dominik Kempa. Engineering External Memory LCP Array Construction: Parallel, In-Place and Large Alphabet. In 16th International Symposium on Experimental Algorithms (SEA 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 75, pp. 17:1-17:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)
https://doi.org/10.4230/LIPIcs.SEA.2017.17

Abstract

The suffix array augmented with the LCP array is perhaps the most important data structure in modern string processing. There has been a lot of recent research activity on constructing these arrays in external memory. In this paper, we engineer the two fastest LCP array construction algorithms (ESA 2016) and improve them in three ways. First, we speed up the algorithms by up to a factor of two through parallelism. Just 8 threads is sufficient for making the algorithms essentially I/O bound. Second, we reduce the disk space usage of the algorithms making them in-place: The input (text and suffix array) is treated as read-only and the working disk space never exceeds the size of the final output (the LCP array). Third, we add support for large alphabets. All previous implementations assume the byte alphabet.
Keywords
  • LCP array
  • suffix array
  • external memory algorithms

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. M. I. Abouelhoda, S. Kurtz, and E. Ohlebusch. Replacing suffix trees with enhanced suffix arrays. J. Discrete Algorithms, 2(1):53-86, 2004. URL: http://dx.doi.org/10.1016/S1570-8667(03)00065-0.
  2. T. Bingmann, J. Fischer, and V. Osipov. Inducing suffix and LCP arrays in external memory. In Proceedings of the 15th Workshop on Algorithm Engineering and Experiments (ALENEX 2013), pages 88-102. SIAM, 2013. URL: http://dx.doi.org/10.1137/1.9781611972931.8.
  3. G. H. Gonnet, R. A. Baeza-Yates, and T. Snider. New indices for text: Pat trees and Pat arrays. In W. B. Frakes and R. Baeza-Yates, editors, Information Retrieval: Data Structures & Algorithms, pages 66-82. Prentice-Hall, 1992. Google Scholar
  4. J. Kärkkäinen and D. Kempa. Faster external memory LCP array construction. In Proceedings of the 24th Annual European Symposium on Algorithms (ESA 2016), volume 57 of LIPIcs, pages 61:1-61:16. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2016. URL: http://dx.doi.org/10.4230/LIPIcs.ESA.2016.61.
  5. J. Kärkkäinen and D. Kempa. LCP array construction in external memory. J. Exp. Algorithmics, 21(1):1.7:1-1.7:22, April 2016. URL: http://dx.doi.org/10.1145/2851491.
  6. J. Kärkkäinen and D. Kempa. LCP array construction using O(sort(n)) (or less) I/Os. In Proceedings of the 23rd International Symposium on String Processing and Information Retrieval (SPIRE 2016), volume 9954 of LNCS, pages 204-217. Springer, 2016. URL: http://dx.doi.org/10.1007/978-3-319-46049-9_20.
  7. J. Kärkkäinen, D. Kempa, and M. Pia̧tkowski. Tighter bounds for the sum of irreducible LCP values. In Proceedings of the 26th Annual Symposium on Combinatorial Pattern Matching (CPM 2015), volume 9133 of LNCS, pages 316-328. Springer, 2015. URL: http://dx.doi.org/10.1007/978-3-319-19929-0_27.
  8. J. Kärkkäinen, D. Kempa, S. J. Puglisi, and B. Zhukova. Engineering external memory induced suffix sorting. In Proceedings of the 19th Workshop on Algorithm Engineering and Experiments (ALENEX 2017), pages 98-108. SIAM, 2017. URL: http://dx.doi.org/10.1137/1.9781611974768.8.
  9. J. Kärkkäinen, G. Manzini, and S. J. Puglisi. Permuted longest-common-prefix array. In Proceedings of the 20th Annual Symposium on Combinatorial Pattern Matching (CPM 2009), volume 5577 of LNCS, pages 181-192. Springer, 2009. URL: http://dx.doi.org/10.1007/978-3-642-02441-2_17.
  10. T. Kasai, G. Lee, H. Arimura, S. Arikawa, and K. Park. Linear-time longest-common-prefix computation in suffix arrays and its applications. In Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching (CPM 2001), volume 2089 of LNCS, pages 181-192. Springer, 2001. URL: http://dx.doi.org/10.1007/3-540-48194-X_17.
  11. V. Mäkinen, D. Belazzougui, F. Cunial, and A. I. Tomescu. Genome-Scale Algorithm Design: Biological Sequence Analysis in the Era of High-Throughput Sequencing. Cambridge University Press, 2015. Google Scholar
  12. U. Manber and G. W. Myers. Suffix arrays: A new method for on-line string searches. SIAM J. Comput., 22(5):935-948, 1993. URL: http://dx.doi.org/10.1137/0222058.
  13. G. Navarro and V. Mäkinen. Compressed full-text indexes. ACM Comput. Surv., 39(1):article 2, 2007. URL: http://dx.doi.org/10.1145/1216370.1216372.
  14. E. Ohlebusch. Bioinformatics Algorithms: Sequence Analysis, Genome Rearrangements, and Phylogenetic Reconstruction. Oldenbusch Verlag, 2013. Google Scholar
  15. K. Sadakane. Succinct representations of lcp information and improvements in the compressed suffix arrays. In Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2002), pages 225-232. ACM/SIAM, 2002. Google Scholar
  16. G. Tischler. Low space external memory construction of the succinct permuted longest common prefix array. In Proceedings of the 23rd International Symposium on String Processing and Information Retrieval (SPIRE 2016), volume 9954 of LNCS, pages 178-190. Springer, 2016. URL: http://dx.doi.org/10.1007/978-3-319-46049-9_18.
  17. H. E. Williams and J. Zobel. Compressing integers for fast file access. Comput. J., 42(3):193-201, 1999. URL: http://dx.doi.org/10.1093/comjnl/42.3.193.
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail