Hierarchical Relative Lempel-Ziv Compression

Authors Philip Bille , Inge Li Gørtz , Simon J. Puglisi , Simon R. Tarnow



PDF
Thumbnail PDF

File

LIPIcs.SEA.2023.18.pdf
  • Filesize: 0.68 MB
  • 16 pages

Document Identifiers

Author Details

Philip Bille
  • Department of Computer Science and Applied Mathematics, Technical University of Denmark, Lyngby, Denmark
Inge Li Gørtz
  • DTU Compute, Technical University of Denmark, Lyngby, Denmark
Simon J. Puglisi
  • Helsinki Institute for Information Technology (HIIT), Finland
  • Department of Computer Science, University of Helsinki, Finland
Simon R. Tarnow
  • DTU Compute, Technical University of Denmark, Lyngby, Denmark

Cite AsGet BibTex

Philip Bille, Inge Li Gørtz, Simon J. Puglisi, and Simon R. Tarnow. Hierarchical Relative Lempel-Ziv Compression. In 21st International Symposium on Experimental Algorithms (SEA 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 265, pp. 18:1-18:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)
https://doi.org/10.4230/LIPIcs.SEA.2023.18

Abstract

Relative Lempel-Ziv (RLZ) parsing is a dictionary compression method in which a string S is compressed relative to a second string R (called the reference) by parsing S into a sequence of substrings that occur in R. RLZ is particularly effective at compressing sets of strings that have a high degree of similarity to the reference string, such as a set of genomes of individuals from the same species. With the now cheap cost of DNA sequencing, such datasets have become extremely abundant and are rapidly growing. In this paper, instead of using a single reference string for the entire collection, we investigate the use of different reference strings for subsets of the collection, with the aim of improving compression. In particular, we propose a new compression scheme hierarchical relative Lempel-Ziv (HRLZ) which form a rooted tree (or hierarchy) on the strings and then compress each string using RLZ with parent as reference, storing only the root of the tree in plain text. To decompress, we traverse the tree in BFS order starting at the root, decompressing children with respect to their parent. We show that this approach leads to a twofold improvement in compression on bacterial genome datasets, with negligible effect on decompression time compared to the standard single reference approach. We show that an effective hierarchy for a given set of strings can be constructed by computing the optimal arborescence of a completed weighted digraph of the strings, with weights as the number of phrases in the RLZ parsing of the source and destination vertices. We further show that instead of computing the complete graph, a sparse graph derived using locality-sensitive hashing can significantly reduce the cost of computing a good hierarchy, without adversely effecting compression performance.

Subject Classification

ACM Subject Classification
  • Theory of computation → Design and analysis of algorithms
Keywords
  • Relative compression
  • Lempel-Ziv compression
  • RLZ
  • LZ77
  • string collections
  • compressed representation
  • data structures
  • efficient algorithms

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Coronavirus genomes – NCBI datasets. Accessed 18/05/2022, URL: https://www.ncbi.nlm.nih.gov/datasets/coronavirus/genomes/.
  2. Philip Bille, Anders Roy Christiansen, Patrick Hagge Cording, Inge Li Gørtz, Frederik Rye Skjoldjensen, Hjalte Wedel Vildhøj, and Søren Vind. Dynamic relative compression, dynamic partial sums, and substring concatenation. Algorithmica, 80(11):3207-3224, 2018. Google Scholar
  3. Philip Bille and Inge Li Gørtz. Random access in persistent strings. In Proc. 31st ISAAC, 2020. Google Scholar
  4. P. M. Camerini, L. Fratta, and F. Maffioli. A note on finding optimum branchings. Networks, 9(4):309-312, 1979. URL: https://doi.org/10.1002/net.3230090403.
  5. Sebastian Deorowicz, Agnieszka Danek, and Szymon Grabowski. Genome compression: a novel approach for large collections. Bioinformatics, 29(20):2572-2578, 2013. Google Scholar
  6. Sebastian Deorowicz and Szymon Grabowski. Robust relative compression of genomes with random access. Bioinformatics, 27(21):2979-2986, 2011. Google Scholar
  7. Martin Dietzfelbinger, Torben Hagerup, Jyrki Katajainen, and Martti Penttonen. A reliable randomized algorithm for the closest-pair problem. J. Algorithms, 25(1):19-51, 1997. Google Scholar
  8. Huy Hoang Do, Jesper Jansson, Kunihiko Sadakane, and Wing-Kin Sung. Fast relative lempel-ziv self-index for similar sequences. Theor. Comput. Sci., 532:14-30, 2014. Google Scholar
  9. Andrea Farruggia, Travis Gagie, Gonzalo Navarro, Simon J. Puglisi, and Jouni Sirén. Relative suffix trees. Comput. J., 61(5):773-788, 2018. Google Scholar
  10. Paolo Ferragina and Giovanni Manzini. On compressing the textual web. In Proc. 3rd WSDM, pages 391-400, 2010. Google Scholar
  11. Paolo Ferragina, Igor Nitto, and Rossano Venturini. On the bit-complexity of Lempel-Ziv compression. SIAM J. Comput., 42(4):1521-1541, 2013. Google Scholar
  12. Michael L Fredman, Robert Sedgewick, Daniel D Sleator, and Robert E Tarjan. The pairing heap: A new form of self-adjusting heap. Algorithmica, 1(1):111-129, 1986. Google Scholar
  13. Travis Gagie, Pawel Gawrychowski, Juha Kärkkäinen, Yakov Nekrich, and Simon J. Puglisi. A faster grammar-based self-index. In Proc. 6th LATA, pages 240-251, 2012. Google Scholar
  14. Christopher Hoobin, Simon J. Puglisi, and Justin Zobel. Relative Lempel-Ziv factorization for efficient storage and retrieval of web collections. Proc. VLDB Endowment, 5(3):265-273, 2011. Google Scholar
  15. Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi. Lempel-Ziv parsing in external memory. In Proc. 24th DCC, pages 153-162, 2014. Google Scholar
  16. Dominik Kempa and Ben Langmead. Fast and space-efficient construction of AVL grammars from the LZ77 parsing. In Proc. 29th ESA, pages 56:1-56:14, 2021. Google Scholar
  17. Dmitry Kosolobov, Daniel Valenzuela, Gonzalo Navarro, and Simon J. Puglisi. Lempel-ziv-like parsing in small space. Algorithmica, 82(11):3195-3215, 2020. Google Scholar
  18. Shanika Kuruppu, Simon J. Puglisi, and Justin Zobel. Relative Lempel-Ziv compression of genomes for large-scale storage and retrieval. In Proc. 17th SPIRE, pages 201-206, 2010. Google Scholar
  19. Daniel H. Larkin, Siddhartha Sen, and Robert Endre Tarjan. A back-to-basics empirical study of priority queues. In Proc. 16th Workshop on Algorithm Engineering and Experiments (ALENEX), pages 61-72. SIAM, 2014. Google Scholar
  20. Kewen Liao, Matthias Petri, Alistair Moffat, and Anthony Wirth. Effective construction of relative lempel-ziv dictionaries. In Proc. 25th WWW, pages 807-816, 2016. Google Scholar
  21. Tommi Mäklin, Teemu Kallonen, Jarno Alanko, Ørjan Samuelsen, Kristin Hegstad, Veli Mäkinen, Jukka Corander, Eva Heinz, and Antti Honkela. Bacterial genomic epidemiology with mixed samples. Microbial Genomics, 7(11), 2021. Google Scholar
  22. Taher Mun, Alan Kuhnle, Christina Boucher, Travis Gagie, Ben Langmead, and Giovanni Manzini. Matching reads to many genomes with the r-index. J. Comput. Biol., 27(4):514-518, 2020. Google Scholar
  23. Gonzalo Navarro and Victor Sepulveda. Practical indexing of repetitive collections using relative Lempel-Ziv. In Proc. 29th DCC, pages 201-210, 2019. Google Scholar
  24. Gonzalo Navarro, Victor Sepulveda, Mauricio Marín, and Senén González. Compressed filesystem for managing large genome collections. Bioinformatics, 35(20):4120-4128, 2019. Google Scholar
  25. Zan Ouyang, Nasir Memon, Torsten Suel, and Dimitre Trendafilov. Cluster-based delta compression of a collection of files. In Proc. 3rd WISE, pages 257-266, 2002. Google Scholar
  26. Matthias Petri, Alistair Moffat, P. C. Nagesh, and Anthony Wirth. Access time tradeoffs in archive compression. In Proc. 11th AIRS, pages 15-28, 2015. Google Scholar
  27. Simon J. Puglisi and Bella Zhukova. Relative Lempel-Ziv compression of suffix arrays. In Proc. SPIRE, LNCS 12303, pages 89-96. Springer, 2020. Google Scholar
  28. Simon J. Puglisi and Bella Zhukova. Document retrieval hacks. In Proc. 19th SEA, pages 12:1-12:12, 2021. Google Scholar
  29. Simon J. Puglisi and Bella Zhukova. Smaller RLZ-compressed suffix arrays. In Proc. 31st DCC, 2021. Google Scholar
  30. E.L. Stevens et al. The public health impact of a publically available, environmental database of microbial genomes. Front. Microbiol., 8(808), 2017. Google Scholar
  31. James A. Storer and Thomas G. Szymanski. Data compression via textual substitution. J. ACM, 29(4):928-951, 1982. Google Scholar
  32. R. E. Tarjan. Finding optimum branchings. Networks, 7(1):25-35, 1977. URL: https://doi.org/10.1002/net.3230070103.
  33. The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature, 526:68-74, 2015. Google Scholar
  34. Jiancong Tong, Anthony Wirth, and Justin Zobel. Compact auxiliary dictionaries for incremental compression of large repositories. In Proc. 23rd CIKM, pages 1629-1638, 2014. Google Scholar
  35. Jiancong Tong, Anthony Wirth, and Justin Zobel. Principled dictionary pruning for low-memory corpus compression. In Proc. 37th SIGIR, pages 283-292, 2014. Google Scholar
  36. Daniel Valenzuela, Tuukka Norri, Niko Välimäki, Esa Pitkänen, and Veli Mäkinen. Towards pan-genome read alignment to improve variation calling. BMC Genom., 19(S2), 2018. Google Scholar
  37. John William Joseph Williams. Algorithm 232: heapsort. Commun. ACM, 7:347-348, 1964. Google Scholar