Hierarchical Relative Lempel-Ziv Compression

Bille, Philip; Gørtz, Inge Li; Puglisi, Simon J.; Tarnow, Simon R.

doi:10.4230/LIPIcs.SEA.2023.18

Abstract

Relative Lempel-Ziv (RLZ) parsing is a dictionary compression method in which a string S is compressed relative to a second string R (called the reference) by parsing S into a sequence of substrings that occur in R. RLZ is particularly effective at compressing sets of strings that have a high degree of similarity to the reference string, such as a set of genomes of individuals from the same species. Now that DNA sequencing is cheap, such datasets have become extremely abundant and are growing rapidly. In this paper, instead of using a single reference string for the entire collection, we investigate the use of different reference strings for subsets of the collection, with the aim of improving compression. In particular, we propose a new compression scheme, hierarchical relative Lempel-Ziv (HRLZ), which forms a rooted tree (or hierarchy) on the strings and then compresses each string using RLZ with its parent as the reference, storing only the root of the tree in plain text. To decompress, we traverse the tree in BFS order starting at the root, decompressing children with respect to their parent. We show that this approach leads to a twofold improvement in compression on bacterial genome datasets, with negligible effect on decompression time compared to the standard single-reference approach. We show that an effective hierarchy for a given set of strings can be constructed by computing the optimal arborescence of a complete weighted digraph of the strings, with each edge weighted by the number of phrases in the RLZ parsing of its destination string relative to its source string. We further show that, instead of computing the complete graph, a sparse graph derived using locality-sensitive hashing can significantly reduce the cost of computing a good hierarchy without adversely affecting compression performance.
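
The scheme described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the (position, length) phrase encoding, the literal fallback for characters absent from the reference, and the naive quadratic longest-match search are all assumptions made for illustration. The hierarchy is taken as given (a child-to-parent map) rather than computed as an optimal arborescence.

```python
from collections import deque


def rlz_parse(s, ref):
    """Greedy RLZ parse of s relative to ref (naive search, illustration only).

    A phrase (pos, length) with length > 0 copies ref[pos:pos + length].
    A phrase (char, 0) is a literal for a character that does not occur in ref
    (an assumed fallback, not taken from the paper).
    """
    phrases = []
    i = 0
    while i < len(s):
        best_pos, best_len = -1, 0
        for j in range(len(ref)):
            k = 0
            while i + k < len(s) and j + k < len(ref) and s[i + k] == ref[j + k]:
                k += 1
            if k > best_len:
                best_pos, best_len = j, k
        if best_len == 0:
            phrases.append((s[i], 0))
            i += 1
        else:
            phrases.append((best_pos, best_len))
            i += best_len
    return phrases


def rlz_decode(phrases, ref):
    """Rebuild a string from its phrases and the reference it was parsed against."""
    parts = []
    for first, length in phrases:
        parts.append(first if length == 0 else ref[first:first + length])
    return "".join(parts)


def hrlz_compress(strings, parent):
    """Parse every non-root string relative to its parent in the hierarchy.

    strings maps a name to its text; parent maps each non-root name to the
    name of its parent. The root is the only string kept in plain text.
    """
    return {name: rlz_parse(strings[name], strings[p]) for name, p in parent.items()}


def hrlz_decompress(root, root_text, parent, parsed):
    """Decode all strings by a BFS from the root, each child relative to its parent."""
    children = {}
    for child, p in parent.items():
        children.setdefault(p, []).append(child)
    decoded = {root: root_text}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            decoded[child] = rlz_decode(parsed[child], decoded[node])
            queue.append(child)
    return decoded


# Toy hierarchy: "a" is parsed against the root "r", and "b" against "a".
strings = {"r": "ACGTACGT", "a": "ACGTTCGT", "b": "ACGTTCGTA"}
parent = {"a": "r", "b": "a"}
parsed = hrlz_compress(strings, parent)
restored = hrlz_decompress("r", strings["r"], parent, parsed)
```

In this sketch, `len(rlz_parse(v, u))` is the phrase count that would serve as the weight of an edge from u to v when building the weighted digraph. A practical implementation would replace the quadratic search with an index over the reference, such as a suffix array.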
