From Relative Compression to Hierarchical Compression

Bille, Philip; Gørtz, Inge Li; Pérez-López, Máximo

doi:10.4230/LIPIcs.SEA.2026.7

Abstract

We introduce a framework that uses any relative compression algorithm as a subroutine for hierarchical relative compression. Given a dataset of n sequences, the framework constructs a rooted tree on the sequences, using hashing and similarity techniques, and compresses the children of each node relative to their parent. We build on previous techniques [Bille et al., 2023] and optimize them further for computational efficiency. We test our framework with three existing relative compression algorithms on six genomic datasets, and we show that on datasets containing heterogeneous data, hierarchical relative compression improves the compression ratio by a factor of 2 or more compared to relative compression against a single sequence. Apart from compression ratio, we also explore the trade-offs with respect to compression speed, dataset decompression speed, and average sequence decompression speed. With two of the surveyed algorithms, dataset decompression becomes faster and sequence decompression remains practical, at the cost of compression time, which remains competitive for the datasets with the highest variability.
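
The pipeline described in the abstract (sketch the sequences, build a rooted tree from similarities, compress each child relative to its parent) can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the authors' implementation: all function names are hypothetical, the first sequence is arbitrarily taken as the root, each later sequence greedily picks the most similar earlier sequence as its parent, similarity is a bottom-s MinHash estimate over k-mers, and the relative compressor is a naive greedy copy factorization standing in for "any relative compression algorithm".

```python
import hashlib

def minhash(seq, k=4, size=16):
    """Bottom-`size` sketch: the smallest hash values over all k-mers of seq."""
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    hashes = sorted(
        int.from_bytes(hashlib.blake2b(m.encode(), digest_size=8).digest(), "big")
        for m in kmers
    )
    return set(hashes[:size])

def similarity(a, b, size=16):
    """Bottom-s estimate of the Jaccard similarity of two k-mer sets."""
    merged = sorted(a | b)[:size]
    if not merged:
        return 0.0
    return sum(1 for h in merged if h in a and h in b) / len(merged)

def rel_compress(ref, target, min_len=2):
    """Naive greedy factorization of target into (pos, length) copies from ref
    or single literal characters. Quadratic time; for illustration only."""
    out, i = [], 0
    while i < len(target):
        best_pos, best_len = -1, 0
        for j in range(len(ref)):
            l = 0
            while i + l < len(target) and j + l < len(ref) and ref[j + l] == target[i + l]:
                l += 1
            if l > best_len:
                best_pos, best_len = j, l
        if best_len >= min_len:
            out.append((best_pos, best_len))
            i += best_len
        else:
            out.append(target[i])
            i += 1
    return out

def rel_decompress(ref, factors):
    """Invert rel_compress given the same reference."""
    return "".join(ref[f[0]:f[0] + f[1]] if isinstance(f, tuple) else f for f in factors)

def compress_dataset(seqs, k=4, size=16):
    """Assumption: seqs is non-empty and seqs[0] is stored verbatim as the root.
    Every other sequence is compressed relative to its most similar predecessor."""
    sketches = [minhash(s, k, size) for s in seqs]
    parent = [None] * len(seqs)
    for i in range(1, len(seqs)):
        parent[i] = max(range(i), key=lambda j: similarity(sketches[i], sketches[j], size))
    encoded = [seqs[0]] + [rel_compress(seqs[parent[i]], seqs[i]) for i in range(1, len(seqs))]
    return parent, encoded

def decompress_sequence(i, parent, encoded):
    """Recover sequence i by decompressing its ancestors from the root down."""
    if parent[i] is None:
        return encoded[i]
    return rel_decompress(decompress_sequence(parent[i], parent, encoded), encoded[i])

seqs = ["ACGTACGTTTGACA", "ACGTACGTTAGACA", "GGGCCCAATTGGCC", "GGGCCCAATTGGCA"]
parent, encoded = compress_dataset(seqs)
restored = [decompress_sequence(i, parent, encoded) for i in range(len(seqs))]
```

Because each parent index is smaller than its child's, the parent array always describes a rooted tree. The sketch also makes the trade-off mentioned in the abstract visible: recovering a single sequence requires decompressing every ancestor on its path to the root, so deeper trees can improve the compression ratio while slowing per-sequence access.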

References

Philip Bille, Anders Roy Christiansen, Patrick Hagge Cording, Inge Li Gørtz, Frederik Rye Skjoldjensen, Hjalte Wedel Vildhøj, and Søren Vind. Dynamic Relative Compression, Dynamic Partial Sums, and Substring Concatenation. Algorithmica, 80(11):3207-3224, 2018. URL: https://doi.org/10.1007/s00453-017-0380-7.
Philip Bille, Inge Li Gørtz, and Máximo Pérez-López. Hierarchical Relative Compression reference implementation. Software, version 1.0. Supported by Danish Research Council grant 10.46540/3105-00302B (visited on 2026-05-29). URL: https://gitlab.gbar.dtu.dk/hierarchical-relative-compression/hrcimplementation. Full metadata available at: https://doi.org/10.4230/artifacts.26213.
Philip Bille and Inge Li Gørtz. Random Access in Persistent Strings. In Yixin Cao, Siu-Wing Cheng, and Minming Li, editors, 31st International Symposium on Algorithms and Computation (ISAAC 2020), volume 181 of Leibniz International Proceedings in Informatics (LIPIcs), pages 48:1-48:16, Dagstuhl, Germany, 2020. Schloss Dagstuhl - Leibniz-Zentrum für Informatik. URL: https://doi.org/10.4230/LIPIcs.ISAAC.2020.48.
Philip Bille, Inge Li Gørtz, Simon J. Puglisi, and Simon R. Tarnow. Hierarchical Relative Lempel-Ziv Compression. In Loukas Georgiadis, editor, 21st International Symposium on Experimental Algorithms (SEA 2023), volume 265 of Leibniz International Proceedings in Informatics (LIPIcs), pages 18:1-18:16, Dagstuhl, Germany, 2023. Schloss Dagstuhl - Leibniz-Zentrum für Informatik. URL: https://doi.org/10.4230/LIPIcs.SEA.2023.18.
Grace A. Blackwell, Martin Hunt, Kerri M. Malone, Leandro Lima, Gal Horesh, Blaise T. F. Alako, Nicholas R. Thomson, and Zamin Iqbal. Exploring bacterial diversity via a curated and searchable snapshot of archived DNA sequences. PLOS Biology, 19(11):e3001421, 2021. URL: https://doi.org/10.1371/journal.pbio.3001421.
Grace A. Blackwell, Martin Hunt, Kerri M. Malone, Leandro Lima, Gal Horesh, Blaise T. F. Alako, Nicholas R. Thomson, and Zamin Iqbal. Index of /pub/databases/ENA2018-bacteria-661k/Assemblies, November 2021. URL: https://ftp.ebi.ac.uk/pub/databases/ENA2018-bacteria-661k/Assemblies/.
Karel Brinda. NCTC 3000 complete assemblies, 2021. URL: https://doi.org/10.5281/zenodo.4838517.
Andrei Z Broder, Moses Charikar, Alan M Frieze, and Michael Mitzenmacher. Min-Wise Independent Permutations. Journal of Computer and System Sciences, 60(3):630-659, 2000. URL: https://doi.org/10.1006/jcss.1999.1690.
A.Z. Broder. On the resemblance and containment of documents. In Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171), pages 21-29, 1997. URL: https://doi.org/10.1109/SEQUEN.1997.666900.
Karel Břinda, Leandro Lima, Simone Pignotti, Natalia Quinones-Olvera, Kamil Salikhov, Rayan Chikhi, Gregory Kucherov, Zamin Iqbal, and Michael Baym. Efficient and robust search of microbial genomes via phylogenetic compression. Nature Methods, 22(4):692-697, 2025. URL: https://doi.org/10.1038/s41592-025-02625-2.
J. Cleary and I. Witten. Data Compression Using Adaptive Coding and Partial String Matching. IEEE Transactions on Communications, 32(4):396-402, 1984. URL: https://doi.org/10.1109/TCOM.1984.1096090.
Yann Collet. Cyan4973/FiniteStateEntropy, 2026. original-date: 2013-12-15T14:05:21Z. URL: https://github.com/Cyan4973/FiniteStateEntropy.
The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature, 526(7571):68-74, 2015. URL: https://doi.org/10.1038/nature15393.
Petr Danecek, Adam Auton, Goncalo Abecasis, Cornelis A. Albers, Eric Banks, Mark A. DePristo, Robert E. Handsaker, Gerton Lunter, Gabor T. Marth, Stephen T. Sherry, Gilean McVean, and Richard Durbin. The variant call format and VCFtools. Bioinformatics, 27(15):2156-2158, 2011. URL: https://doi.org/10.1093/bioinformatics/btr330.
Sebastian Deorowicz, Agnieszka Danek, and Szymon Grabowski. Genome compression: a novel approach for large collections. Bioinformatics, 29(20):2572-2578, 2013. URL: https://doi.org/10.1093/bioinformatics/btt460.
Sebastian Deorowicz, Agnieszka Danek, and Marcin Niemiec. GDC 2: Compression of large collections of genomes. Scientific Reports, 5(1):11565, 2015. URL: https://doi.org/10.1038/srep11565.
Sebastian Deorowicz and Szymon Grabowski. Robust relative compression of genomes with random access. Bioinformatics, 27(21):2979-2986, 2011. URL: https://doi.org/10.1093/bioinformatics/btr505.
Martin Dietzfelbinger, Torben Hagerup, Jyrki Katajainen, and Martti Penttonen. A Reliable Randomized Algorithm for the Closest-Pair Problem. Journal of Algorithms, 25(1):19-51, 1997. URL: https://doi.org/10.1006/jagm.1997.0873.
Huy Hoang Do, Jesper Jansson, Kunihiko Sadakane, and Wing-Kin Sung. Fast relative Lempel–Ziv self-index for similar sequences. Theoretical Computer Science, 532:14-30, 2014. URL: https://doi.org/10.1016/j.tcs.2013.07.024.
Travis Gagie, Paweł Gawrychowski, Juha Kärkkäinen, Yakov Nekrich, and Simon J. Puglisi. A Faster Grammar-Based Self-index. In Adrian-Horia Dediu and Carlos Martín-Vide, editors, Language and Automata Theory and Applications, pages 240-251, Berlin, Heidelberg, 2012. Springer. URL: https://doi.org/10.1007/978-3-642-28332-1_21.
Yonatan Grad. Data for "Genomic Epidemiology of Gonococcal Resistance to Extended-Spectrum Cephalosporins, Macrolides, and Fluoroquinolones in the United States, 2000-2013", 2019. URL: https://doi.org/10.5281/zenodo.2618836.
Christopher Hoobin, Simon J. Puglisi, and Justin Zobel. Relative Lempel-Ziv factorization for efficient storage and retrieval of web collections. Proc. VLDB Endow., 5(3):265-273, 2011. URL: https://doi.org/10.14778/2078331.2078341.
Shanika Kuruppu, Simon J. Puglisi, and Justin Zobel. Relative Lempel-Ziv Compression of Genomes for Large-Scale Storage and Retrieval. String Processing and Information Retrieval, 6393:201-206, 2010. URL: https://doi.org/10.1007/978-3-642-16321-0_20.
Kewen Liao, Matthias Petri, Alistair Moffat, and Anthony Wirth. Effective Construction of Relative Lempel-Ziv Dictionaries. In Proceedings of the 25th International Conference on World Wide Web, WWW '16, pages 807-816, Republic and Canton of Geneva, CHE, 2016. International World Wide Web Conferences Steering Committee. URL: https://doi.org/10.1145/2872427.2883042.
Yuansheng Liu, Hui Peng, Limsoon Wong, and Jinyan Li. High-speed and high-ratio referential genome compression. Bioinformatics, 33(21):3364-3372, 2017. URL: https://doi.org/10.1093/bioinformatics/btx412.
A. Moffat. Implementing the PPM data compression scheme. IEEE Transactions on Communications, 38(11):1917-1921, 1990. URL: https://doi.org/10.1109/26.61469.
A. Moffat and A. Turpin. On the implementation of minimum redundancy prefix codes. IEEE Transactions on Communications, 45(10):1200-1207, 1997. URL: https://doi.org/10.1109/26.634683.
Tommi Mäklin, Teemu Kallonen, Jarno Alanko, Veli Mäkinen, Jukka Corander, and Antti Honkela. mGEMS Escherichia coli reference dataset, 2020. URL: https://doi.org/10.5281/zenodo.3724112.
Tommi Mäklin, Teemu Kallonen, Jarno Alanko, Veli Mäkinen, Jukka Corander, and Antti Honkela. mGEMS Staphylococcus aureus reference dataset, 2020. URL: https://doi.org/10.5281/zenodo.3724135.
Gonzalo Navarro and Víctor Sepúlveda. Practical Indexing of Repetitive Collections Using Relative Lempel-Ziv. In 2019 Data Compression Conference (DCC), pages 201-210, 2019. ISSN: 2375-0359. URL: https://doi.org/10.1109/DCC.2019.00028.
NCBI. FASTA Format for Nucleotide Sequences. URL: https://www.ncbi.nlm.nih.gov/genbank/fastaformat/.
NCBI. The NCBI Genome assembly data model. Section: data-processing. URL: https://www.ncbi.nlm.nih.gov/datasets/docs/v2/data-processing/policies-annotation/data-model/.
Idoia Ochoa, Mikel Hernaez, and Tsachy Weissman. iDoComp: a compression scheme for assembled genomes. Bioinformatics, 31(5):626-633, 2015. URL: https://doi.org/10.1093/bioinformatics/btu698.
Brian D. Ondov, Gabriel J. Starrett, Anna Sappington, Aleksandra Kostic, Sergey Koren, Christopher B. Buck, and Adam M. Phillippy. Mash Screen: high-throughput sequence containment estimation for genome discovery. Genome Biology, 20(1):232, 2019. URL: https://doi.org/10.1186/s13059-019-1841-x.
Brian D. Ondov, Todd J. Treangen, Páll Melsted, Adam B. Mallonee, Nicholas H. Bergman, Sergey Koren, and Adam M. Phillippy. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biology, 17(1):132, 2016. URL: https://doi.org/10.1186/s13059-016-0997-x.
Z. Ouyang, N. Memon, T. Suel, and D. Trendafilov. Cluster-based delta compression of a collection of files. In Proceedings of the Third International Conference on Web Information Systems Engineering, 2002. WISE 2002., pages 257-266, 2002. URL: https://doi.org/10.1109/WISE.2002.1181662.
Dmitri S. Pavlichin, Tsachy Weissman, and Golan Yona. The human genome contracts again. Bioinformatics, 29(17):2199-2202, 2013. URL: https://doi.org/10.1093/bioinformatics/btt362.
Matthias Petri, Alistair Moffat, P. C. Nagesh, and Anthony Wirth. Access Time Tradeoffs in Archive Compression. In Guido Zuccon, Shlomo Geva, Hideo Joho, Falk Scholer, Aixin Sun, and Peng Zhang, editors, Information Retrieval Technology, pages 15-28, Cham, 2015. Springer International Publishing. URL: https://doi.org/10.1007/978-3-319-28940-3_2.
Subrata Saha and Sanguthevar Rajasekaran. ERGC: an efficient referential genome compression algorithm. Bioinformatics, 31(21):3468-3475, 2015. URL: https://doi.org/10.1093/bioinformatics/btv399.
Wei Shi, Jianhua Chen, Mao Luo, and Min Chen. High efficiency referential genome compression algorithm. Bioinformatics, 35(12):2058-2065, 2019. URL: https://doi.org/10.1093/bioinformatics/bty934.
Eric L. Stevens, Ruth Timme, Eric W. Brown, Marc W. Allard, Errol Strain, Kelly Bunning, and Steven Musser. The Public Health Impact of a Publically Available, Environmental Database of Microbial Genomes. Frontiers in Microbiology, 8, 2017. URL: https://doi.org/10.3389/fmicb.2017.00808.
James A. Storer and Thomas G. Szymanski. Data compression via textual substitution. J. ACM, 29(4):928-951, October 1982. URL: https://doi.org/10.1145/322344.322346.
Tao Tang, Yuansheng Liu, Buzhong Zhang, Benyue Su, and Jinyan Li. Sketch distance-based clustering of chromosomes for large genome database compression. BMC Genomics, 20(10):978, 2019. URL: https://doi.org/10.1186/s12864-019-6310-0.
R. E. Tarjan. Finding optimum branchings. Networks, 7(1):25-35, 1977. URL: https://doi.org/10.1002/net.3230070103.
Jiancong Tong, Anthony Wirth, and Justin Zobel. Compact Auxiliary Dictionaries for Incremental Compression of Large Repositories. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, CIKM '14, pages 1629-1638, New York, NY, USA, 2014. Association for Computing Machinery. URL: https://doi.org/10.1145/2661829.2661961.
Jiancong Tong, Anthony Wirth, and Justin Zobel. Principled dictionary pruning for low-memory corpus compression. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, SIGIR '14, pages 283-292, New York, NY, USA, 2014. Association for Computing Machinery. URL: https://doi.org/10.1145/2600428.2609576.
A. Turpin and A. Moffat. Housekeeping for prefix coding. IEEE Transactions on Communications, 48(4):622-628, 2000. URL: https://doi.org/10.1109/26.843129.
Andrew Turpin. turpinandrew/shuff, 2024. original-date: 2017-04-03T01:08:56Z. URL: https://github.com/turpinandrew/shuff.
