Re²Pair: Increasing the Scalability of RePair by Decreasing Memory Usage

Authors: Justin Kim, Rahul Varki, Marco Oliva, Christina Boucher



Author Details

Justin Kim
  • Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA
Rahul Varki
  • Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA
Marco Oliva
  • Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA
Christina Boucher
  • Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA

Cite As

Justin Kim, Rahul Varki, Marco Oliva, and Christina Boucher. Re²Pair: Increasing the Scalability of RePair by Decreasing Memory Usage. In 32nd Annual European Symposium on Algorithms (ESA 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 308, pp. 78:1-78:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.ESA.2024.78

Abstract

The RePair compression algorithm produces a context-free grammar by iteratively substituting the most frequently occurring pair of consecutive symbols with a new symbol until all consecutive pairs of symbols appear only once in the compressed text. It is widely used in bioinformatics, machine learning, and information retrieval, where random access to the original input text is needed. For example, in pangenomics, RePair is used for random access to a population of genomes. BigRePair improves the scalability of the original RePair algorithm by using Prefix-Free Parsing (PFP) to preprocess the text prior to building the RePair grammar. Despite the efficiency of PFP on repetitive text, the size of the parse limits scalability and causes a memory bottleneck in BigRePair. In this paper, we design and implement recursive RePair (denoted as Re²Pair), which builds the RePair grammar using recursive PFP. Our novel algorithm faces the challenge of constructing the RePair grammar without direct access to the parse of the text, relying solely on the dictionary of the text and the parse and dictionary of the parse of the text. We compare Re²Pair to BigRePair using SARS-CoV-2 haplotypes and haplotypes from the 1000 Genomes Project. We show that our method Re²Pair achieves a peak memory reduction of over 40% and a speedup ranging from 12% to 79% compared to BigRePair when compressing the largest input texts in all experiments. Re²Pair is made publicly available under the GNU public license at https://github.com/jkim210/Recursive-RePair
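
The pair-substitution loop that the abstract describes for the original RePair can be illustrated with a small Python sketch. It is a naive, in-memory version written only for illustration: the function names (`count_pairs`, `replace_pair`, `repair`, `expand`), the use of integers as nonterminals, and the tie-breaking rule are all assumptions of this sketch. It recounts pairs on every round, and it is not the BigRePair or Re²Pair implementation, neither of which works this way.

```python
from collections import Counter


def count_pairs(seq):
    """Count non-overlapping occurrences of each adjacent symbol pair."""
    counts = Counter()
    last_pos = {}
    for i in range(len(seq) - 1):
        pair = (seq[i], seq[i + 1])
        # In a run such as "aaa", the pair at i overlaps the one at i - 1,
        # so only one of them can actually be replaced.
        if last_pos.get(pair) == i - 1:
            continue
        counts[pair] += 1
        last_pos[pair] = i
    return counts


def replace_pair(seq, pair, new_symbol):
    """Replace occurrences of `pair` with `new_symbol`, scanning left to right."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def repair(text):
    """Return (compressed sequence, rules) for `text`.

    Terminals are one-character strings; nonterminals are integers.
    Each rule maps a nonterminal to the pair of symbols it replaces.
    """
    seq = list(text)
    rules = {}
    while True:
        counts = count_pairs(seq)
        if not counts:
            break
        # Ties are broken by first appearance (an arbitrary choice here).
        pair, freq = max(counts.items(), key=lambda kv: kv[1])
        if freq < 2:
            break  # every remaining pair appears only once
        new_symbol = len(rules)
        rules[new_symbol] = pair
        seq = replace_pair(seq, pair, new_symbol)
    return seq, rules


def expand(symbol, rules):
    """Expand a symbol back into the text it derives."""
    if symbol not in rules:
        return symbol
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)
```

Calling `repair("abracadabra abracadabra")` returns a shorter sequence plus the rules needed to rebuild the input, and joining `expand(s, rules)` over that sequence recovers the original text. Each round of this sketch rescans the whole sequence, so it illustrates the grammar construction only and says nothing about the time or memory behavior reported in the paper.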

Subject Classification

ACM Subject Classification
  • Theory of computation → Formal languages and automata theory
Keywords
  • RePair
  • Compressed Data Structures
  • Prefix-free Parsing
