Search Results

Documents authored by Varki, Rahul


Document
Re²Pair: Increasing the Scalability of RePair by Decreasing Memory Usage

Authors: Justin Kim, Rahul Varki, Marco Oliva, and Christina Boucher

Published in: LIPIcs, Volume 308, 32nd Annual European Symposium on Algorithms (ESA 2024)


Abstract
The RePair compression algorithm produces a context-free grammar by iteratively substituting the most frequently occurring pair of consecutive symbols with a new symbol until all consecutive pairs of symbols appear only once in the compressed text. It is widely used in the settings of bioinformatics, machine learning, and information retrieval where random access to the original input text is needed. For example, in pangenomics, RePair is used for random access to a population of genomes. BigRePair improves the scalability of the original RePair algorithm by using Prefix-Free Parsing (PFP) to preprocess the text prior to building the RePair grammar. Despite the efficiency of PFP on repetitive text, there is a scalability issue with the size of the parse which causes a memory bottleneck in BigRePair. In this paper, we design and implement recursive RePair (denoted as Re²Pair), which builds the RePair grammar using recursive PFP. Our novel algorithm faces the challenge of constructing the RePair grammar without direct access to the parse of text, relying solely on the dictionary of the text and the parse and dictionary of the parse of the text. We compare Re²Pair to BigRePair using SARS-CoV-2 haplotypes and haplotypes from the 1000 Genomes Project. We show that our method Re²Pair achieves over a 40% peak memory reduction and a speed up ranging between 12% to 79% compared to BigRePair when compressing the largest input texts in all experiments. Re²Pair is made publicly available under the GNU public license here: https://github.com/jkim210/Recursive-RePair

Cite as

Justin Kim, Rahul Varki, Marco Oliva, and Christina Boucher. Re²Pair: Increasing the Scalability of RePair by Decreasing Memory Usage. In 32nd Annual European Symposium on Algorithms (ESA 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 308, pp. 78:1-78:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)


Copy BibTex To Clipboard

@InProceedings{kim_et_al:LIPIcs.ESA.2024.78,
  author =	{Kim, Justin and Varki, Rahul and Oliva, Marco and Boucher, Christina},
  title =	{{Re²Pair: Increasing the Scalability of RePair by Decreasing Memory Usage}},
  booktitle =	{32nd Annual European Symposium on Algorithms (ESA 2024)},
  pages =	{78:1--78:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-338-6},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{308},
  editor =	{Chan, Timothy and Fischer, Johannes and Iacono, John and Herman, Grzegorz},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2024.78},
  URN =		{urn:nbn:de:0030-drops-211496},
  doi =		{10.4230/LIPIcs.ESA.2024.78},
  annote =	{Keywords: RePair, Compressed Data Structures, Prefix-free Parsing}
}
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail