Search Results

Documents authored by Boucher, Christina


Document
Re²Pair: Increasing the Scalability of RePair by Decreasing Memory Usage

Authors: Justin Kim, Rahul Varki, Marco Oliva, and Christina Boucher

Published in: LIPIcs, Volume 308, 32nd Annual European Symposium on Algorithms (ESA 2024)


Abstract
The RePair compression algorithm produces a context-free grammar by iteratively substituting the most frequently occurring pair of consecutive symbols with a new symbol until all consecutive pairs of symbols appear only once in the compressed text. It is widely used in bioinformatics, machine learning, and information retrieval, where random access to the original input text is needed. For example, in pangenomics, RePair is used for random access to a population of genomes. BigRePair improves the scalability of the original RePair algorithm by using Prefix-Free Parsing (PFP) to preprocess the text prior to building the RePair grammar. Despite the efficiency of PFP on repetitive text, the size of the parse poses a scalability issue that causes a memory bottleneck in BigRePair. In this paper, we design and implement recursive RePair (denoted Re²Pair), which builds the RePair grammar using recursive PFP. Our novel algorithm faces the challenge of constructing the RePair grammar without direct access to the parse of the text, relying solely on the dictionary of the text and the parse and dictionary of the parse of the text. We compare Re²Pair to BigRePair using SARS-CoV-2 haplotypes and haplotypes from the 1000 Genomes Project. We show that Re²Pair achieves a peak memory reduction of over 40% and a speedup ranging from 12% to 79% compared to BigRePair when compressing the largest input texts in all experiments. Re²Pair is publicly available under the GNU public license at https://github.com/jkim210/Recursive-RePair
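
The pair-substitution loop described in the abstract above can be illustrated with a minimal sketch. It assumes a naive implementation that recounts all pairs on every round, and it is not the Re²Pair or BigRePair code. The function names (`count_pairs`, `replace_pair`, `repair`, `expand`) and the choice of integers as new symbols are hypothetical, chosen only for illustration.

```python
from collections import Counter


def count_pairs(seq):
    """Count non-overlapping occurrences of each adjacent pair, left to right."""
    counts = Counter()
    last_start = {}  # start index of the most recently counted occurrence
    for i in range(len(seq) - 1):
        pair = (seq[i], seq[i + 1])
        # In a run such as "aaa", the pair at i overlaps the one counted at i - 1.
        if pair[0] == pair[1] and last_start.get(pair) == i - 1:
            continue
        counts[pair] += 1
        last_start[pair] = i
    return counts


def replace_pair(seq, pair, new_sym):
    """Replace every non-overlapping occurrence of `pair`, scanning left to right."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_sym)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def repair(text):
    """Naive RePair sketch: returns (final sequence, rules).

    Terminals are single-character strings; new symbols are integers
    (an assumption of this sketch, made to keep the two kinds distinct).
    """
    seq = list(text)
    rules = {}
    while True:
        counts = count_pairs(seq)
        if not counts:
            break
        pair, freq = max(counts.items(), key=lambda kv: kv[1])
        if freq < 2:  # every remaining pair appears only once
            break
        new_sym = len(rules)
        rules[new_sym] = pair
        seq = replace_pair(seq, pair, new_sym)
    return seq, rules


def expand(seq, rules):
    """Decompress by expanding rules iteratively with an explicit stack."""
    out = []
    stack = list(reversed(seq))
    while stack:
        sym = stack.pop()
        if sym in rules:
            left, right = rules[sym]
            stack.append(right)
            stack.append(left)
        else:
            out.append(sym)
    return "".join(out)


if __name__ == "__main__":
    text = "GATTACAGATTACA"
    seq, rules = repair(text)
    assert expand(seq, rules) == text
    print(seq, rules)
```

Recounting every pair on each round costs quadratic time in the worst case; the sketch trades that efficiency for brevity and says nothing about how the published tools organize their data structures.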

Cite as

Justin Kim, Rahul Varki, Marco Oliva, and Christina Boucher. Re²Pair: Increasing the Scalability of RePair by Decreasing Memory Usage. In 32nd Annual European Symposium on Algorithms (ESA 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 308, pp. 78:1-78:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)



@InProceedings{kim_et_al:LIPIcs.ESA.2024.78,
  author =	{Kim, Justin and Varki, Rahul and Oliva, Marco and Boucher, Christina},
  title =	{{Re²Pair: Increasing the Scalability of RePair by Decreasing Memory Usage}},
  booktitle =	{32nd Annual European Symposium on Algorithms (ESA 2024)},
  pages =	{78:1--78:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-338-6},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{308},
  editor =	{Chan, Timothy and Fischer, Johannes and Iacono, John and Herman, Grzegorz},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2024.78},
  URN =		{urn:nbn:de:0030-drops-211496},
  doi =		{10.4230/LIPIcs.ESA.2024.78},
  annote =	{Keywords: RePair, Compressed Data Structures, Prefix-free Parsing}
}
Document
Solving the Minimal Positional Substring Cover Problem in Sublinear Space

Authors: Paola Bonizzoni, Christina Boucher, Davide Cozzi, Travis Gagie, and Yuri Pirola

Published in: LIPIcs, Volume 296, 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024)


Abstract
Within the field of haplotype analysis, the Positional Burrows-Wheeler Transform (PBWT) stands out as a key innovation, addressing numerous challenges in genomics. For example, Sanaullah et al. introduced a PBWT-based method that addresses the haplotype threading problem, which involves representing a query haplotype through a minimal set of substrings. To solve this problem using the PBWT data structure, they formulate the Minimal Positional Substring Cover (MPSC) problem and then present a solution for it. Additionally, they present and solve several variants of this problem: k-MPSC, leftmost MPSC, rightmost MPSC, and length-maximal MPSC. Yet each of their solutions requires a full PBWT, which demands significant memory. Here, we take advantage of the latest results on run-length encoding the PBWT to solve the MPSC in a sublinear amount of space. Our methods involve demonstrating that k-Set Maximal Exact Matches (k-SMEMs) can be computed in a sublinear amount of space via efficient computation of k-Matching Statistics (k-MS). This leads to a solution that requires sublinear space not only for the MPSC problem but also for all its variations proposed by Sanaullah et al. Most importantly, we present experimental results on haplotype panels from the 1000 Genomes Project data that show the utility of these theoretical results. We conclusively demonstrate that our approach markedly decreases the memory required to solve the MPSC problem, achieving a reduction of at least two orders of magnitude compared to the method proposed by Sanaullah et al. This efficiency allows us to solve large instances of the problem to which other methods are unable to scale. In summary, the creation of μ-PBWT paves the way for new possibilities in conducting in-depth genetic research and analysis on a large scale. All source code is publicly available at https://github.com/dlcgold/muPBWT/tree/k-smem.

Cite as

Paola Bonizzoni, Christina Boucher, Davide Cozzi, Travis Gagie, and Yuri Pirola. Solving the Minimal Positional Substring Cover Problem in Sublinear Space. In 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 296, pp. 12:1-12:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)



@InProceedings{bonizzoni_et_al:LIPIcs.CPM.2024.12,
  author =	{Bonizzoni, Paola and Boucher, Christina and Cozzi, Davide and Gagie, Travis and Pirola, Yuri},
  title =	{{Solving the Minimal Positional Substring Cover Problem in Sublinear Space}},
  booktitle =	{35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024)},
  pages =	{12:1--12:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-326-3},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{296},
  editor =	{Inenaga, Shunsuke and Puglisi, Simon J.},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2024.12},
  URN =		{urn:nbn:de:0030-drops-201225},
  doi =		{10.4230/LIPIcs.CPM.2024.12},
  annote =	{Keywords: Positional Burrows-Wheeler Transform, r-index, minimal position substring cover, set-maximal exact matches}
}
Document
Acceleration of FM-Index Queries Through Prefix-Free Parsing

Authors: Aaron Hong, Marco Oliva, Dominik Köppl, Hideo Bannai, Christina Boucher, and Travis Gagie

Published in: LIPIcs, Volume 273, 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023)


Abstract
FM-indexes are a crucial data structure in DNA alignment, but searching with them usually takes at least one random access per character in the query pattern. Ferragina and Fischer [Ferragina and Fischer, 2007] observed that word-based indexes often use fewer random accesses than character-based indexes, and thus support faster searches. Since DNA lacks natural word boundaries, however, it is necessary to parse it somehow before applying word-based FM-indexing. Last year, Deng et al. [Deng et al., 2022] proposed parsing genomic data by induced suffix sorting, and showed that the resulting word-based FM-indexes support faster counting queries than standard FM-indexes when patterns are a few thousand characters or longer. In this paper we show that using prefix-free parsing (which takes parameters that let us tune the average length of the phrases) instead of induced suffix sorting gives a significant speedup for patterns of only a few hundred characters. We implement our method and demonstrate that it is between 3 and 18 times faster than competing methods on queries to GRCh38, and consistently faster on queries made to 25,000, 50,000 and 100,000 SARS-CoV-2 genomes. Hence, our method accelerates count queries over all state-of-the-art methods with a minor increase in memory.

Cite as

Aaron Hong, Marco Oliva, Dominik Köppl, Hideo Bannai, Christina Boucher, and Travis Gagie. Acceleration of FM-Index Queries Through Prefix-Free Parsing. In 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 273, pp. 13:1-13:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)



@InProceedings{hong_et_al:LIPIcs.WABI.2023.13,
  author =	{Hong, Aaron and Oliva, Marco and K\"{o}ppl, Dominik and Bannai, Hideo and Boucher, Christina and Gagie, Travis},
  title =	{{Acceleration of FM-Index Queries Through Prefix-Free Parsing}},
  booktitle =	{23rd International Workshop on Algorithms in Bioinformatics (WABI 2023)},
  pages =	{13:1--13:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-294-5},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{273},
  editor =	{Belazzougui, Djamal and Ouangraoua, A\"{i}da},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2023.13},
  URN =		{urn:nbn:de:0030-drops-186390},
  doi =		{10.4230/LIPIcs.WABI.2023.13},
  annote =	{Keywords: FM-index, pangenomics, scalability, word-based indexing, random access}
}
Document
Complete Volume
LIPIcs, Volume 242, WABI 2022, Complete Volume

Authors: Christina Boucher and Sven Rahmann

Published in: LIPIcs, Volume 242, 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)


Abstract
LIPIcs, Volume 242, WABI 2022, Complete Volume

Cite as

22nd International Workshop on Algorithms in Bioinformatics (WABI 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 242, pp. 1-474, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)



@Proceedings{boucher_et_al:LIPIcs.WABI.2022,
  title =	{{LIPIcs, Volume 242, WABI 2022, Complete Volume}},
  booktitle =	{22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)},
  pages =	{1--474},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-243-3},
  ISSN =	{1868-8969},
  year =	{2022},
  volume =	{242},
  editor =	{Boucher, Christina and Rahmann, Sven},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2022},
  URN =		{urn:nbn:de:0030-drops-170338},
  doi =		{10.4230/LIPIcs.WABI.2022},
  annote =	{Keywords: LIPIcs, Volume 242, WABI 2022, Complete Volume}
}
Document
Front Matter
Front Matter, Table of Contents, Preface, Conference Organization

Authors: Christina Boucher and Sven Rahmann

Published in: LIPIcs, Volume 242, 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)


Abstract
Front Matter, Table of Contents, Preface, Conference Organization

Cite as

22nd International Workshop on Algorithms in Bioinformatics (WABI 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 242, pp. 0:i-0:xii, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)



@InProceedings{boucher_et_al:LIPIcs.WABI.2022.0,
  author =	{Boucher, Christina and Rahmann, Sven},
  title =	{{Front Matter, Table of Contents, Preface, Conference Organization}},
  booktitle =	{22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)},
  pages =	{0:i--0:xii},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-243-3},
  ISSN =	{1868-8969},
  year =	{2022},
  volume =	{242},
  editor =	{Boucher, Christina and Rahmann, Sven},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2022.0},
  URN =		{urn:nbn:de:0030-drops-170347},
  doi =		{10.4230/LIPIcs.WABI.2022.0},
  annote =	{Keywords: Front Matter, Table of Contents, Preface, Conference Organization}
}
Document
Fast and Efficient Rmap Assembly Using the Bi-Labelled de Bruijn Graph

Authors: Kingshuk Mukherjee, Massimiliano Rossi, Leena Salmela, and Christina Boucher

Published in: LIPIcs, Volume 172, 20th International Workshop on Algorithms in Bioinformatics (WABI 2020)


Abstract
Genome-wide optical maps are high-resolution restriction maps that give a unique numeric representation to a genome. They are produced by assembling hundreds of thousands of single-molecule optical maps, which are called Rmaps. Unfortunately, there exist very few choices for assembling Rmap data. There exists only one publicly available non-proprietary method for assembly and one proprietary method that is available via an executable. Furthermore, the publicly available method, by Valouev et al. (2006), follows the overlap-layout-consensus (OLC) paradigm, and therefore is unable to scale to relatively large genomes. The algorithm behind the proprietary method, Bionano Genomics' Solve, is largely unknown. In this paper, we extend the definition of bi-labels in the paired de Bruijn graph to the context of optical mapping data, and present the first de Bruijn graph based method for Rmap assembly. We implement our approach, which we refer to as rmapper, and compare its performance against the assembler of Valouev et al. (2006) and Solve by Bionano Genomics on data from three genomes: E. coli, human, and climbing perch fish (Anabas testudineus). Our method was the only one able to successfully run on all three genomes. The method of Valouev et al. (2006) only successfully ran on E. coli, and Bionano Solve successfully ran on E. coli and human but not on the fish genome. Moreover, on the human genome rmapper was at least 130 times faster than Bionano Solve, used five times less memory, and produced the highest genome fraction with zero mis-assemblies.

Cite as

Kingshuk Mukherjee, Massimiliano Rossi, Leena Salmela, and Christina Boucher. Fast and Efficient Rmap Assembly Using the Bi-Labelled de Bruijn Graph. In 20th International Workshop on Algorithms in Bioinformatics (WABI 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 172, pp. 9:1-9:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)



@InProceedings{mukherjee_et_al:LIPIcs.WABI.2020.9,
  author =	{Mukherjee, Kingshuk and Rossi, Massimiliano and Salmela, Leena and Boucher, Christina},
  title =	{{Fast and Efficient Rmap Assembly Using the Bi-Labelled de Bruijn Graph}},
  booktitle =	{20th International Workshop on Algorithms in Bioinformatics (WABI 2020)},
  pages =	{9:1--9:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-161-0},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{172},
  editor =	{Kingsford, Carl and Pisanti, Nadia},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2020.9},
  URN =		{urn:nbn:de:0030-drops-127982},
  doi =		{10.4230/LIPIcs.WABI.2020.9},
  annote =	{Keywords: optical maps, de Bruijn graph, assembly}
}
Document
Prefix-Free Parsing for Building Big BWTs

Authors: Christina Boucher, Travis Gagie, Alan Kuhnle, and Giovanni Manzini

Published in: LIPIcs, Volume 113, 18th International Workshop on Algorithms in Bioinformatics (WABI 2018)


Abstract
High-throughput sequencing technologies have led to explosive growth of genomic databases, one of which will soon reach hundreds of terabytes. For many applications we want to build and store indexes of these databases, but constructing such indexes is a challenge. Fortunately, many of these genomic databases are highly repetitive, a characteristic that can be exploited to enable the computation of the Burrows-Wheeler Transform (BWT), which underlies many popular indexes. In this paper, we introduce a preprocessing algorithm, referred to as prefix-free parsing, that takes a text T as input and in one pass generates a dictionary D and a parse P of T with the property that the BWT of T can be constructed from D and P using workspace proportional to their total size and O(|T|) time. Our experiments show that D and P are significantly smaller than T in practice, and thus can fit in a reasonable internal memory even when T is very large. Therefore, prefix-free parsing eases BWT construction, which is pertinent to many bioinformatics applications.

Cite as

Christina Boucher, Travis Gagie, Alan Kuhnle, and Giovanni Manzini. Prefix-Free Parsing for Building Big BWTs. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 2:1-2:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)



@InProceedings{boucher_et_al:LIPIcs.WABI.2018.2,
  author =	{Boucher, Christina and Gagie, Travis and Kuhnle, Alan and Manzini, Giovanni},
  title =	{{Prefix-Free Parsing for Building Big BWTs}},
  booktitle =	{18th International Workshop on Algorithms in Bioinformatics (WABI 2018)},
  pages =	{2:1--2:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-082-8},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{113},
  editor =	{Parida, Laxmi and Ukkonen, Esko},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2018.2},
  URN =		{urn:nbn:de:0030-drops-93044},
  doi =		{10.4230/LIPIcs.WABI.2018.2},
  annote =	{Keywords: Burrows-Wheeler Transform, prefix-free parsing, compression-aware algorithms, genomic databases}
}
Document
A Succinct Solution to Rmap Alignment

Authors: Martin D. Muggli, Simon J. Puglisi, and Christina Boucher

Published in: LIPIcs, Volume 113, 18th International Workshop on Algorithms in Bioinformatics (WABI 2018)


Abstract
We present Kohdista, which is an index-based algorithm for finding pairwise alignments between single-molecule maps (Rmaps). The novelty of our approach is the formulation of the alignment problem as automaton path matching, and the application of modern index-based data structures. In particular, we combine the use of the Generalized Compressed Suffix Array (GCSA) index with the wavelet tree in order to build Kohdista. We validate Kohdista on simulated E. coli data, showing that the approach successfully finds alignments between Rmaps simulated from overlapping genomic regions. Lastly, we demonstrate that Kohdista is the only method capable of finding a significant number of high-quality pairwise Rmap alignments for large eukaryote organisms in reasonable time. Kohdista is available at https://github.com/mmuggli/KOHDISTA/.

Cite as

Martin D. Muggli, Simon J. Puglisi, and Christina Boucher. A Succinct Solution to Rmap Alignment. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 12:1-12:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)



@InProceedings{muggli_et_al:LIPIcs.WABI.2018.12,
  author =	{Muggli, Martin D. and Puglisi, Simon J. and Boucher, Christina},
  title =	{{A Succinct Solution to Rmap Alignment}},
  booktitle =	{18th International Workshop on Algorithms in Bioinformatics (WABI 2018)},
  pages =	{12:1--12:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-082-8},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{113},
  editor =	{Parida, Laxmi and Ukkonen, Esko},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2018.12},
  URN =		{urn:nbn:de:0030-drops-93143},
  doi =		{10.4230/LIPIcs.WABI.2018.12},
  annote =	{Keywords: Optical mapping, index based data structures, FM-index, graph algorithms}
}
Document
Disentangled Long-Read De Bruijn Graphs via Optical Maps

Authors: Bahar Alipanahi, Leena Salmela, Simon J. Puglisi, Martin Muggli, and Christina Boucher

Published in: LIPIcs, Volume 88, 17th International Workshop on Algorithms in Bioinformatics (WABI 2017)


Abstract
While long reads produced by third-generation sequencing technology from, e.g., Pacific Biosciences have been shown to increase the quality of draft genomes in repetitive regions, fundamental computational challenges remain in overcoming their high error rate and assembling them efficiently. In this paper we show that the de Bruijn graph built on the long reads can be efficiently and substantially disentangled using optical mapping data as auxiliary information. Fundamental to our approach is the use of the positional de Bruijn graph and a succinct data structure for constructing and traversing this graph. Our experimental results show that over 97.7% of directed cycles have been removed from the resulting positional de Bruijn graph as compared to its non-positional counterpart. Our results thus indicate that disentangling the de Bruijn graph using positional information is a promising direction for developing a simple and efficient assembly algorithm for long reads.

Cite as

Bahar Alipanahi, Leena Salmela, Simon J. Puglisi, Martin Muggli, and Christina Boucher. Disentangled Long-Read De Bruijn Graphs via Optical Maps. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 1:1-1:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)



@InProceedings{alipanahi_et_al:LIPIcs.WABI.2017.1,
  author =	{Alipanahi, Bahar and Salmela, Leena and Puglisi, Simon J. and Muggli, Martin and Boucher, Christina},
  title =	{{Disentangled Long-Read De Bruijn Graphs via Optical Maps}},
  booktitle =	{17th International Workshop on Algorithms in Bioinformatics (WABI 2017)},
  pages =	{1:1--1:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-050-7},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{88},
  editor =	{Schwartz, Russell and Reinert, Knut},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2017.1},
  URN =		{urn:nbn:de:0030-drops-76614},
  doi =		{10.4230/LIPIcs.WABI.2017.1},
  annote =	{Keywords: Positional de Bruijn graph, Genome Assembly, Long Read Data, Optical maps}
}