Search Results

Documents authored by Romana, Giuseppe


Document
On the Impact of Morphisms on BWT-Runs

Authors: Gabriele Fici, Giuseppe Romana, Marinella Sciortino, and Cristian Urbina

Published in: LIPIcs, Volume 259, 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)


Abstract
Morphisms are widely studied combinatorial objects that can be used for generating infinite families of words. In the context of Information theory, injective morphisms are called (variable length) codes. In Data compression, the morphisms, combined with parsing techniques, have been recently used to define new mechanisms to generate repetitive words. Here, we show that the repetitiveness induced by applying a morphism to a word can be captured by a compression scheme based on the Burrows-Wheeler Transform (BWT). In fact, we prove that, differently from other compression-based repetitiveness measures, the measure r_bwt (which counts the number of equal-letter runs produced by applying BWT to a word) strongly depends on the applied morphism. More in detail, we characterize the binary morphisms that preserve the value of r_bwt(w), when applied to any binary word w containing both letters. They are precisely the Sturmian morphisms, which are well-known objects in Combinatorics on words. Moreover, we prove that it is always possible to find a binary morphism that, when applied to any binary word containing both letters, increases the number of BWT-equal letter runs by a given (even) number. In addition, we derive a method for constructing arbitrarily large families of binary words on which BWT produces a given (even) number of new equal-letter runs. Such results are obtained by using a new class of morphisms that we call Thue-Morse-like. Finally, we show that there exist binary morphisms μ for which it is possible to find words w such that the difference r_bwt(μ(w))-r_bwt(w) is arbitrarily large.

Cite as

Gabriele Fici, Giuseppe Romana, Marinella Sciortino, and Cristian Urbina. On the Impact of Morphisms on BWT-Runs. In 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 259, pp. 10:1-10:18, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2023)


Copy BibTex To Clipboard

@InProceedings{fici_et_al:LIPIcs.CPM.2023.10,
  author =	{Fici, Gabriele and Romana, Giuseppe and Sciortino, Marinella and Urbina, Cristian},
  title =	{{On the Impact of Morphisms on BWT-Runs}},
  booktitle =	{34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)},
  pages =	{10:1--10:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-276-1},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{259},
  editor =	{Bulteau, Laurent and Lipt\'{a}k, Zsuzsanna},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2023.10},
  URN =		{urn:nbn:de:0030-drops-179641},
  doi =		{10.4230/LIPIcs.CPM.2023.10},
  annote =	{Keywords: Morphism, Burrows-Wheeler transform, Sturmian word, Sturmian morphism, Thue-Morse morphism, Repetitiveness measure}
}
Document
Computing Maximal Unique Matches with the r-Index

Authors: Sara Giuliani, Giuseppe Romana, and Massimiliano Rossi

Published in: LIPIcs, Volume 233, 20th International Symposium on Experimental Algorithms (SEA 2022)


Abstract
In recent years, pangenomes received increasing attention from the scientific community for their ability to incorporate population variation information and alleviate reference genome bias. Maximal Exact Matches (MEMs) and Maximal Unique Matches (MUMs) have proven themselves to be useful in multiple bioinformatic contexts, for example short-read alignment and multiple-genome alignment. However, standard techniques using suffix trees and FM-indexes do not scale to a pangenomic level. Recently, Gagie et al. [JACM 20] introduced the r-index that is a Burrows-Wheeler Transform (BWT)-based index able to handle hundreds of human genomes. Later, Rossi et al. [JCB 22] enabled the computation of MEMs using the r-index, and Boucher et al. [DCC 21] showed how to compute them in a streaming fashion. In this paper, we show how to augment Boucher et al.’s approach to enable the computation of MUMs on the r-index, while preserving the space and time bounds. We add additional O(r) samples of the longest common prefix (LCP) array, where r is the number of equal-letter runs of the BWT, that permits the computation of the second longest match of the pattern suffix with respect to the input text, which in turn allows the computation of candidate MUMs. We implemented a proof-of-concept of our approach, that we call MUM-PHINDER, and tested on real-world datasets. We compared our approach with competing methods that are able to compute MUMs. We observe that our method is up to 8 times smaller, while up to 19 times slower when the dataset is not highly repetitive, while on highly repetitive data, our method is up to 6.5 times slower and uses up to 25 times less memory.

Cite as

Sara Giuliani, Giuseppe Romana, and Massimiliano Rossi. Computing Maximal Unique Matches with the r-Index. In 20th International Symposium on Experimental Algorithms (SEA 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 233, pp. 22:1-22:16, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2022)


Copy BibTex To Clipboard

@InProceedings{giuliani_et_al:LIPIcs.SEA.2022.22,
  author =	{Giuliani, Sara and Romana, Giuseppe and Rossi, Massimiliano},
  title =	{{Computing Maximal Unique Matches with the r-Index}},
  booktitle =	{20th International Symposium on Experimental Algorithms (SEA 2022)},
  pages =	{22:1--22:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-251-8},
  ISSN =	{1868-8969},
  year =	{2022},
  volume =	{233},
  editor =	{Schulz, Christian and U\c{c}ar, Bora},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2022.22},
  URN =		{urn:nbn:de:0030-drops-165568},
  doi =		{10.4230/LIPIcs.SEA.2022.22},
  annote =	{Keywords: Burrows-Wheeler Transform, r-index, maximal unique matches, bioinformatics, pangenomics}
}
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail