Computing Maximal Unique Matches with the r-Index

Giuliani, Sara; Romana, Giuseppe; Rossi, Massimiliano

doi:10.4230/LIPIcs.SEA.2022.22

Abstract

In recent years, pangenomes received increasing attention from the scientific community for their ability to incorporate population variation information and alleviate reference genome bias. Maximal Exact Matches (MEMs) and Maximal Unique Matches (MUMs) have proven themselves to be useful in multiple bioinformatic contexts, for example short-read alignment and multiple-genome alignment. However, standard techniques using suffix trees and FM-indexes do not scale to a pangenomic level. Recently, Gagie et al. [JACM 20] introduced the r-index that is a Burrows-Wheeler Transform (BWT)-based index able to handle hundreds of human genomes. Later, Rossi et al. [JCB 22] enabled the computation of MEMs using the r-index, and Boucher et al. [DCC 21] showed how to compute them in a streaming fashion.
In this paper, we show how to augment Boucher et al.’s approach to enable the computation of MUMs on the r-index, while preserving the space and time bounds. We add additional O(r) samples of the longest common prefix (LCP) array, where r is the number of equal-letter runs of the BWT, that permits the computation of the second longest match of the pattern suffix with respect to the input text, which in turn allows the computation of candidate MUMs. We implemented a proof-of-concept of our approach, that we call MUM-PHINDER, and tested on real-world datasets. We compared our approach with competing methods that are able to compute MUMs. We observe that our method is up to 8 times smaller, while up to 19 times slower when the dataset is not highly repetitive, while on highly repetitive data, our method is up to 6.5 times slower and uses up to 25 times less memory.

Cite As Get BibTex

Sara Giuliani, Giuseppe Romana, and Massimiliano Rossi. Computing Maximal Unique Matches with the r-Index. In 20th International Symposium on Experimental Algorithms (SEA 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 233, pp. 22:1-22:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022) https://doi.org/10.4230/LIPIcs.SEA.2022.22

Author Details

Sara Giuliani

Department of Computer Science, University of Verona, Italy

Giuseppe Romana

Department of Computer Science, University of Palermo, Italy

Massimiliano Rossi

Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA

Funding

Rossi, Massimiliano: National Science Foundation NSF EAGER (Grant No. 2118251), and National Institutes of Health (NIH) NIAID (Grant No. HG011392).

Acknowledgements

We thank Travis Gagie for suggesting this problem as a project for his course CSCI 6905 at Dalhousie University. We also thank the anonymous reviewers for their insightful comments.

Supplementary Materials

Software https://github.com/saragiuliani/mum-phinder browse archived version

References

Joel Armstrong, Glenn Hickey, Mark Diekhans, Ian T. Fiddes, Adam M. Novak, Alden Deran, Qi Fang, Duo Xie, Shaohong Feng, et al. Progressive Cactus is a multiple-genome aligner for the thousand-genome era. Nature, 587(7833):246-251, 2020.
Hideo Bannai, Travis Gagie, and Tomohiro I. Refining the r-index. Theoretical Computer Science, 812:96-108, 2020.
Christina Boucher, Travis Gagie, Tomohiro I, Dominik Köppl, Ben Langmead, Giovanni Manzini, Gonzalo Navarro, Alejandro Pacheco, and Massimiliano Rossi. PHONI: streamed matching statistics with multi-genome references. In Proceedings of 2021 Data Compression Conference DCC, pages 193-202. IEEE, 2021.
Michael Burrows and David Wheeler. A block-sorting lossless data compression algorithm. Technical report, DIGITAL SRC RESEARCH REPORT, 1994.
Aaron C. E. Darling, Bob Mau, Frederick R. Blattner, and Nicole T. Perna. Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res., 14(7):1394-1403, 2004.
Aaron E. Darling, Bob Mau, and Nicole T. Perna. progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement. PLoS One, 5(6):e11147, 2010.
Marc Deloger, Meriem El Karoui, and Marie-Agnès Petit. A genomic distance based on MUM indicates discontinuity between most bacterial species and genera. J. Bacteriol., 191(1):91-99, 2009.
Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In In Proceedings 41st annual Symposium on Foundations of Computer ScienceFOCS, pages 390-398. IEEE Computer Society, 2000.
Travis Gagie, Tomohiro I, Giovanni Manzini, Gonzalo Navarro, Hiroshi Sakamoto, Louisa Seelbach Benkner, and Yoshimasa Takabatake. Practical Random Access to SLP-Compressed Texts. In Proceedings of the 27th International Symposium on String Processing and Information Retrieval (SPIRE 2020), volume 12303 of LNCS, pages 221-231. Springer, 2020.
Travis Gagie, Tomohiro I, Giovanni Manzini, Gonzalo Navarro, Hiroshi Sakamoto, and Yoshimasa Takabatake. Rpair: Rescaling RePair with Rsync. In String Processing and Information Retrieval - 26th International Symposium, SPIRE 2019, volume 11811 of LNCS, pages 35-44. Springer, 2019. URL: https://doi.org/10.1007/978-3-030-32686-9_3.
Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully functional suffix trees and optimal text searching in BWT-runs bounded space. J. ACM, 67(1):2:1-2:54, 2020.
Peter W. Harrison, Rodrigo Lopez, Nadim Rahman, Stefan Gutnick Allen, Raheela Aslam, Nicola Buso, Carla Cummins, Yasmin Fathy, Eloy Felix, et al. The COVID-19 Data Portal: accelerating SARS-CoV-2 and COVID-19 research through rapid open access data sharing. Nucleic Acids Research, 49(W1):W619-W623, 2021.
Stefan Kurtz, Adam Phillippy, Arthur L. Delcher, Michael Smoot, Martin Shumway, Corina Antonescu, and Steven L. Salzberg. Versatile and open software for comparing large genomes. Genome Biol., 5(2):R12, 2004.
Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L. Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome biology, 10(3):R25, 2009.
Heng Li and Richard Durbin. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics, 26(5):589-595, 2010.
Veli Mäkinen, Djamal Belazzougui, Fabio Cunial, and Alexandru I Tomescu. Genome-scale algorithm design. Cambridge University Press, 2015.
Veli Mäkinen and Gonzalo Navarro. Succinct suffix arrays based on run-length encoding. Nordic Journal of Computing, 12(1):40-66, 2005.
Guillaume Marçais, Arthur L. Delcher, Adam M. Phillippy, Rachel Coston, Steven L. Salzberg, and Aleksey Zimin. MUMmer4: A fast and versatile genome alignment system. PLoS Comput. Biol., 14(1):e1005944, 2018.
Sergey Nurk, Sergey Koren, Arang Rhie, Mikko Rautiainen, Andrey V. Bzikadze, Alla Mikheenko, Mitchell R. Vollger, Nicolas Altemose, Lev Uralsky, et al. The complete sequence of a human genome. bioRxiv, 2021.
Massimiliano Rossi, Marco Oliva, Ben Langmead, Travis Gagie, and Christina Boucher. MONI: A Pangenomic Index for Finding Maximal Exact Matches. J. Comput. Biol., January 2022.
Jouni Sirén, Jean Monlong, Xian Chang, Adam M. Novak, Jordan M. Eizenga, Charles Markello, Jonas A. Sibbesen, Glenn Hickey, Pi-Chuan Chang, et al. Pangenomics enables genotyping of known structural variants in 5202 diverse genomes. Science, 374(6574):abg8871, 2021.
The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature, pages 68-74, 2015.
Kaiyuan Zhu, Welles Robinson, Alejandro A. Schäffer, Junyan Xu, Eytan Ruppin, A. Funda Ergun, Yuzhen Ye, and S. Cenk Sahinalp. Strain Level Microbial Detection and Quantification with Applications to Single Cell Metagenomics. bioRxiv, page 2020.06.12.149245, 2020.

Computing Maximal Unique Matches with the r-Index

Authors Sara Giuliani , Giuseppe Romana , Massimiliano Rossi

File

Document Identifiers

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

Acknowledgements

References

Thanks for your feedback!

Could not send message

Computing Maximal Unique Matches with the r-Index

Authors Sara Giuliani , Giuseppe Romana , Massimiliano Rossi

File

Document Identifiers

Related Versions

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

Funding

Acknowledgements

Supplementary Materials

References

Thanks for your feedback!

Could not send message