Subset Wavelet Trees

Alanko, Jarno N.; Biagi, Elena; Puglisi, Simon J.; Vuohtoniemi, Jaakko

doi:10.4230/LIPIcs.SEA.2023.4

Abstract

Given an alphabet Σ of σ = |Σ| symbols, a degenerate (or indeterminate) string X is a sequence X = X[0],X[1]…, X[n-1] of n subsets of Σ. Since their introduction in the mid 70s, degenerate strings have been widely studied, with applications driven by their being a natural model for sequences in which there is a degree of uncertainty about the precise symbol at a given position, such as those arising in genomics and proteomics. In this paper we introduce a new data structural tool for degenerate strings, called the subset wavelet tree (SubsetWT). A SubsetWT supports two basic operations on degenerate strings: subset-rank(i,c), which returns the number of subsets up to the i-th subset in the degenerate string that contain the symbol c; and subset-select(i,c), which returns the index in the degenerate string of the i-th subset that contains symbol c. These queries are analogs of rank and select queries that have been widely studied for ordinary strings. Via experiments in a real genomics application in which degenerate strings are fundamental, we show that subset wavelet trees are practical data structures, and in particular offer an attractive space-time tradeoff. Along the way we investigate data structures for supporting (normal) rank queries on base-4 and base-3 sequences, which may be of independent interest. Our C++ implementations of the data structures are available at https://github.com/jnalanko/SubsetWT.

Cite As Get BibTex

Jarno N. Alanko, Elena Biagi, Simon J. Puglisi, and Jaakko Vuohtoniemi. Subset Wavelet Trees. In 21st International Symposium on Experimental Algorithms (SEA 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 265, pp. 4:1-4:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023) https://doi.org/10.4230/LIPIcs.SEA.2023.4

Author Details

Jarno N. Alanko

Helsinki Institute for Information Technology (HIIT), Finland
Department of Computer Science, University of Helsinki, Finland

Elena Biagi

Department of Computer Science, University of Helsinki, Finland

Simon J. Puglisi

Helsinki Institute for Information Technology (HIIT), Finland
Department of Computer Science, University of Helsinki, Finland

Jaakko Vuohtoniemi

Department of Computer Science, University of Helsinki, Finland

Funding

This work was supported in part by the Academy of Finland via grants 339070 and 351150.

Acknowledgements

A brief description of the subset wavelet tree first appeared in a technical report by the authors [Alanko et al., 2022].

Supplementary Materials

Software (Source Code) https://github.com/jnalanko/SubsetWT
Software (Source Code) https://github.com/jnalanko/SubsetWT-Experiments

References

Karl R. Abrahamson. Generalized string matching. SIAM J. Comput., 16(6):1039-1051, 1987.
H. Alamro, M. Alzamel, C.S. Iliopoulos, S. P. Pissis, and S. Watts. IUPACpal: efficient identification of inverted repeats in IUPAC-encoded dna sequences. BMC Bioinformatics, 22(51), 2021.
Jarno N Alanko, Simon J Puglisi, and Jaakko Vuohtoniemi. Succinct k-mer sets using subset rank queries on the spectral Burrows-Wheeler transform. bioRxiv, 2022.
J. Barbay, M. He, I. Munro, and S. Srinivasa Rao. Succinct indexes for strings, binary relations and multilabeled trees. ACM Transactions on Algorithms, 7(4):article 52, 2011.
E. Cambouropoulos, T. Crawford, and C.S. Iliopoulos. Pattern processing in melodic sequences: Challenges, caveats and prospects. Computers and the Humanities, 35:9-21, 2001.
Rayan Chikhi. A tale of optimizing the space taken by de Bruijn graphs. In Proc. 17th Conference on Computability in Europe (CiE), volume 12813 of LNCS, pages 120-134. Springer, 2021.
Maxime Crochemore, Costas S. Iliopoulos, Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, and Tomasz Walen. Covering problems for partial words and for indeterminate strings. Theor. Comput. Sci., 698:25-39, 2017.
P. Ferragina, G. Manzini, V. Mäkinen, and G. Navarro. Compressed representations of sequences and full-text indexes. ACM Transactions on Algorithms, 3(2):article 20, 2007.
Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In 41st Annual Symposium on Foundations of Computer Science, FOCS 2000, 12-14 November 2000, Redondo Beach, California, USA, pages 390-398. IEEE Computer Society, 2000.
Michael J. Fischer and Michael S. Paterson. String-matching and other products. Complexity of Computation, 7:113-125, 1974.
Simon Gog, Timo Beller, Alistair Moffat, and Matthias Petri. From theory to practice: Plug and play with succinct data structures. In Proc. 13th International Symposium on Experimental Algorithms (SEA), LNCS 8504, pages 326-337. Springer, 2014.
Simon Gog and Matthias Petri. Optimized succinct data structures for massive data. Softw. Pract. Exp., 44(11):1287-1314, 2014.
A. Golynski, I. Munro, and S. Srinivasa Rao. Rank/select operations on large alphabets: a tool for text indexing. In Proc. 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 368-373, 2006.
R. Grossi, A. Gupta, and J. S. Vitter. High-order entropy-compressed text indexes. In Proc. 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), !booktitle = "SODA", pages 841-850, 2003.
Guillaume Holley and Páll Melsted. Bifrost: highly parallel construction and indexing of colored and compacted de Bruijn graphs. Genome biology, 21(1):1-20, 2020.
Jan Holub, William F. Smyth, and Shu Wang. Fast pattern-matching on indeterminate strings. J. Discrete Algorithms, 6(1):37-50, 2008.
Costas S. Iliopoulos and Jakub Radoszewski. Truly subquadratic-time extension queries and periodicity detection in strings with uncertainties. In Roberto Grossi and Moshe Lewenstein, editors, Proc. 27th Annual Symposium on Combinatorial Pattern Matching (CPM), volume 54 of LIPIcs, pages 8:1-8:12. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2016.
Costas S. Iliopoulos, M. Sohel Rahman, Michal Vorácek, and Ladislav Vagner. The constrained longest common subsequence problem for degenerate strings. In Jan Holub and Jan Zdárek, editors, Proc. 12th International Conference on Implementation and Application of Automata (CIAA), LNCS 4783, pages 309-311. Springer, 2007.
IUPAC-IUB Commission on Biochemical Nomenclature. Abbreviations and symbols for nucleic acids, polynucleotides, and their constituents. Biochemistry, 9(20):4022-4027, 1970.
Ian B Jeffery, Anubhav Das, Eileen O’Herlihy, Simone Coughlan, Katryna Cisek, Michael Moore, Fintan Bradley, Tom Carty, Meenakshi Pradhan, Chinmay Dwibedi, et al. Differences in fecal microbiomes and metabolomes of people with vs without irritable bowel syndrome and bile acid malabsorption. Gastroenterology, 158(4):1016-1028, 2020.
Mikhail Karasikov, Harun Mustafa, Daniel Danciu, Marc Zimmermann, Christopher Barber, Gunnar Rätsch, and André Kahles. Metagraph: Indexing and analysing nucleotide archives at petabase-scale. BioRxiv, 2020.
Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi. Hybrid compression of bitvectors for the FM-index. In Proceedings of the Data Compression Conference (DCC), pages 302-311, 2014.
Veli Mäkinen, Djamal Belazzougui, Fabio Cunial, and Alexandru I. Tomescu. Genome-Scale Algorithm Design: Biological Sequence Analysis in the Era of High-Throughput Sequencing. Cambridge University Press, 2015.
J. Ian Munro. Tables. In Proc. 16th Conference on Foundations of Software Technology and Theoretical Computer Science, LNCS 1180, pages 37-42. Springer, 1996.
G. Navarro. Wavelet trees for all. In Proc. 23rd Annual Symposium on Combinatorial Pattern Matching (CPM), LNCS 7354, pages 2-26, 2012.
Gonzalo Navarro. Compact Data Structures - A Practical Approach. Cambridge University Press, 2016.
Gonzalo Navarro and Eliana Providel. Fast, small, simple rank/select on bitmaps. In Experimental Algorithms: 11th International Symposium, SEA 2012, Bordeaux, France, June 7-9, 2012. Proceedings 11, pages 295-306. Springer, 2012.
Rajeev Raman. Rank and select operations on bit strings. In Encyclopedia of Algorithms, pages 1772-1775. Springer, 2016. URL: https://doi.org/10.1007/978-1-4939-2864-4_332.
Rajeev Raman, Venkatesh Raman, and S. Srinivasa Rao. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 233-242, 2002.
Sun Wu and Udi Manber. Agrep – a fast approximate pattern-matching tool. In Proc. USENIX Winter 1992 Technical Conference, pages 153-162, 1992.

Subset Wavelet Trees

Authors Jarno N. Alanko, Elena Biagi , Simon J. Puglisi , Jaakko Vuohtoniemi

File

Document Identifiers

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

Acknowledgements

References

Thanks for your feedback!

Could not send message