The TAG Array of a Multiple Sequence Alignment

Olbrich, Jannik; Ohlebusch, Enno

doi:10.4230/LIPIcs.CPM.2026.29

Abstract

Modern genomic analyses increasingly rely on pangenomes, that is, representations of the genomes of entire populations. The simplest representation of a pangenome is a set of individual genome sequences. Compared to, e.g., sequence graphs, this has the advantages that efficient exact search via indexes based on the Burrows-Wheeler Transform (BWT) is possible, that no chimeric sequences are created, and that the results are not influenced by heuristics. However, such an index may report a match at thousands of positions even if they all correspond to the same locus, making downstream analysis unnecessarily expensive. For sufficiently similar sequences (e.g., human chromosomes), a multiple sequence alignment (MSA) can be computed. Since an MSA tends to group similar strings in the same columns, a string occurring thousands of times in the pangenome can likely be described by very few columns of the MSA. We describe a method to tag entries in the BWT with the corresponding column in the MSA and develop an index that can map matches in the BWT to columns in the MSA in time proportional to the output. As a by-product, we can project a match to a designated reference genome, a capability that current pangenome aligners lack.
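
The tagging idea can be illustrated on a toy alignment. The following is a minimal sketch, not the index described in the paper: it sorts suffixes naively instead of building a compressed BWT, and it collects the distinct columns by scanning the whole match range, so it does not run in time proportional to the output. It also assumes that the tag of an entry is the MSA column at which the corresponding suffix starts. The function names `build_tagged_index` and `match_columns` and the example rows are hypothetical.

```python
from bisect import bisect_left, bisect_right


def build_tagged_index(msa_rows):
    """Build a naive suffix array over the gap-free rows of a toy MSA.

    Returns (text, sa, tags), where tags[i] is the MSA column at which the
    i-th smallest suffix starts (assumed tagging rule), or -1 for a separator.
    """
    chars = []
    cols = []  # cols[p] = MSA column of text position p, -1 for separators
    for row in msa_rows:
        for col, ch in enumerate(row):
            if ch != "-":  # gaps are not part of the sequence
                chars.append(ch)
                cols.append(col)
        chars.append("$")  # separator; assumed absent from patterns
        cols.append(-1)
    text = "".join(chars)
    # Naive suffix sorting: quadratic space, only suitable for tiny inputs.
    sa = sorted(range(len(text)), key=lambda p: text[p:])
    tags = [cols[p] for p in sa]
    return text, sa, tags


def match_columns(text, sa, tags, pattern):
    """Return (distinct MSA columns of all matches, number of matches)."""
    # Length-m prefixes of sorted suffixes are themselves sorted, so the
    # matches form one contiguous range that binary search can locate.
    m = len(pattern)
    prefixes = [text[p:p + m] for p in sa]
    lo = bisect_left(prefixes, pattern)
    hi = bisect_right(prefixes, pattern)
    # Linear scan of the range; the paper's index avoids this scan.
    return sorted(set(tags[lo:hi])), hi - lo


rows = ["ACGT-ACGT", "ACGTTACGT", "AC-T-ACGT"]
text, sa, tags = build_tagged_index(rows)
print(match_columns(text, sa, tags, "ACG"))  # prints ([0, 5], 5)
```

In this example the pattern occurs five times across the three sequences, but all occurrences start in only two columns of the alignment, which is the kind of reduction the abstract describes.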

