DROPS

Document

DOI: 10.4230/LIPIcs.WABI.2025.8

Lossless Pangenome Indexing Using Tag Arrays

Authors: Parsa Eskandar, Benedict Paten, and Jouni Sirén

Published in: LIPIcs, Volume 344, 25th International Conference on Algorithms for Bioinformatics (WABI 2025)

Abstract

Pangenome graphs represent the genomic variation by encoding multiple haplotypes within a unified graph structure. However, efficient and lossless indexing of such structures remains challenging due to the scale and complexity of pangenomic data. We present a practical and scalable indexing framework based on tag arrays, which annotate positions in the Burrows-Wheeler transform (BWT) with graph coordinates. Our method extends the FM-index with a run-length compressed tag structure that enables efficient retrieval of all unique graph locations where a query pattern appears. We introduce a novel construction algorithm that combines unique k-mers, graph-based extensions, and haplotype traversal to compute the tag array in a memory-efficient manner. To support large genomes, we process each chromosome independently and then merge the results into a unified index using properties of the multi-string BWT and r-index. Our evaluation on the HPRC graphs demonstrates that the tag array structure compresses effectively, scales well with added haplotypes, and preserves accurate mapping information across diverse regions of the genome. This indexing method enables lossless and haplotype-aware querying in complex pangenomes and offers a practical indexing layer to develop scalable aligners and downstream graph-based analysis tools.

Cite as

Parsa Eskandar, Benedict Paten, and Jouni Sirén. Lossless Pangenome Indexing Using Tag Arrays. In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 8:1-8:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{eskandar_et_al:LIPIcs.WABI.2025.8,
  author =	{Eskandar, Parsa and Paten, Benedict and Sir\'{e}n, Jouni},
  title =	{{Lossless Pangenome Indexing Using Tag Arrays}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{8:1--8:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.8},
  URN =		{urn:nbn:de:0030-drops-239348},
  doi =		{10.4230/LIPIcs.WABI.2025.8},
  annote =	{Keywords: pangenome indexing, Burrows-Wheeler transform, tag arrays}
}

Document

DOI: 10.4230/OASIcs.Manzini.13

Graph Indexing Beyond Wheeler Graphs

Authors: Jarno N. Alanko, Elena Biagi, Massimo Equi, Veli Mäkinen, Simon J. Puglisi, Nicola Rizzo, Kunihiko Sadakane, and Jouni Sirén

Published in: OASIcs, Volume 131, The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday (2025)

Abstract

After the discovery of the FM index, which linked the Burrows-Wheeler transform (BWT) to pattern matching on strings, several contemporaneous strands of research began on indexing more complex structures with the BWT, such as tries, finite languages, de Bruijn graphs, and aligned sequences. These directions can now be viewed as culminating in the theory of Wheeler Graphs, but sometimes they go beyond. This chapter reviews the significant body of "proto Wheeler Graph" indexes, many of which exploit characteristics of their specific case to outperform Wheeler graphs, especially in practice.

Cite as

Jarno N. Alanko, Elena Biagi, Massimo Equi, Veli Mäkinen, Simon J. Puglisi, Nicola Rizzo, Kunihiko Sadakane, and Jouni Sirén. Graph Indexing Beyond Wheeler Graphs. In The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday. Open Access Series in Informatics (OASIcs), Volume 131, pp. 13:1-13:29, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{alanko_et_al:OASIcs.Manzini.13,
  author =	{Alanko, Jarno N. and Biagi, Elena and Equi, Massimo and M\"{a}kinen, Veli and Puglisi, Simon J. and Rizzo, Nicola and Sadakane, Kunihiko and Sir\'{e}n, Jouni},
  title =	{{Graph Indexing Beyond Wheeler Graphs}},
  booktitle =	{The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday},
  pages =	{13:1--13:29},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-390-4},
  ISSN =	{2190-6807},
  year =	{2025},
  volume =	{131},
  editor =	{Ferragina, Paolo and Gagie, Travis and Navarro, Gonzalo},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.Manzini.13},
  URN =		{urn:nbn:de:0030-drops-239215},
  doi =		{10.4230/OASIcs.Manzini.13},
  annote =	{Keywords: indexing, compression, compressed data structures, string algorithms, pattern matching}
}

Document

DOI: 10.4230/LIPIcs.WABI.2018.4

Haplotype-aware graph indexes

Authors: Jouni Sirén, Erik Garrison, Adam M. Novak, Benedict J. Paten, and Richard Durbin

Published in: LIPIcs, Volume 113, 18th International Workshop on Algorithms in Bioinformatics (WABI 2018)

Abstract

The variation graph toolkit (VG) represents genetic variation as a graph. Each path in the graph is a potential haplotype, though most paths are unlikely recombinations of true haplotypes. We augment the VG model with haplotype information to identify which paths are more likely to be correct. For this purpose, we develop a scalable implementation of the graph extension of the positional Burrows-Wheeler transform. We demonstrate the scalability of the new implementation by indexing the 1000 Genomes Project haplotypes. We also develop an algorithm for simplifying variation graphs for k-mer indexing without losing any k-mers in the haplotypes.

Cite as

Jouni Sirén, Erik Garrison, Adam M. Novak, Benedict J. Paten, and Richard Durbin. Haplotype-aware graph indexes. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 4:1-4:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)

Copy BibTex To Clipboard

@InProceedings{siren_et_al:LIPIcs.WABI.2018.4,
  author =	{Sir\'{e}n, Jouni and Garrison, Erik and Novak, Adam M. and Paten, Benedict J. and Durbin, Richard},
  title =	{{Haplotype-aware graph indexes}},
  booktitle =	{18th International Workshop on Algorithms in Bioinformatics (WABI 2018)},
  pages =	{4:1--4:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-082-8},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{113},
  editor =	{Parida, Laxmi and Ukkonen, Esko},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2018.4},
  URN =		{urn:nbn:de:0030-drops-93060},
  doi =		{10.4230/LIPIcs.WABI.2018.4},
  annote =	{Keywords: FM-indexes, variation graphs, haplotypes}
}

Document

DOI: 10.4230/DagSemProc.08261.10

Storage and Retrieval of Individual Genomes

Authors: Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki

Published in: Dagstuhl Seminar Proceedings, Volume 8261, Structure-Based Compression of Complex Massive Data (2008)

Abstract

A repetitive sequence collection is one where portions of a emph{base sequence} of length $n$ are repeated many times with small variations, forming a collection of total length $N$. Examples of such collections are version control data and genome sequences of individuals, where the differences can be expressed by lists of basic edit operations. Flexible and efficient data analysis on a such typically huge collection is plausible using suffix trees. However, suffix tree occupies $O(N log N)$ bits, which very soon inhibits in-memory analyses. Recent advances in full-text emph{self-indexing} reduce the space of suffix tree to $O(N log sigma)$ bits, where $sigma$ is the alphabet size. In practice, the space reduction is more than $10$-fold for example on suffix tree of Human Genome. However, this reduction remains a constant factor when more sequences are added to the collection We develop a new self-index suited for the repetitive sequence collection setting. Its expected space requirement depends only on the length $n$ of the base sequence and the number $s$ of variations in its repeated copies. That is, the space reduction is no longer constant, but depends on $N/n$. We believe the structure developed in this work will provide a fundamental basis for storage and retrieval of individual genomes as they become available due to rapid progress in the sequencing technologies.

Cite as

Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki. Storage and Retrieval of Individual Genomes. In Structure-Based Compression of Complex Massive Data. Dagstuhl Seminar Proceedings, Volume 8261, pp. 1-14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2008)

Copy BibTex To Clipboard

@InProceedings{makinen_et_al:DagSemProc.08261.10,
  author =	{M\"{a}kinen, Veli and Navarro, Gonzalo and Sir\'{e}n, Jouni and V\"{a}lim\"{a}ki, Niko},
  title =	{{Storage and Retrieval of Individual Genomes}},
  booktitle =	{Structure-Based Compression of Complex Massive Data},
  pages =	{1--14},
  series =	{Dagstuhl Seminar Proceedings (DagSemProc)},
  ISSN =	{1862-4405},
  year =	{2008},
  volume =	{8261},
  editor =	{Stefan B\"{o}ttcher and Markus Lohrey and Sebastian Maneth and Wojcieh Rytter},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/DagSemProc.08261.10},
  URN =		{urn:nbn:de:0030-drops-16743},
  doi =		{10.4230/DagSemProc.08261.10},
  annote =	{Keywords: Pattern matching, text indexing, compressed data structures, comparative genomics}
}

Search Results

Documents authored by Sirén, Jouni

Lossless Pangenome Indexing Using Tag Arrays

Abstract

Cite as

Graph Indexing Beyond Wheeler Graphs

Abstract

Cite as

Haplotype-aware graph indexes

Abstract

Cite as

Storage and Retrieval of Individual Genomes

Abstract

Cite as

Thanks for your feedback!

Could not send message