DROPS

Document

DOI: 10.4230/LIPIcs.WABI.2025.17

An Efficient Data Structure and Algorithm for Long-Match Query in Run-Length Compressed BWT

Authors: Ahsan Sanaullah, Degui Zhi, and Shaojie Zhang

Published in: LIPIcs, Volume 344, 25th International Conference on Algorithms for Bioinformatics (WABI 2025)

Abstract

String matching problems in bioinformatics are typically for finding exact substring matches between a query and a reference text. Previous formulations often focus on maximum exact matches (MEMs). However, multiple occurrences of substrings of the query in the text that are long enough but not maximal may not be captured by MEMs. Such long matches can be informative, especially when the text is a collection of similar sequences such as genomes. In this paper, we describe a new type of match between a pattern and a text that aren't necessarily maximal in the query, but still contain useful matching information: locally maximal exact matches (LEMs). There are usually a large amount of LEMs, so we only consider those above some length threshold ℒ. These are referred to as long LEMs. The purpose of long LEMs is to capture substring matches between a query and a text that are not necessarily maximal in the pattern but still long enough to be important. Therefore efficient long LEMs finding algorithms are desired for these datasets. However, these datasets are too large to query on traditional string indexes. Fortunately, these datasets are very repetitive. Recently, compressed string indexes that take advantage of the redundancy in the data but retain efficient querying capability have been proposed as a solution. We therefore give an efficient algorithm for computing all the long LEMs of a query and a text in a BWT runs compressed string index. We describe an O(m+occ) expected time algorithm that relies on an O(r) words space string index for outputting all long LEMs of a pattern with respect to a text given the matching statistics of the pattern with respect to the text. Here m is the length of the query, occ is the number of long LEMs outputted, and r is the number of runs in the BWT of the text. The O(r) space string index we describe relies on an adaptation of the move data structure by Nishimoto and Tabei. We are able to support LCP[i] queries in constant time given SA[i]. In other words, we answer PLCP[i] queries in constant time. These PLCP queries enable the efficient long LEM query. Long LEMs may provide useful similarity information between a pattern and a text that MEMs may ignore. This information is particularly useful in pangenome and biobank scale haplotype panel contexts.

Cite as

Ahsan Sanaullah, Degui Zhi, and Shaojie Zhang. An Efficient Data Structure and Algorithm for Long-Match Query in Run-Length Compressed BWT. In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 17:1-17:25, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{sanaullah_et_al:LIPIcs.WABI.2025.17,
  author =	{Sanaullah, Ahsan and Zhi, Degui and Zhang, Shaojie},
  title =	{{An Efficient Data Structure and Algorithm for Long-Match Query in Run-Length Compressed BWT}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{17:1--17:25},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.17},
  URN =		{urn:nbn:de:0030-drops-239433},
  doi =		{10.4230/LIPIcs.WABI.2025.17},
  annote =	{Keywords: BWT, LEM, Long LEM, MEM, Run Length Compressed BWT, Move Data Structure, Pangenome}
}

Document

DOI: 10.4230/LIPIcs.WABI.2025.10

DiVerG: Scalable Distance Index for Validation of Paired-End Alignments in Sequence Graphs

Authors: Ali Ghaffaari, Alexander Schönhuth, and Tobias Marschall

Published in: LIPIcs, Volume 344, 25th International Conference on Algorithms for Bioinformatics (WABI 2025)

Abstract

Determining the distance between two loci within a genomic region is a recurrent operation in various tasks in computational genomics. A notable example of this task arises in paired-end read mapping as a form of validation of distances between multiple alignments. While straightforward for a single genome, graph-based reference structures render the operation considerably more involved. Given the sheer number of such queries in a typical read mapping experiment, an efficient algorithm for answering distance queries is crucial. In this paper, we introduce DiVerG, a compact data structure as well as a fast and scalable algorithm, for constructing distance indexes for general sequence graphs on multi-core CPU and many-core GPU architectures. DiVerG is based on PairG [Jain et al., 2019], but overcomes the limitations of PairG by exploiting the extensive potential for improvements in terms of scalability and space efficiency. As a consequence, DiVerG can process substantially larger datasets, such as whole human genomes, which are unmanageable by PairG. DiVerG offers faster index construction time and consistently faster query time with gains proportional to the size of the underlying compact data structure. We demonstrate that our method performs favorably on multiple real datasets at various scales. DiVerG achieves superior performance over PairG; e.g. resulting to 2.5-4x speed-up in query time, 44-340x smaller index size, and 3-50x faster construction time for the genome graph of the MHC region, as a particularly variable region of the human genome. The implementation is available at: https://github.com/cartoonist/diverg

Cite as

Ali Ghaffaari, Alexander Schönhuth, and Tobias Marschall. DiVerG: Scalable Distance Index for Validation of Paired-End Alignments in Sequence Graphs. In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 10:1-10:24, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{ghaffaari_et_al:LIPIcs.WABI.2025.10,
  author =	{Ghaffaari, Ali and Sch\"{o}nhuth, Alexander and Marschall, Tobias},
  title =	{{DiVerG: Scalable Distance Index for Validation of Paired-End Alignments in Sequence Graphs}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{10:1--10:24},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.10},
  URN =		{urn:nbn:de:0030-drops-239369},
  doi =		{10.4230/LIPIcs.WABI.2025.10},
  annote =	{Keywords: Sequence graph, distance index, read mapping, sparse matrix}
}

Document

DOI: 10.4230/LIPIcs.WABI.2025.14

Human Readable Compression of GFA Paths Using Grammar-Based Code

Authors: Peter Heringer and Daniel Doerr

Published in: LIPIcs, Volume 344, 25th International Conference on Algorithms for Bioinformatics (WABI 2025)

Abstract

Pangenome graphs offer a compact and comprehensive representation of genomic diversity, improving tasks such as variant calling, genotyping, and other downstream analyses. Although the underlying graph structures scale sublinearly with the number of haplotypes, the widely used GFA file format suffers from rapidly growing file sizes due to the explicit and repetitive encoding of haplotype paths. In this work, we introduce an extension to the GFA format that enables efficient grammar-based compression of haplotype paths while retaining human readability. In addition, grammar-based encoding provides an efficient in-memory data structure that does not require decompression, but conversely improves the runtime of many computational tasks that involve haplotype comparisons. We present sqz, a method that makes use of the proposed format extension to encode haplotype paths using byte pair encoding, a grammar-based compression scheme. We evaluate sqz on recent human pangenome graphs from Heumos et al. and the Human Pangenome Reference Consortium (HPRC), comparing it to existing compressors bgzip, gbz, and sequitur. sqz scales sublinearly with the number of haplotypes in a pangenome graph and consistently achieves higher compression ratios than sequitur and up to 5 times better compression than bgzip in HPRC graphs and up to 10 times in the graph from Heumos et al.. When combined with bgzip, sqz matches or excels the compression ratio of gbz across all our datasets. These results demonstrate the potential of our proposed extension of the GFA format in reducing haplotype path redundancy and improving storage efficiency for pangenome graphs.

Cite as

Peter Heringer and Daniel Doerr. Human Readable Compression of GFA Paths Using Grammar-Based Code. In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 14:1-14:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{heringer_et_al:LIPIcs.WABI.2025.14,
  author =	{Heringer, Peter and Doerr, Daniel},
  title =	{{Human Readable Compression of GFA Paths Using Grammar-Based Code}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{14:1--14:19},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.14},
  URN =		{urn:nbn:de:0030-drops-239395},
  doi =		{10.4230/LIPIcs.WABI.2025.14},
  annote =	{Keywords: pangenomics, pangenome graphs, compression, grammar-based code, byte pair encoding}
}

Document

Academic Track

DOI: 10.4230/OASIcs.SAIA.2024.3

EAM Diagrams - A Framework to Systematically Describe AI Systems for Effective AI Risk Assessment (Academic Track)

Authors: Ronald Schnitzer, Andreas Hapfelmeier, and Sonja Zillner

Published in: OASIcs, Volume 126, Symposium on Scaling AI Assessments (SAIA 2024)

Abstract

Artificial Intelligence (AI) is a transformative technology that offers new opportunities across various applications. However, the capabilities of AI systems introduce new risks, which require the adaptation of established risk assessment procedures. A prerequisite for any effective risk assessment is a systematic description of the system under consideration, including its inner workings and application environment. Existing system description methodologies are only partially applicable to complex AI systems, as they either address only parts of the AI system, such as datasets or models, or do not consider AI-specific characteristics at all. In this paper, we present a novel framework called EAM Diagrams for the systematic description of AI systems, gathering all relevant information along the AI life cycle required to support a comprehensive risk assessment. The framework introduces diagrams on three levels, covering the AI system’s environment, functional inner workings, and the learning process of integrated Machine Learning (ML) models.

Cite as

Ronald Schnitzer, Andreas Hapfelmeier, and Sonja Zillner. EAM Diagrams - A Framework to Systematically Describe AI Systems for Effective AI Risk Assessment (Academic Track). In Symposium on Scaling AI Assessments (SAIA 2024). Open Access Series in Informatics (OASIcs), Volume 126, pp. 3:1-3:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{schnitzer_et_al:OASIcs.SAIA.2024.3,
  author =	{Schnitzer, Ronald and Hapfelmeier, Andreas and Zillner, Sonja},
  title =	{{EAM Diagrams - A Framework to Systematically Describe AI Systems for Effective AI Risk Assessment}},
  booktitle =	{Symposium on Scaling AI Assessments (SAIA 2024)},
  pages =	{3:1--3:16},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-357-7},
  ISSN =	{2190-6807},
  year =	{2025},
  volume =	{126},
  editor =	{G\"{o}rge, Rebekka and Haedecke, Elena and Poretschkin, Maximilian and Schmitz, Anna},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.SAIA.2024.3},
  URN =		{urn:nbn:de:0030-drops-227432},
  doi =		{10.4230/OASIcs.SAIA.2024.3},
  annote =	{Keywords: AI system description, AI risk assessment, AI auditability}
}

Document

DOI: 10.4230/OASIcs.WCET.2019.4

Worst-Case Energy-Consumption Analysis by Microarchitecture-Aware Timing Analysis for Device-Driven Cyber-Physical Systems

Authors: Phillip Raffeck, Christian Eichler, Peter Wägemann, and Wolfgang Schröder-Preikschat

Published in: OASIcs, Volume 72, 19th International Workshop on Worst-Case Execution Time Analysis (WCET 2019)

Abstract

Many energy-constrained cyber-physical systems require both timeliness and the execution of tasks within given energy budgets. That is, besides knowledge on worst-case execution time (WCET), the worst-case energy consumption (WCEC) of operations is essential. Unfortunately, WCET analysis approaches are not directly applicable for deriving WCEC bounds in device-driven cyber-physical systems: For example, a single memory operation can lead to a significant power-consumption increase when thereby switching on a device (e.g. transceiver, actuator) in the embedded system. However, as we demonstrate in this paper, existing approaches from microarchitecture-aware timing analysis (i.e. considering cache and pipeline effects) are beneficial for determining WCEC bounds: We extended our framework on whole-system analysis with microarchitecture-aware timing modeling to precisely account for the execution time that devices are kept (in)active. Our evaluations based on a benchmark generator, which is able to output benchmarks with known baselines (i.e. actual WCET and actual WCEC), and an ARM Cortex-M4 platform validate that the approach significantly reduces analysis pessimism in whole-system WCEC analyses.

Cite as

Phillip Raffeck, Christian Eichler, Peter Wägemann, and Wolfgang Schröder-Preikschat. Worst-Case Energy-Consumption Analysis by Microarchitecture-Aware Timing Analysis for Device-Driven Cyber-Physical Systems. In 19th International Workshop on Worst-Case Execution Time Analysis (WCET 2019). Open Access Series in Informatics (OASIcs), Volume 72, pp. 4:1-4:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

Copy BibTex To Clipboard

@InProceedings{raffeck_et_al:OASIcs.WCET.2019.4,
  author =	{Raffeck, Phillip and Eichler, Christian and W\"{a}gemann, Peter and Schr\"{o}der-Preikschat, Wolfgang},
  title =	{{Worst-Case Energy-Consumption Analysis by Microarchitecture-Aware Timing Analysis for Device-Driven Cyber-Physical Systems}},
  booktitle =	{19th International Workshop on Worst-Case Execution Time Analysis (WCET 2019)},
  pages =	{4:1--4:12},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-118-4},
  ISSN =	{2190-6807},
  year =	{2019},
  volume =	{72},
  editor =	{Altmeyer, Sebastian},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.WCET.2019.4},
  URN =		{urn:nbn:de:0030-drops-107699},
  doi =		{10.4230/OASIcs.WCET.2019.4},
  annote =	{Keywords: WCEC, WCRE, WCET, michroarchitecture analysis, whole-system analysis}
}

Document

DOI: 10.4230/OASIcs.WCET.2018.6

TASKers: A Whole-System Generator for Benchmarking Real-Time-System Analyses

Authors: Christian Eichler, Tobias Distler, Peter Ulbrich, Peter Wägemann, and Wolfgang Schröder-Preikschat

Published in: OASIcs, Volume 63, 18th International Workshop on Worst-Case Execution Time Analysis (WCET 2018)

Abstract

Implementation-based benchmarking of timing and schedulability analyses requires system code that can be executed on real hardware and has defined properties, for example, known worst-case execution times (WCETs) of tasks. Traditional approaches for creating benchmarks with such characteristics often result in implementations that do not resemble real-world systems, either due to work only being simulated by means of busy waiting, or because tasks have no control-flow dependencies between each other. In this paper, we address this problem with TASKers, a generator that constructs realistic benchmark systems with predefined properties. To achieve this, TASKers composes patterns of real-world programs to generate tasks that produce known outputs and exhibit preconfigured WCETs when being executed with certain inputs. Using this knowledge during the generation process, TASKers is able to specifically introduce inter-task control-flow dependencies by mapping the output of one task to the input of another.

Cite as

Christian Eichler, Tobias Distler, Peter Ulbrich, Peter Wägemann, and Wolfgang Schröder-Preikschat. TASKers: A Whole-System Generator for Benchmarking Real-Time-System Analyses. In 18th International Workshop on Worst-Case Execution Time Analysis (WCET 2018). Open Access Series in Informatics (OASIcs), Volume 63, pp. 6:1-6:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)

Copy BibTex To Clipboard

@InProceedings{eichler_et_al:OASIcs.WCET.2018.6,
  author =	{Eichler, Christian and Distler, Tobias and Ulbrich, Peter and W\"{a}gemann, Peter and Schr\"{o}der-Preikschat, Wolfgang},
  title =	{{TASKers: A Whole-System Generator for Benchmarking Real-Time-System Analyses}},
  booktitle =	{18th International Workshop on Worst-Case Execution Time Analysis (WCET 2018)},
  pages =	{6:1--6:12},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-073-6},
  ISSN =	{2190-6807},
  year =	{2018},
  volume =	{63},
  editor =	{Brandner, Florian},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.WCET.2018.6},
  URN =		{urn:nbn:de:0030-drops-97528},
  doi =		{10.4230/OASIcs.WCET.2018.6},
  annote =	{Keywords: benchmarking real-time-system analyses, task-set generation, whole-system generation, static timing analysis, WCET analysis}
}

6 Search Results for "Eichler, Christian"

An Efficient Data Structure and Algorithm for Long-Match Query in Run-Length Compressed BWT

Abstract

Cite as

DiVerG: Scalable Distance Index for Validation of Paired-End Alignments in Sequence Graphs

Abstract

Cite as

Human Readable Compression of GFA Paths Using Grammar-Based Code

Abstract

Cite as

EAM Diagrams - A Framework to Systematically Describe AI Systems for Effective AI Risk Assessment (Academic Track)

Abstract

Cite as

Worst-Case Energy-Consumption Analysis by Microarchitecture-Aware Timing Analysis for Device-Driven Cyber-Physical Systems

Abstract

Cite as

TASKers: A Whole-System Generator for Benchmarking Real-Time-System Analyses

Abstract

Cite as

Thanks for your feedback!

Could not send message