Genotype phasing, the process of reconstructing haplotypes from genotype data, is a fundamental problem in genomics with applications in ancestry inference, imputation, and disease association. Traditional phasing methods rely on statistical models or combinatorial approaches, which can be computationally expensive, particularly when applied to large-scale reference panels. In this paper, we present a first exploration of using the μ-PBWT (a run-length encoded Positional Burrows-Wheeler Transform) to solve the genotype phasing problem with a reference panel. Leveraging our previous results on positional substrings, we propose an approach that can explain a query genotype if the corresponding haplotype pair exists in the input panel. We also extend the method to cases where no such pair exists; there, regions that cannot be explicitly explained using the reference panel should remain unphased. We implemented this method and compared it against Beagle, a state-of-the-art phasing tool. In the absence of mutations and recombinations, our approach correctly identifies the haplotype pair that explains a genotype query while using seven times less memory than Beagle. However, we also observe that phasing quality decreases as mutation rates increase, because consistent haplotype pairs become harder to identify in the presence of sequence variation. These findings highlight the potential of μ-PBWT as an efficient alternative for genotype phasing, particularly in settings where computational resources are limited. The source code is publicly available at https://github.com/dlcgold/muPBWT/tree/phase.
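To make concrete what it means for a haplotype pair to "explain" a genotype, the following is a minimal brute-force sketch. It is not the μ-PBWT method described in the paper. It assumes biallelic sites, haplotypes encoded as lists of 0/1, and genotypes encoded as the per-site count of alternate alleles (0, 1, or 2). The function names `explains` and `find_explaining_pairs` and the toy panel are hypothetical, chosen only for illustration.

```python
from itertools import combinations_with_replacement


def explains(h1, h2, genotype):
    """Return True if the two haplotypes sum, site by site, to the genotype.

    Assumed encoding (not taken from the paper): haplotype alleles are 0/1,
    genotype values are the number of alternate alleles (0, 1, or 2).
    """
    if not (len(h1) == len(h2) == len(genotype)):
        return False
    return all(a + b == g for a, b, g in zip(h1, h2, genotype))


def find_explaining_pairs(panel, genotype):
    """Brute force over all unordered pairs (i, j), i <= j, of panel rows.

    This is quadratic in the panel size and only illustrates the definition;
    it does not use any compressed index.
    """
    indices = range(len(panel))
    return [
        (i, j)
        for i, j in combinations_with_replacement(indices, 2)
        if explains(panel[i], panel[j], genotype)
    ]


# Hypothetical toy panel: 4 haplotypes over 4 sites.
panel = [
    [0, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
]

pairs = find_explaining_pairs(panel, [1, 2, 0, 1])       # explained by rows 0 and 1
ambiguous = find_explaining_pairs(panel, [1, 1, 1, 1])   # two different pairs fit
unexplained = find_explaining_pairs(panel, [2, 2, 2, 2]) # no pair in the panel fits
```

In this toy panel, the all-heterozygous genotype is explained by more than one pair, and the last query by none. The second outcome corresponds to the case the abstract mentions where no explaining pair exists in the panel.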
@InProceedings{cozzi_et_al:OASIcs.Manzini.10,
  author =	{Cozzi, Davide and Bonizzoni, Paola and Boucher, Christina and Langmead, Ben and Pirola, Yuri},
  title =	{{Phasing Data from Genotype Queries via the \mu-PBWT}},
  booktitle =	{The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday},
  pages =	{10:1--10:17},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-390-4},
  ISSN =	{2190-6807},
  year =	{2025},
  volume =	{131},
  editor =	{Ferragina, Paolo and Gagie, Travis and Navarro, Gonzalo},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.Manzini.10},
  URN =		{urn:nbn:de:0030-drops-239183},
  doi =		{10.4230/OASIcs.Manzini.10},
  annote =	{Keywords: Positional Burrows-Wheeler Transform, r-index, minimal position substring cover, set-maximal exact matches, genotype phasing}
}