PLA-index: A k-mer Index Exploiting Rank Curve Linearity

Abrar, Md. Hasin; Medvedev, Paul

doi:10.4230/LIPIcs.WABI.2024.13

Abstract

Given a sorted list of k-mers S, the rank curve of S is the function mapping a k-mer from the k-mer universe to the location in S where it either first appears or would be inserted. An exciting recent development is the observation that, for certain datasets, the rank curve is predictable and can be exploited to create small search indices. In this paper, we develop a novel search index that first estimates a k-mer's rank using a piece-wise linear approximation of the rank curve and then does a local search to determine the precise location of the k-mer in the list. We combine ideas from previous approaches and supplement them with an innovative data representation strategy that substantially reduces space usage. Our PLA-index uses an order of magnitude less space than Sapling and less than half the space of the PGM-index, for roughly the same query time. For example, using only 9 MiB of memory, it can narrow down the position of a k-mer in the suffix array of the human genome to within 255 positions. Furthermore, we demonstrate the potential of our approach to impact a variety of downstream applications. First, the PLA-index halves the time of binary search on the suffix array of the human genome. Second, the PLA-index reduces the space of a direct-access lookup table by 76 percent, without increasing the run time. Third, we plug the PLA-index into the state-of-the-art read aligner Strobealign and replace a 2 GiB component with a PLA-index of size 1.5 MiB, without significantly affecting run time. The software and reproducibility information are freely available at https://github.com/medvedevgroup/pla-index.
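The two-step query described in the abstract (estimate the rank from a piece-wise linear approximation, then search locally) can be illustrated with a minimal Python sketch. It is not the PLA-index implementation: it assumes k-mers are encoded as strictly increasing integers, it substitutes a simple greedy "shrinking cone" fit for whatever segmentation the paper uses, and it omits the compact data representation entirely. The names `build_pla`, `rank`, and `eps` are hypothetical and chosen only for illustration.

```python
from bisect import bisect_left, bisect_right

def build_pla(keys, eps):
    """Greedy fit over strictly increasing integer keys (an assumption).
    Returns (start_key, start_rank, slope) segments such that each indexed
    key's predicted rank is within eps of its true rank."""
    inf = float("inf")
    pick = lambda lo, hi: lo if hi == inf else (lo + hi) / 2
    segments = []
    x0, y0 = keys[0], 0
    lo, hi = 0.0, inf                     # feasible slopes for the open segment
    for y in range(1, len(keys)):
        dx = keys[y] - x0
        new_lo = max(lo, (y - y0 - eps) / dx)
        new_hi = min(hi, (y - y0 + eps) / dx)
        if new_lo <= new_hi:
            lo, hi = new_lo, new_hi       # point fits: narrow the cone
        else:
            segments.append((x0, y0, pick(lo, hi)))
            x0, y0 = keys[y], y           # start a new segment at this key
            lo, hi = 0.0, inf
    segments.append((x0, y0, pick(lo, hi)))
    return segments

def rank(keys, segments, starts, eps, x):
    """Estimate the rank of x from its segment, then binary-search a small
    window around the estimate for the exact insertion position."""
    j = max(0, bisect_right(starts, x) - 1)
    x0, y0, slope = segments[j]
    next_rank = segments[j + 1][1] if j + 1 < len(segments) else len(keys)
    pred = y0 + slope * (x - x0)
    pred = min(max(pred, y0), next_rank)  # clamp to the segment's rank range
    lo = max(0, int(pred) - eps - 1)      # one extra slot covers absent keys
    hi = min(len(keys), int(pred) + eps + 2)
    return bisect_left(keys, x, lo, hi)

keys = [3, 8, 9, 15, 40, 41, 42, 90, 200, 201]   # toy sorted "k-mers"
eps = 1
segments = build_pla(keys, eps)
starts = [s[0] for s in segments]
# The windowed search agrees with a full binary search on every query.
assert all(rank(keys, segments, starts, eps, q) == bisect_left(keys, q)
           for q in range(0, 210))
```

In this sketch the local search touches at most 2 * eps + 3 positions regardless of the list length, which is the kind of trade-off between index size and search window that the abstract describes.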

Cite As

Md. Hasin Abrar and Paul Medvedev. PLA-index: A k-mer Index Exploiting Rank Curve Linearity. In 24th International Workshop on Algorithms in Bioinformatics (WABI 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 312, pp. 13:1-13:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024) https://doi.org/10.4230/LIPIcs.WABI.2024.13

Author Details

Md. Hasin Abrar

Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA

Paul Medvedev

Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA
Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA, USA
Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA

Funding

This material is based upon work supported by the National Science Foundation under Grants No. DBI2138585 and 1931531. Research reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under Award Number R01GM146462. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Acknowledgements

We thank Kristoffer Sahlin and Marcel Martin for their help in identifying the Strobealign application and helping guide its implementation. We also thank Francesca Chiaromonte, Marzia Cremona, Jacopo Di Iorio, and Giorgio Vinciguerra for helpful discussions.

Supplementary Materials

Software: https://github.com/medvedevgroup/pla-index
Md. Hasin Abrar and Paul Medvedev. pla-index (Software). Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024). https://doi.org/10.4230/artifacts.22506

References

Md. Hasin Abrar and Paul Medvedev. pla-index. Software (visited on 2024-08-16). URL: https://github.com/medvedevgroup/pla-index. Full metadata available at: https://doi.org/10.4230/artifacts.22506.
Md. Hasin Abrar and Paul Medvedev. PLA-complexity of k-mer multisets. bioRxiv, 2024. URL: https://www.biorxiv.org/content/10.1101/2024.02.08.579510v1.
Antonio Boffa, Paolo Ferragina, and Giorgio Vinciguerra. A learned approach to design compressed rank/select data structures. ACM Transactions on Algorithms, 18(3):1-28, 2022. URL: https://doi.org/10.1145/3524060.
Nieves R. Brisaboa, Susana Ladra, and Gonzalo Navarro. DACs: Bringing direct access to variable-length codes. Information Processing & Management, 49(1):392-404, 2013. URL: https://doi.org/10.1016/j.ipm.2012.08.003.
Rayan Chikhi, Jan Holub, and Paul Medvedev. Data structures to represent a set of k-long DNA sequences. ACM Computing Surveys (CSUR), 54(1):1-22, 2021. URL: https://doi.org/10.1145/3445967.
Peter Elias. Efficient storage and retrieval by content and address of static files. Journal of the ACM (JACM), 21(2):246-260, 1974.
Robert Mario Fano. On the number of bits required to implement an associative memory. memorandum 61. Computer Structures Group, Project MAC, MIT, Cambridge, Mass., nd, page 27, 1971.
Paolo Ferragina, Hans-Peter Lehmann, Peter Sanders, and Giorgio Vinciguerra. Learned monotone minimal perfect hashing. In 31st Annual European Symposium on Algorithms, ESA 2023, September 4-6, 2023, Amsterdam, The Netherlands. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2023. URL: https://doi.org/10.4230/LIPICS.ESA.2023.46.
Paolo Ferragina and Giorgio Vinciguerra. The PGM-index. Proceedings of the VLDB Endowment, 13(8):1162-1175, 2020. URL: https://doi.org/10.14778/3389133.3389135.
Alex Galakatos, Michael Markovitch, Carsten Binnig, Rodrigo Fonseca, and Tim Kraska. FITing-Tree: A data-aware index structure. In Proceedings of the 2019 International Conference on Management of Data. ACM, 2019. URL: https://doi.org/10.1145/3299869.3319860.
Simon Gog, Timo Beller, Alistair Moffat, and Matthias Petri. From theory to practice: Plug and play with succinct data structures. In 13th International Symposium on Experimental Algorithms, (SEA 2014), pages 326-337, 2014. URL: https://doi.org/10.1007/978-3-319-07959-2_28.
Darryl Ho, Saurabh Kalikar, Sanchit Misra, Jialin Ding, Vasimuddin Md, Nesime Tatbul, Heng Li, and Tim Kraska. LISA: Learned indexes for sequence analysis. bioRxiv, 2021. URL: https://doi.org/10.1101/2020.12.22.423964.
Youngmok Jung and Dongsu Han. BWA-MEME: BWA-MEM emulated with a machine learning approach. Bioinformatics, 38(9):2404-2413, 2022. URL: https://doi.org/10.1093/bioinformatics/btac137.
Saurabh Kalikar, Chirag Jain, Md Vasimuddin, and Sanchit Misra. Accelerating minimap2 for long-read sequencing applications on modern cpus. Nature Computational Science, 2(2):78-83, 2022. URL: https://doi.org/10.1038/s43588-022-00201-8.
Melanie Kirsche, Arun Das, and Michael C Schatz. Sapling: Accelerating suffix array queries with learned data models. Bioinformatics, 37(6):744-749, 2021.
Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, and Neoklis Polyzotis. The case for learned index structures. In Proceedings of the 2018 International Conference on Management of Data. ACM, 2018. URL: https://doi.org/10.1145/3183713.3196909.
Camille Marchet, Christina Boucher, Simon J. Puglisi, Paul Medvedev, Mikaël Salson, and Rayan Chikhi. Data structures based on k-mers for querying large collections of sequencing data sets. Genome Research, 31(1):1-12, 2020. URL: https://doi.org/10.1101/gr.260604.119.
Yuta Mori. libdivsufsort. URL: https://github.com/y-256/libdivsufsort/.
Joseph O'Rourke. An on-line algorithm for fitting straight lines between data ranges. Communications of the ACM, 24(9):574-578, 1981. URL: https://doi.org/10.1145/358746.358758.
Giulio Pibiri. pthash. URL: https://github.com/jermp/pthash.
Giulio Ermanno Pibiri. Sparse and skew hashing of k-mers. Bioinformatics, 38(Supplement_1):i185-i194, 2022.
Giulio Ermanno Pibiri and Roberto Trani. PTHash: Revisiting FCH minimal perfect hashing. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1339-1348, 2021.
Giulio Ermanno Pibiri and Rossano Venturini. Techniques for inverted index compression. ACM Computing Surveys, 53(6):1-36, 2020. URL: https://doi.org/10.1145/3415148.
Drosophila reference genome. URL: ftp://ftp.ensembl.org/pub/release-97/fasta/drosophila_melanogaster/dna/Drosophila_melanogaster.BDGP6.22.dna.toplevel.fa.gz.
Arang Rhie, Sergey Nurk, Monika Cechova, Savannah J Hoyt, Dylan J Taylor, Nicolas Altemose, Paul W Hook, Sergey Koren, Mikko Rautiainen, Ivan A Alexandrov, et al. The complete sequence of a human Y chromosome. Nature, 621(7978):344-354, 2023.
Kristoffer Sahlin. Strobealign: flexible seed size enables ultra-fast and accurate read alignment. Genome Biology, 23(1), 2022. URL: https://doi.org/10.1186/s13059-022-02831-7.
Kristoffer Sahlin and Marcel Martin. Personal communication.
