PLA-index: A k-mer Index Exploiting Rank Curve Linearity

Authors: Md. Hasin Abrar, Paul Medvedev




File

LIPIcs.WABI.2024.13.pdf
  • Filesize: 2.39 MB
  • 18 pages

Author Details

Md. Hasin Abrar
  • Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA
Paul Medvedev
  • Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA
  • Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA, USA
  • Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA

Acknowledgements

We thank Kristoffer Sahlin and Marcel Martin for their help in identifying the Strobealign application and helping guide its implementation. We also thank Francesca Chiaromonte, Marzia Cremona, Jacopo Di Iorio, and Giorgio Vinciguerra for helpful discussions.

Cite As

Md. Hasin Abrar and Paul Medvedev. PLA-index: A k-mer Index Exploiting Rank Curve Linearity. In 24th International Workshop on Algorithms in Bioinformatics (WABI 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 312, pp. 13:1-13:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.WABI.2024.13

Abstract

Given a sorted list of k-mers S, the rank curve of S is the function mapping a k-mer from the k-mer universe to the location in S where it either first appears or would be inserted. An exciting recent development is the observation that, for certain datasets, the rank curve is predictable and can be exploited to create small search indices. In this paper, we develop a novel search index that first estimates a k-mer’s rank using a piece-wise linear approximation of the rank curve and then does a local search to determine the precise location of the k-mer in the list. We combine ideas from previous approaches and supplement them with an innovative data representation strategy that substantially reduces space usage. Our PLA-index uses an order of magnitude less space than Sapling and less than half the space of the PGM-index, for roughly the same query time. For example, using only 9 MiB of memory, it can narrow down the position of a k-mer in the suffix array of the human genome to within 255 positions. Furthermore, we demonstrate the potential of our approach to impact a variety of downstream applications. First, the PLA-index halves the time of binary search on the suffix array of the human genome. Second, the PLA-index reduces the space of a direct-access lookup table by 76 percent, without increasing the run time. Third, we plug the PLA-index into the state-of-the-art read aligner Strobealign and replace a 2 GiB component with a PLA-index of size 1.5 MiB, without significantly affecting the run time. The software and reproducibility information are freely available at https://github.com/medvedevgroup/pla-index.
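
The two-step query described in the abstract (estimate the rank from a piece-wise linear approximation, then search locally) can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not reproduce their space-saving data representation. The function names (encode, build_pla, predict, query), the greedy "shrinking cone" fitting procedure, the error bound eps, and the toy DNA string are all assumptions made for illustration. The sketch also assumes a non-empty list S and only guarantees the error bound for k-mers that occur in S, so it falls back to a full binary search whenever the local window does not contain the answer.

```python
from bisect import bisect_left, bisect_right
from math import ceil, floor, inf

ENC = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(kmer):
    """Map a k-mer to its base-4 integer value (order-preserving for fixed k)."""
    value = 0
    for base in kmer:
        value = 4 * value + ENC[base]
    return value

def build_pla(keys, eps):
    """Greedy fit over the points (k-mer value, first rank) of sorted `keys`.

    Returns segments (start_key, start_rank, slope). Each segment passes
    through its first point and predicts the rank of every key it covers
    to within eps. This is a simple illustrative fit, not an optimal one.
    """
    points = []
    for rank, key in enumerate(keys):
        if not points or points[-1][0] != key:  # first occurrence only
            points.append((key, rank))
    segments = []

    def close(x0, y0, lo, hi):
        slope = 0.0 if hi == inf else (lo + hi) / 2
        segments.append((x0, y0, slope))

    x0, y0 = points[0]
    lo, hi = 0.0, inf  # feasible slope range for the current segment
    for x, y in points[1:]:
        dx = x - x0
        new_lo = max(lo, (y - eps - y0) / dx)
        new_hi = min(hi, (y + eps - y0) / dx)
        if new_lo <= new_hi:
            lo, hi = new_lo, new_hi       # point fits: shrink the cone
        else:
            close(x0, y0, lo, hi)         # cone is empty: start a new segment
            x0, y0, lo, hi = x, y, 0.0, inf
    close(x0, y0, lo, hi)
    return segments

def predict(segments, starts, x):
    """Estimate the rank of x from the segment whose start key is <= x."""
    i = max(0, bisect_right(starts, x) - 1)
    x0, y0, slope = segments[i]
    return y0 + slope * (x - x0)

def query(keys, segments, starts, eps, x):
    """Return the rank of x: estimate first, then search a small window."""
    n = len(keys)
    pred = predict(segments, starts, x)
    # Window padded by one position to absorb floating-point rounding.
    lo = min(n, max(0, floor(pred) - eps - 1))
    hi = max(lo, min(n, ceil(pred) + eps + 1))
    pos = bisect_left(keys, x, lo, hi)
    valid = (pos == 0 or keys[pos - 1] < x) and (pos == n or keys[pos] >= x)
    if not valid:                         # e.g. x absent from S: fall back
        pos = bisect_left(keys, x)
    return pos

# Toy example (hypothetical data): all 3-mers of a short DNA string.
text = "ACGTTGCAACGTAGCTAGCTAGGATCCGATTACA"
k = 3
keys = sorted(encode(text[i:i + k]) for i in range(len(text) - k + 1))
eps = 2
segments = build_pla(keys, eps)
starts = [seg[0] for seg in segments]
print(len(segments), query(keys, segments, starts, eps, encode("GCT")))
```

In this sketch the index proper is only the list of segments, while the sorted list itself is consulted just inside the window of roughly 2 * eps positions around the estimate. A larger eps yields fewer segments but a wider local search, which is the space/time trade-off the abstract alludes to.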

Subject Classification

ACM Subject Classification
  • Applied computing → Bioinformatics
  • Applied computing → Computational biology
  • Theory of computation → Data structures design and analysis
Keywords
  • K-mer index
  • Piece-wise linear approximation
  • Learned index
