A*PA2: Up to 19× Faster Exact Global Alignment

Author Ragnar Groot Koerkamp




Author Details

Ragnar Groot Koerkamp
  • ETH Zurich, Switzerland

Acknowledgements

I am grateful to Pesho Ivanov for many discussions on A*PA. Furthermore, I thank Daniel Liu for discussions, feedback, and suggesting additional related papers; André Kahles, Harun Mustafa, and Gunnar Rätsch for feedback on the manuscript; Andrea Guarracino and Santiago Marco-Sola for sharing the WFA and BiWFA benchmark datasets; and Gary Benson for help with debugging the BitPAl bitpacking code.

Cite As

Ragnar Groot Koerkamp. A*PA2: Up to 19× Faster Exact Global Alignment. In 24th International Workshop on Algorithms in Bioinformatics (WABI 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 312, pp. 17:1-17:25, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.WABI.2024.17

Abstract

Motivation. Pairwise alignment is at the core of computational biology. Most commonly used exact methods are either based on O(ns) band doubling or O(n+s²) diagonal transition, where n is the sequence length and s the number of errors. However, as the length of sequences has grown, these exact methods are often replaced by approximate methods based on e.g. seed-and-extend and heuristics to bound the computed region. We would like to develop an exact method that matches the performance of these approximate methods. Recently, Astarix introduced the A* shortest path algorithm with the seed heuristic for exact sequence-to-graph alignment. A*PA adapted and improved this for pairwise sequence alignment and achieves near-linear runtime when divergence (error rate) is low, at the cost of being very slow when divergence is high.

Methods. We introduce A*PA2, an exact global pairwise aligner with respect to edit distance. The goal of A*PA2 is to unify the near-linear runtime of A*PA on similar sequences with the efficiency of dynamic programming (DP) based methods. Like Edlib, A*PA2 uses Ukkonen's band doubling in combination with Myers' bitpacking. A*PA2 1) uses large block sizes inspired by Block Aligner, 2) extends this with SIMD (single instruction, multiple data), 3) introduces a new profile for efficient computations, 4) introduces a new optimistic technique for traceback based on diagonal transition, 5) avoids recomputation of states where possible, and 6) applies the heuristics developed in A*PA and improves them using pre-pruning.

Results. With the first 4 engineering optimizations, A*PA2-simple has complexity O(ns) and is 6× to 8× faster than Edlib for sequences ≥ 10 kbp. A*PA2-full also includes the heuristic and is often near-linear in practice for sequences with small divergence. The average runtime of A*PA2 is 19× faster than the exact aligners BiWFA and Edlib on >500 kbp long ONT (Oxford Nanopore Technologies) reads of a human genome having 6% divergence on average. On shorter ONT reads of 11% average divergence the speedup is 5.6× (avg. length 11 kbp) and 0.81×, that is, a slight slowdown (avg. length 800 bp). On all tested datasets, A*PA2 is competitive with or faster than approximate methods.
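
The O(ns) band-doubling idea mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the implementation used by A*PA2 or Edlib: it leaves out bitpacking, SIMD, blocks, traceback, and the A* heuristic, and the function names `banded_edit_distance` and `edit_distance_band_doubling` are hypothetical, chosen only for this illustration. The sketch computes only DP cells within distance t of the main diagonal and doubles t until the result is confirmed to be at most t.

```python
def banded_edit_distance(a, b, t):
    """Return the edit distance of a and b if it is <= t, else None.

    Only DP cells (i, j) with |i - j| <= t are computed, since an
    alignment of cost <= t can never leave this band.
    """
    n, m = len(a), len(b)
    if abs(n - m) > t:
        return None
    inf = t + 1  # any value above the threshold is treated as "too far"
    # Row 0: aligning the empty prefix of a with b[:j] costs j insertions.
    prev = {j: j for j in range(min(m, t) + 1)}
    for i in range(1, n + 1):
        cur = {}
        lo, hi = max(0, i - t), min(m, i + t)
        for j in range(lo, hi + 1):
            if j == 0:
                cur[j] = i  # i deletions
                continue
            best = prev.get(j - 1, inf) + (a[i - 1] != b[j - 1])  # (mis)match
            best = min(best, prev.get(j, inf) + 1)      # delete a[i-1]
            best = min(best, cur.get(j - 1, inf) + 1)   # insert b[j-1]
            cur[j] = min(best, inf)
        prev = cur
    d = prev.get(m, inf)
    return d if d <= t else None


def edit_distance_band_doubling(a, b):
    """Exact edit distance, doubling the threshold until it suffices."""
    t = max(1, abs(len(a) - len(b)))
    while True:
        d = banded_edit_distance(a, b, t)
        if d is not None:
            return d
        t *= 2


print(edit_distance_band_doubling("kitten", "sitting"))
```

Because the thresholds grow geometrically, the total work is dominated by the last iteration, which computes O(ns) cells when the final threshold is within a constant factor of the true distance s.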

Subject Classification

ACM Subject Classification
  • Applied computing → Bioinformatics
  • Software and its engineering → Software performance
  • Theory of computation → Shortest paths
  • Theory of computation → Dynamic programming
Keywords
  • Edit distance
  • Pairwise alignment
  • A*
  • Shortest path
  • Dynamic programming

