Efficient Parallel Output-Sensitive Edit Distance

Authors Xiangyun Ding , Xiaojun Dong , Yan Gu , Youzhe Liu , Yihan Sun



PDF
Thumbnail PDF

File

LIPIcs.ESA.2023.40.pdf
  • Filesize: 2.17 MB
  • 20 pages

Document Identifiers

Author Details

Xiangyun Ding
  • University of California, Riverside, CA, USA
Xiaojun Dong
  • University of California, Riverside, CA, USA
Yan Gu
  • University of California, Riverside, CA, USA
Youzhe Liu
  • University of California, Riverside, CA, USA
Yihan Sun
  • University of California, Riverside, CA, USA

Cite As Get BibTex

Xiangyun Ding, Xiaojun Dong, Yan Gu, Youzhe Liu, and Yihan Sun. Efficient Parallel Output-Sensitive Edit Distance. In 31st Annual European Symposium on Algorithms (ESA 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 274, pp. 40:1-40:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023) https://doi.org/10.4230/LIPIcs.ESA.2023.40

Abstract

In this paper, we study efficient parallel edit distance algorithms, both in theory and in practice. Given two strings A[1..n] and B[1..m], and a set of operations allowed to edit the strings, the edit distance between A and B is the minimum number of operations required to transform A into B. In this paper, we use edit distance to refer to the Levenshtein distance, which allows for unit-cost single-character edits (insertions, deletions, substitutions). Sequentially, a standard Dynamic Programming (DP) algorithm solves edit distance with Θ(nm) cost. In many real-world applications, the strings to be compared are similar to each other and have small edit distances. To achieve highly practical implementations, we focus on output-sensitive parallel edit-distance algorithms, i.e., to achieve asymptotically better cost bounds than the standard Θ(nm) algorithm when the edit distance is small. We study four algorithms in the paper, including three algorithms based on Breadth-First Search (BFS), and one algorithm based on Divide-and-Conquer (DaC). Our BFS-based solution is based on the Landau-Vishkin algorithm. We implement three different data structures for the longest common prefix (LCP) queries needed in the algorithm: the classic solution using parallel suffix array, and two hash-based solutions proposed in this paper. Our DaC-based solution is inspired by the output-insensitive solution proposed by Apostolico et al., and we propose a non-trivial adaption to make it output-sensitive. All of the algorithms studied in this paper have good theoretical guarantees, and they achieve different tradeoffs between work (total number of operations), span (longest dependence chain in the computation), and space.
We test and compare our algorithms on both synthetic data and real-world data, including DNA sequences, Wikipedia texts, GitHub repositories, etc. Our BFS-based algorithms outperform the existing parallel edit-distance implementation in ParlayLib in all test cases. On cases with fewer than 10⁵ edits, our algorithm can process input sequences of size 10⁹ in about ten seconds, while ParlayLib can only process sequences of sizes up to 10⁶ in the same amount of time. By comparing our algorithms, we also provide a better understanding of the choice of algorithms for different input patterns. We believe that our paper is the first systematic study in the theory and practice of parallel edit distance.

Subject Classification

ACM Subject Classification
  • Theory of computation → Parallel algorithms
Keywords
  • Edit Distance
  • Parallel Algorithms
  • String Algorithms
  • Dynamic Programming
  • Pattern Matching

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Alberto Apostolico, Mikhail J Atallah, Lawrence L Larmore, and Scott McFaddin. Efficient parallel algorithms for string editing and related problems. SIAM J. on Computing, 19(5):968-988, 1990. Google Scholar
  2. Nimar S Arora, Robert D Blumofe, and C Greg Plaxton. Thread scheduling for multiprogrammed multiprocessors. Theory of Computing Systems (TOCS), 34(2):115-144, 2001. Google Scholar
  3. K Nandan Babu and Sanjeev Saxena. Parallel algorithms for the longest common subsequence problem. In IEEE International Conference on High Performance Computing (HiPC), pages 120-125. IEEE, 1997. Google Scholar
  4. Michael A. Bender and Martin Farach-Colton. The lca problem revisited. In Latin American Symposium on Theoretical Informatics (LATIN), pages 88-94. Springer, 2000. Google Scholar
  5. Dennis A Benson, Mark Cavanaugh, Karen Clark, Ilene Karsch-Mizrachi, David J Lipman, James Ostell, and Eric W Sayers. Genbank. Nucleic acids research, 41(D1):D36-D42, 2012. Google Scholar
  6. Guy E. Blelloch. Scans as primitive parallel operations. IEEE Trans. on Comput., 38(11), 1989. Google Scholar
  7. Guy E. Blelloch, Daniel Anderson, and Laxman Dhulipala. Parlaylib - A toolkit for parallel algorithms on shared-memory multicore machines. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 507-509, 2020. Google Scholar
  8. Guy E. Blelloch, Jeremy T. Fineman, Yan Gu, and Yihan Sun. Optimal parallel algorithms in the binary-forking model. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 89-102, 2020. Google Scholar
  9. Robert D. Blumofe and Charles E. Leiserson. Space-efficient scheduling of multithreaded computations. SIAM J. on Computing, 27(1), 1998. Google Scholar
  10. Nicholas Boucher, Ilia Shumailov, Ross Anderson, and Nicolas Papernot. Bad characters: Imperceptible nlp attacks. In IEEE Symposium on Security and Privacy (SP), pages 1987-2004. IEEE, 2022. Google Scholar
  11. Dana Carroll. Focus: genome editing: genome editing: past, present, and future. The Yale journal of biology and medicine, 90(4):653, 2017. Google Scholar
  12. Jung Hee Cheon, Miran Kim, and Kristin Lauter. Homomorphic computation of edit distance. In International Conference on Financial Cryptography and Data Security, pages 194-212. Springer, 2015. Google Scholar
  13. Kendell Clement, Holly Rees, Matthew C Canver, Jason M Gehrke, Rick Farouni, Jonathan Y Hsu, Mitchel A Cole, David R Liu, J Keith Joung, Daniel E Bauer, et al. Crispresso2 provides accurate and rapid genome editing sequence analysis. Nature biotechnology, 37(3):224-226, 2019. Google Scholar
  14. Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms (3rd edition). MIT Press, 2009. Google Scholar
  15. Sanjoy Dasgupta, Christos H Papadimitriou, and Umesh Virkumar Vazirani. Algorithms. McGraw-Hill Higher Education New York, 2008. Google Scholar
  16. Xiangyun Ding, Xiaojun Dong, Yan Gu, Yihan Sun, and Youzhe Liu. Efficient parallel output-sensitive edit distance. arXiv preprint:2306.17461, 2023. Google Scholar
  17. Xiangyun Ding, Xiaojun Dong, Yan Gu, Yihan Sun, and Youzhe Liu. Parallel implementations for output-sensitive edit distance. https://github.com/ucrparlay/Edit-Distance, 2023.
  18. Robert W Floyd. Algorithm 97: shortest path. Commun. ACM, 5(6):345, 1962. Google Scholar
  19. Zvi Galil and Raffaele Giancarlo. Improved string matching with k mismatches. ACM SIGACT News, 17(4):52-54, 1986. Google Scholar
  20. Zvi Galil and Raffaele Giancarlo. Parallel string matching with k mismatches. Theoretical Computer Science (TCS), 51(3):341-348, 1987. Google Scholar
  21. Zvi Galil and Raffaele Giancarlo. Data structures and algorithms for approximate string matching. Journal of Complexity, 4(1):33-72, 1988. Google Scholar
  22. Zvi Galil and Kunsoo Park. An improved algorithm for approximate string matching. SIAM Journal on Computing, 19(6):989-999, 1990. Google Scholar
  23. Michael T Goodrich and Roberto Tamassia. Algorithm design and applications. Wiley Hoboken, 2015. Google Scholar
  24. Yan Gu, Ziyang Men, Zheqi Shen, Yihan Sun, and Zijin Wan. Parallel longest increasing subsequence and van emde boas trees. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2023. Google Scholar
  25. Yan Gu, Zachary Napier, and Yihan Sun. Analysis of work-stealing and parallel cache complexity. In SIAM Symposium on Algorithmic Principles of Computer Systems (APOCS), pages 46-60. SIAM, 2022. Google Scholar
  26. Dov Harel and Robert Endre Tarjan. Fast algorithms for finding nearest common ancestors. siam Journal on Computing, 13(2):338-355, 1984. Google Scholar
  27. Daniel S. Hirschberg. A linear space algorithm for computing maximal common subsequences. Commun. ACM, 18(6):341-343, 1975. Google Scholar
  28. Daniel Hládek, Ján Staš, and Matúš Pleva. Survey of automatic spelling correction. Electronics, 9(10):1670, 2020. Google Scholar
  29. Md Mosabbir Hossain, Md Farhan Labib, Ahmed Sady Rifat, Amit Kumar Das, and Monira Mukta. Auto-correction of english to bengali transliteration system using levenshtein distance. In International Conference on Smart Computing & Communications (ICSCC), pages 1-5. IEEE, 2019. Google Scholar
  30. Yoon-Seong Jeon, Kihyun Lee, Sang-Cheol Park, Bong-Soo Kim, Yong-Joon Cho, Sung-Min Ha, and Jongsik Chun. Ezeditor: a versatile sequence alignment editor for both rrna-and protein-coding genes. International journal of systematic and evolutionary microbiology, 64(Pt_2):689-691, 2014. Google Scholar
  31. Tao Jiang, Guohui Lin, Bin Ma, and Kaizhong Zhang. A general edit distance between RNA structures. Journal of Computational Biology, 9(2):371-388, 2002. Google Scholar
  32. Juha Kärkkäinen and Peter Sanders. Simple linear work suffix array construction. In Intl. Colloq. on Automata, Languages and Programming (ICALP), pages 943-955. Springer, 2003. Google Scholar
  33. Richard M Karp and Michael O Rabin. Efficient randomized pattern-matching algorithms. IBM journal of research and development, 31(2):249-260, 1987. Google Scholar
  34. Peter Krusche and Alexander Tiskin. New algorithms for efficient parallel string comparison. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 209-216, 2010. Google Scholar
  35. Gad M Landau and Uzi Vishkin. Efficient string matching with k mismatches. Theoretical Computer Science (TCS), 43:239-249, 1986. Google Scholar
  36. Gad M Landau and Uzi Vishkin. Fast string matching with k differences. J. Computer and System Sciences, 37(1):63-78, 1988. Google Scholar
  37. Gad M Landau and Uzi Vishkin. Fast parallel and serial approximate string matching. J. Algorithms, 10(2):157-169, 1989. Google Scholar
  38. VI Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, page 707, 1966. Google Scholar
  39. Heng Li and Nils Homer. A survey of sequence alignment algorithms for next-generation sequencing. Briefings in bioinformatics, 11(5):473-483, 2010. Google Scholar
  40. Linux Kernel File dcn_1_0_sh_mask.h Commit History on GitHub. URL: https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/amd/include/asic_reg/dcn/dcn_1_0_sh_mask.h.
  41. Mi Lu and Hua Lin. Parallel algorithms for the longest common subsequence problem. IEEE Transactions on Parallel and Distributed Systems, 5(8):835-848, 1994. Google Scholar
  42. Udi Manber and Gene Myers. Suffix arrays: a new method for on-line string searches. SIAM J. on Computing, 22(5):935-948, 1993. Google Scholar
  43. Guillaume Marçais, Dan DeBlasio, Prashant Pandey, and Carl Kingsford. Locality-sensitive hashing for the edit distance. Bioinformatics, 35(14):i127-i135, 2019. Google Scholar
  44. Samuel McCauley. Approximate Similarity Search Under Edit Distance Using Locality-Sensitive Hashing. In 24th International Conference on Database Theory (ICDT 2021), volume 186, pages 21:1-21:22. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021. Google Scholar
  45. Municipal history of Quebec on Wikipedia. URL: https://en.wikipedia.org/wiki/Municipal_history_of_Quebec.
  46. Eugene W Myers. An o (nd) difference algorithm and its variations. Algorithmica, 1(1-4):251-266, 1986. Google Scholar
  47. Eugene Wimberly Myers. Incremental alignment algorithms and their applications. University of Arizona, Department of Computer Science, 1986. Google Scholar
  48. Jean-Frédéric Myoupo and David Seme. Time-efficient parallel algorithms for the longest common subsequence and related problems. J. Parallel Distrib. Comput., 57(2):212-223, 1999. Google Scholar
  49. Gonzalo Navarro. A guided tour to approximate string matching. ACM computing surveys (CSUR), 33(1):31-88, 2001. Google Scholar
  50. Mike Paterson and Vlado Dančík. Longest common subsequences. In International Symposium on Mathematical Foundations of Computer Science, pages 127-142. Springer, 1994. Google Scholar
  51. Jane Peterson, Susan Garges, Maria Giovanni, Pamela McInnes, Lu Wang, Jeffery A Schloss, Vivien Bonazzi, Jean E McEwen, Kris A Wetterstrand, Carolyn Deal, et al. The nih human microbiome project. Genome research, 19(12):2317-2323, 2009. Google Scholar
  52. Zheqi Shen, Zijin Wan, Yan Gu, and Yihan Sun. Many sequential iterative algorithms can be parallel and (nearly) work-efficient. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2022. Google Scholar
  53. Julian Shun. Fast parallel computation of longest common prefixes. In International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), pages 387-398. IEEE, 2014. Google Scholar
  54. Diomidis Spinellis. Git. IEEE software, 29(3):100-101, 2012. Google Scholar
  55. Vianney Kengne Tchendji, Armel Nkonjoh Ngomade, Jerry Lacmou Zeutouo, and Jean Frédéric Myoupo. Efficient cgm-based parallel algorithms for the longest common subsequence problem with multiple substring-exclusion constraints. Parallel Computing, 91:102598, 2020. Google Scholar
  56. Esko Ukkonen. Algorithms for approximate string matching. Information and Control, 64(1-3):100-118, 1985. Google Scholar
  57. Liang-Jiao Xue and Chung-Jui Tsai. Ageseq: analysis of genome editing by sequencing. Molecular plant, 8(9):1428-1430, 2015. Google Scholar
  58. Jiaoyun Yang, Yun Xu, and Yi Shang. An efficient parallel algorithm for longest common subsequence problem on GPUs. In World Congress on Engineering, volume 1, pages 499-504, 2010. Google Scholar
  59. Hongyu Zhang. Alignment of blast high-scoring segment pairs based on the longest increasing subsequence algorithm. Bioinformatics, 19(11):1391-1396, 2003. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail