Efficient Parallel Output-Sensitive Edit Distance

Ding, Xiangyun; Dong, Xiaojun; Gu, Yan; Liu, Youzhe; Sun, Yihan

doi:10.4230/LIPIcs.ESA.2023.40

Abstract

In this paper, we study efficient parallel edit distance algorithms, both in theory and in practice. Given two strings A[1..n] and B[1..m], and a set of operations allowed to edit the strings, the edit distance between A and B is the minimum number of operations required to transform A into B. In this paper, we use edit distance to refer to the Levenshtein distance, which allows for unit-cost single-character edits (insertions, deletions, substitutions). Sequentially, a standard Dynamic Programming (DP) algorithm solves edit distance with Θ(nm) cost. In many real-world applications, the strings to be compared are similar to each other and have small edit distances. To achieve highly practical implementations, we focus on output-sensitive parallel edit-distance algorithms, i.e., to achieve asymptotically better cost bounds than the standard Θ(nm) algorithm when the edit distance is small. We study four algorithms in the paper, including three algorithms based on Breadth-First Search (BFS), and one algorithm based on Divide-and-Conquer (DaC). Our BFS-based solution is based on the Landau-Vishkin algorithm. We implement three different data structures for the longest common prefix (LCP) queries needed in the algorithm: the classic solution using parallel suffix array, and two hash-based solutions proposed in this paper. Our DaC-based solution is inspired by the output-insensitive solution proposed by Apostolico et al., and we propose a non-trivial adaption to make it output-sensitive. All of the algorithms studied in this paper have good theoretical guarantees, and they achieve different tradeoffs between work (total number of operations), span (longest dependence chain in the computation), and space. We test and compare our algorithms on both synthetic data and real-world data, including DNA sequences, Wikipedia texts, GitHub repositories, etc. Our BFS-based algorithms outperform the existing parallel edit-distance implementation in ParlayLib in all test cases. On cases with fewer than 10⁵ edits, our algorithm can process input sequences of size 10⁹ in about ten seconds, while ParlayLib can only process sequences of sizes up to 10⁶ in the same amount of time. By comparing our algorithms, we also provide a better understanding of the choice of algorithms for different input patterns. We believe that our paper is the first systematic study in the theory and practice of parallel edit distance.

Alberto Apostolico, Mikhail J Atallah, Lawrence L Larmore, and Scott McFaddin. Efficient parallel algorithms for string editing and related problems. SIAM J. on Computing, 19(5):968-988, 1990.
Nimar S Arora, Robert D Blumofe, and C Greg Plaxton. Thread scheduling for multiprogrammed multiprocessors. Theory of Computing Systems (TOCS), 34(2):115-144, 2001.
K Nandan Babu and Sanjeev Saxena. Parallel algorithms for the longest common subsequence problem. In IEEE International Conference on High Performance Computing (HiPC), pages 120-125. IEEE, 1997.
Michael A. Bender and Martin Farach-Colton. The lca problem revisited. In Latin American Symposium on Theoretical Informatics (LATIN), pages 88-94. Springer, 2000.
Dennis A Benson, Mark Cavanaugh, Karen Clark, Ilene Karsch-Mizrachi, David J Lipman, James Ostell, and Eric W Sayers. Genbank. Nucleic acids research, 41(D1):D36-D42, 2012.
Guy E. Blelloch. Scans as primitive parallel operations. IEEE Trans. on Comput., 38(11), 1989.
Guy E. Blelloch, Daniel Anderson, and Laxman Dhulipala. Parlaylib - A toolkit for parallel algorithms on shared-memory multicore machines. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 507-509, 2020.
Guy E. Blelloch, Jeremy T. Fineman, Yan Gu, and Yihan Sun. Optimal parallel algorithms in the binary-forking model. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 89-102, 2020.
Robert D. Blumofe and Charles E. Leiserson. Space-efficient scheduling of multithreaded computations. SIAM J. on Computing, 27(1), 1998.
Nicholas Boucher, Ilia Shumailov, Ross Anderson, and Nicolas Papernot. Bad characters: Imperceptible nlp attacks. In IEEE Symposium on Security and Privacy (SP), pages 1987-2004. IEEE, 2022.
Dana Carroll. Focus: genome editing: genome editing: past, present, and future. The Yale journal of biology and medicine, 90(4):653, 2017.
Jung Hee Cheon, Miran Kim, and Kristin Lauter. Homomorphic computation of edit distance. In International Conference on Financial Cryptography and Data Security, pages 194-212. Springer, 2015.
Kendell Clement, Holly Rees, Matthew C Canver, Jason M Gehrke, Rick Farouni, Jonathan Y Hsu, Mitchel A Cole, David R Liu, J Keith Joung, Daniel E Bauer, et al. Crispresso2 provides accurate and rapid genome editing sequence analysis. Nature biotechnology, 37(3):224-226, 2019.
Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms (3rd edition). MIT Press, 2009.
Sanjoy Dasgupta, Christos H Papadimitriou, and Umesh Virkumar Vazirani. Algorithms. McGraw-Hill Higher Education New York, 2008.
Xiangyun Ding, Xiaojun Dong, Yan Gu, Yihan Sun, and Youzhe Liu. Efficient parallel output-sensitive edit distance. arXiv preprint:2306.17461, 2023.
Xiangyun Ding, Xiaojun Dong, Yan Gu, Yihan Sun, and Youzhe Liu. Parallel implementations for output-sensitive edit distance. https://github.com/ucrparlay/Edit-Distance, 2023.
Robert W Floyd. Algorithm 97: shortest path. Commun. ACM, 5(6):345, 1962.
Zvi Galil and Raffaele Giancarlo. Improved string matching with k mismatches. ACM SIGACT News, 17(4):52-54, 1986.
Zvi Galil and Raffaele Giancarlo. Parallel string matching with k mismatches. Theoretical Computer Science (TCS), 51(3):341-348, 1987.
Zvi Galil and Raffaele Giancarlo. Data structures and algorithms for approximate string matching. Journal of Complexity, 4(1):33-72, 1988.
Zvi Galil and Kunsoo Park. An improved algorithm for approximate string matching. SIAM Journal on Computing, 19(6):989-999, 1990.
Michael T Goodrich and Roberto Tamassia. Algorithm design and applications. Wiley Hoboken, 2015.
Yan Gu, Ziyang Men, Zheqi Shen, Yihan Sun, and Zijin Wan. Parallel longest increasing subsequence and van emde boas trees. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2023.
Yan Gu, Zachary Napier, and Yihan Sun. Analysis of work-stealing and parallel cache complexity. In SIAM Symposium on Algorithmic Principles of Computer Systems (APOCS), pages 46-60. SIAM, 2022.
Dov Harel and Robert Endre Tarjan. Fast algorithms for finding nearest common ancestors. siam Journal on Computing, 13(2):338-355, 1984.
Daniel S. Hirschberg. A linear space algorithm for computing maximal common subsequences. Commun. ACM, 18(6):341-343, 1975.
Daniel Hládek, Ján Staš, and Matúš Pleva. Survey of automatic spelling correction. Electronics, 9(10):1670, 2020.
Md Mosabbir Hossain, Md Farhan Labib, Ahmed Sady Rifat, Amit Kumar Das, and Monira Mukta. Auto-correction of english to bengali transliteration system using levenshtein distance. In International Conference on Smart Computing & Communications (ICSCC), pages 1-5. IEEE, 2019.
Yoon-Seong Jeon, Kihyun Lee, Sang-Cheol Park, Bong-Soo Kim, Yong-Joon Cho, Sung-Min Ha, and Jongsik Chun. Ezeditor: a versatile sequence alignment editor for both rrna-and protein-coding genes. International journal of systematic and evolutionary microbiology, 64(Pt_2):689-691, 2014.
Tao Jiang, Guohui Lin, Bin Ma, and Kaizhong Zhang. A general edit distance between RNA structures. Journal of Computational Biology, 9(2):371-388, 2002.
Juha Kärkkäinen and Peter Sanders. Simple linear work suffix array construction. In Intl. Colloq. on Automata, Languages and Programming (ICALP), pages 943-955. Springer, 2003.
Richard M Karp and Michael O Rabin. Efficient randomized pattern-matching algorithms. IBM journal of research and development, 31(2):249-260, 1987.
Peter Krusche and Alexander Tiskin. New algorithms for efficient parallel string comparison. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 209-216, 2010.
Gad M Landau and Uzi Vishkin. Efficient string matching with k mismatches. Theoretical Computer Science (TCS), 43:239-249, 1986.
Gad M Landau and Uzi Vishkin. Fast string matching with k differences. J. Computer and System Sciences, 37(1):63-78, 1988.
Gad M Landau and Uzi Vishkin. Fast parallel and serial approximate string matching. J. Algorithms, 10(2):157-169, 1989.
VI Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, page 707, 1966.
Heng Li and Nils Homer. A survey of sequence alignment algorithms for next-generation sequencing. Briefings in bioinformatics, 11(5):473-483, 2010.
Linux Kernel File dcn_1_0_sh_mask.h Commit History on GitHub. URL: https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/amd/include/asic_reg/dcn/dcn_1_0_sh_mask.h.
Mi Lu and Hua Lin. Parallel algorithms for the longest common subsequence problem. IEEE Transactions on Parallel and Distributed Systems, 5(8):835-848, 1994.
Udi Manber and Gene Myers. Suffix arrays: a new method for on-line string searches. SIAM J. on Computing, 22(5):935-948, 1993.
Guillaume Marçais, Dan DeBlasio, Prashant Pandey, and Carl Kingsford. Locality-sensitive hashing for the edit distance. Bioinformatics, 35(14):i127-i135, 2019.
Samuel McCauley. Approximate Similarity Search Under Edit Distance Using Locality-Sensitive Hashing. In 24th International Conference on Database Theory (ICDT 2021), volume 186, pages 21:1-21:22. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021.
Municipal history of Quebec on Wikipedia. URL: https://en.wikipedia.org/wiki/Municipal_history_of_Quebec.
Eugene W Myers. An o (nd) difference algorithm and its variations. Algorithmica, 1(1-4):251-266, 1986.
Eugene Wimberly Myers. Incremental alignment algorithms and their applications. University of Arizona, Department of Computer Science, 1986.
Jean-Frédéric Myoupo and David Seme. Time-efficient parallel algorithms for the longest common subsequence and related problems. J. Parallel Distrib. Comput., 57(2):212-223, 1999.
Gonzalo Navarro. A guided tour to approximate string matching. ACM computing surveys (CSUR), 33(1):31-88, 2001.
Mike Paterson and Vlado Dančík. Longest common subsequences. In International Symposium on Mathematical Foundations of Computer Science, pages 127-142. Springer, 1994.
Jane Peterson, Susan Garges, Maria Giovanni, Pamela McInnes, Lu Wang, Jeffery A Schloss, Vivien Bonazzi, Jean E McEwen, Kris A Wetterstrand, Carolyn Deal, et al. The nih human microbiome project. Genome research, 19(12):2317-2323, 2009.
Zheqi Shen, Zijin Wan, Yan Gu, and Yihan Sun. Many sequential iterative algorithms can be parallel and (nearly) work-efficient. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2022.
Julian Shun. Fast parallel computation of longest common prefixes. In International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), pages 387-398. IEEE, 2014.
Diomidis Spinellis. Git. IEEE software, 29(3):100-101, 2012.
Vianney Kengne Tchendji, Armel Nkonjoh Ngomade, Jerry Lacmou Zeutouo, and Jean Frédéric Myoupo. Efficient cgm-based parallel algorithms for the longest common subsequence problem with multiple substring-exclusion constraints. Parallel Computing, 91:102598, 2020.
Esko Ukkonen. Algorithms for approximate string matching. Information and Control, 64(1-3):100-118, 1985.
Liang-Jiao Xue and Chung-Jui Tsai. Ageseq: analysis of genome editing by sequencing. Molecular plant, 8(9):1428-1430, 2015.
Jiaoyun Yang, Yun Xu, and Yi Shang. An efficient parallel algorithm for longest common subsequence problem on GPUs. In World Congress on Engineering, volume 1, pages 499-504, 2010.
Hongyu Zhang. Alignment of blast high-scoring segment pairs based on the longest increasing subsequence algorithm. Bioinformatics, 19(11):1391-1396, 2003.

Efficient Parallel Output-Sensitive Edit Distance

Authors Xiangyun Ding , Xiaojun Dong , Yan Gu , Youzhe Liu , Yihan Sun

File

Document Identifiers

Author Details

Cite AsGet BibTex

Abstract

Subject Classification

ACM Subject Classification

Keywords

Metrics

References

Thanks for your feedback!

Could not send message

Efficient Parallel Output-Sensitive Edit Distance

Authors Xiangyun Ding , Xiaojun Dong , Yan Gu , Youzhe Liu , Yihan Sun

File

Document Identifiers

Author Details

Funding

Cite AsGet BibTex

Abstract

Subject Classification

ACM Subject Classification

Keywords

Metrics

Related Versions

Supplementary Materials

References

Thanks for your feedback!

Could not send message