# Merging Sorted Lists of Similar Strings

## File

LIPIcs.CPM.2023.22.pdf
• Filesize: 0.74 MB
• 15 pages

## Acknowledgements

I would like to acknowledge Richard Durbin and Travis Gagie for their encouragement to write up this work, Shane McCarthy for providing the data sets that made the need for a collision heap apparent, Gonzalo Navarro for suggesting the trie approach and pointing to Thorup’s work, and Shinichi Morishita and Yoshihiko Suzuki for their many helpful comments and review of the work.

## Cite As

Gene Myers. Merging Sorted Lists of Similar Strings. In 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 259, pp. 22:1-22:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)
https://doi.org/10.4230/LIPIcs.CPM.2023.22

## Abstract

Merging T sorted, non-redundant lists containing M elements into a single sorted, non-redundant result of size N ≥ M/T is a classic problem, typically solved in practice in O(M log T) time with a priority-queue data structure, the most basic of which is the simple heap. We revisit this problem in the situation where the list elements are strings and the lists contain many identical or nearly identical elements. By keeping simple auxiliary information with each heap node, we devise an O(M log T + S) worst-case method that performs no more character comparisons than S, the sum of the lengths of all the strings, and another O(M log (T/ē) + S) method that becomes progressively more efficient as a function of the fraction of equal elements ē = M/N between input lists, reaching linear time when the lists are all identical. The methods perform favorably in practice versus an alternate formulation based on a trie.
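For orientation, the classic O(M log T) baseline mentioned in the abstract can be sketched with a binary heap as below. This is only a minimal illustration of the textbook approach, not the paper's string-aware methods: it compares whole strings and keeps no auxiliary prefix information. The function name `merge_sorted_unique` and the sample lists are hypothetical, chosen here for illustration.

```python
import heapq

def merge_sorted_unique(lists):
    """Merge sorted, duplicate-free lists into one sorted, duplicate-free list."""
    # Heap entries are (element, list index, position in that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    merged = []
    while heap:
        value, i, pos = heapq.heappop(heap)
        # Emit each distinct value once, even if several lists contain it.
        if not merged or merged[-1] != value:
            merged.append(value)
        # Refill the heap with the next element of the list just consumed.
        if pos + 1 < len(lists[i]):
            heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
    return merged

# Hypothetical input: T = 3 sorted lists with heavy overlap.
lists = [["acga", "acgt", "ttag"], ["acgt", "ccat"], ["acga", "acgt", "ttag"]]
result = merge_sorted_unique(lists)
M = sum(len(lst) for lst in lists)  # total number of input elements
N = len(result)                     # number of distinct output elements
print(result, M, N, M / N)
```

In this toy input the ratio M/N is 2, meaning each output string appears in two input lists on average; the abstract's second method is the one whose running time improves as this ratio grows.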

## Subject Classification

##### ACM Subject Classification
• Theory of computation → Theory and algorithms for application domains
##### Keywords
• heap
• trie
• longest common prefix

## References

1. A. Amir, G. Franceschini, R. Grossi, T. Kopelowitz, M. Lewenstein, and N. Lewenstein. Managing unbounded-length keys in comparison-driven data structures with application to online indexing. SIAM J. on Computing, 43:1396-1416, 2014.
2. T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein. Introduction to Algorithms. MIT Press, 2009.
3. E. Fredkin. Trie memory. Comm. of the ACM, 3:490-499, 1960.
4. Donald E. Knuth. The Art of Computer Programming Vol. 3. Addison Wesley, 1998.
5. M. Kokot, M. Dlugosz, and S. Deorowicz. KMC 3: Counting and manipulating k-mer statistics. Bioinformatics, 33:2759-2761, 2017.
6. G. Marcais and C. Kingsford. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics, 27:764-770, 2011.
7. Eugene Myers. https://github.com/thegenemyers/FASTK, 2020.
8. Arang Rhie, Shane McCarthy, Olivier Fedrigo, (others), Eugene Myers, Richard Durbin, Adam Phillippy, and Erich Jarvis. Towards complete and error-free genome assemblies of all vertebrate species. Nature, 592:737-746, 2021.
9. M. Thorup. On RAM priority queues. SIAM J. on Computing, 30:86-109, 2000.
10. P. van Emde Boas, R. Kaas, and E. Zijlstra. Design and implementation of an efficient priority queue. Mathematical Systems Theory, 10:99-127, 1977.