Merging Sorted Lists of Similar Strings

Myers, Gene

doi:10.4230/LIPIcs.CPM.2023.22

Subject Classification

ACM Subject Classification

Theory of computation → Theory and algorithms for application domains

Keywords

heap
trie
longest common prefix

Abstract

Merging T sorted, non-redundant lists containing M elements into a single sorted, non-redundant result of size N ≥ M/T is a classic problem, typically solved in practice in O(M log T) time with a priority-queue data structure, the most basic of which is the simple heap. We revisit this problem in the situation where the list elements are strings and the lists contain many identical or nearly identical elements. By keeping simple auxiliary information with each heap node, we devise an O(M log T + S) worst-case method that performs no more character comparisons than the sum S of the lengths of all the strings, and another O(M log(T/ē) + S) method that becomes progressively more efficient as a function of the fraction of equal elements ē = M/N between input lists, reaching linear time when the lists are all identical. The methods perform favorably in practice versus an alternate formulation based on a trie.
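
For orientation, the classic O(M log T) baseline that the abstract starts from can be sketched as follows. This is only a minimal illustration of the plain heap-based merge, not the paper's improved methods: it keeps no auxiliary information in the heap nodes, so every comparison may rescan a long shared prefix. The function name `merge_sorted_unique` and the tuple layout of the heap entries are assumptions made for this example.

```python
import heapq
from typing import Iterator, List


def merge_sorted_unique(lists: List[List[str]]) -> Iterator[str]:
    """Merge T sorted, duplicate-free lists into one sorted, duplicate-free stream.

    Baseline only: a plain binary heap holding one entry per input list.
    """
    # Each heap entry is (current string, list index, position in that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)

    last = None  # most recently emitted string, used to drop duplicates
    while heap:
        value, i, pos = heap[0]
        if value != last:
            yield value
            last = value
        if pos + 1 < len(lists[i]):
            # Advance list i: replace the heap top with its next element.
            heapq.heapreplace(heap, (lists[i][pos + 1], i, pos + 1))
        else:
            # List i is exhausted: remove its entry from the heap.
            heapq.heappop(heap)


merged = list(merge_sorted_unique([["aab", "abc"], ["aab", "abd"], ["abc"]]))
```

Each of the M input elements passes through the heap once at a cost of O(log T) comparisons, which gives the O(M log T) bound. With similar strings, each of those comparisons can itself cost time proportional to the shared prefix length, which is the overhead the abstract's methods are designed to avoid.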

Cite As

Gene Myers. Merging Sorted Lists of Similar Strings. In 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 259, pp. 22:1-22:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023) https://doi.org/10.4230/LIPIcs.CPM.2023.22

Author Details

Gene Myers

Okinawa Institute of Science and Technology, Japan
MPI for Molecular Cell Biology and Genetics, Dresden, Germany

Acknowledgements

I would like to acknowledge Richard Durbin and Travis Gagie for their encouragement to write up this work, Shane McCarthy for providing the data sets that made the need for a collision heap apparent, Gonzalo Navarro for suggesting the trie approach and pointing to Thorup’s work, and Shinichi Morishita and Yoshihiko Suzuki for their many helpful comments and review of the work.

References

A. Amir, G. Franceschini, R. Grossi, T. Kopelowitz, M. Lewenstein, and N. Lewenstein. Managing unbounded-length keys in comparison-driven data structures with application to online indexing. SIAM J. on Computing, 43:1396-1416, 2014.
T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein. Introduction to Algorithms. MIT Press, 2009.
E. Fredkin. Trie memory. Comm. of the ACM, 3:490-499, 1960.
Donald E. Knuth. The Art of Computer Programming Vol. 3. Addison Wesley, 1998.
M. Kokot, M. Dlugosz, and S. Deorowicz. KMC 3: Counting and manipulating k-mer statistics. Bioinformatics, 33:2759-2761, 2017.
G. Marcais and C. Kingsford. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics, 27:764-770, 2011.
Eugene Myers. https://github.com/thegenemyers/FASTK, 2020.
Arang Rhie, Shane McCarthy, Olivier Fedrigo, (others), Eugene Myers, Richard Durbin, Adam Phillippy, and Erich Jarvis. Towards complete and error-free genome assemblies of all vertebrate species. Nature, 592:737-746, 2021.
M. Thorup. On RAM priority queues. SIAM J. on Computing, 30:86-109, 2000.
P. van Emde Boas, R. Kaas, and E. Zijlstra. Design and implementation of an efficient priority queue. Mathematical Systems Theory, 10:99-127, 1977.
