Merging Sorted Lists of Similar Strings

Author Gene Myers




Author Details

Gene Myers
  • Okinawa Institute of Science and Technology, Japan
  • MPI for Molecular Cell Biology and Genetics, Dresden, Germany

Acknowledgements

I would like to acknowledge Richard Durbin and Travis Gagie, for their encouragement to write up this work, Shane McCarthy for providing the data sets that made the need for a collision heap apparent, Gonzalo Navarro for suggesting the trie approach and pointing at Thorup’s work, and Shinichi Morishita and Yoshihiko Suzuki for their many helpful comments and review of the work.

Cite As

Gene Myers. Merging Sorted Lists of Similar Strings. In 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 259, pp. 22:1-22:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023) https://doi.org/10.4230/LIPIcs.CPM.2023.22

Abstract

Merging T sorted, non-redundant lists containing M elements into a single sorted, non-redundant result of size N ≥ M/T is a classic problem, typically solved in practice in O(M log T) time with a priority-queue data structure, the most basic of which is the simple heap. We revisit this problem in the situation where the list elements are strings and the lists contain many identical or nearly identical elements. By keeping simple auxiliary information with each heap node, we devise an O(M log T + S) worst-case method that performs no more character comparisons than S, the sum of the lengths of all the strings, and another O(M log(T/ē) + S) method that becomes progressively more efficient as a function of the fraction of equal elements ē = M/N between input lists, reaching linear time when the lists are all identical. The methods perform favorably in practice versus an alternate formulation based on a trie.
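
For orientation, the classic O(M log T) heap-based merge that the abstract takes as its starting point might be sketched as follows. This is only the textbook baseline, not the paper's method: it keeps no auxiliary information in the heap nodes and compares whole strings on every heap operation. The function name `merge_sorted_unique` and the use of Python's `heapq` module are assumptions made for illustration.

```python
import heapq
from typing import Iterator, List, Optional, Tuple


def merge_sorted_unique(lists: List[List[str]]) -> Iterator[str]:
    """Baseline T-way merge of sorted, duplicate-free string lists.

    Hypothetical helper for illustration only; it yields each distinct
    string once, in sorted order, using a plain binary heap.
    """
    iters = [iter(lst) for lst in lists]

    # Seed the heap with the head of every non-empty list.
    heap: List[Tuple[str, int]] = []
    for idx, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            heap.append((first, idx))
    heapq.heapify(heap)

    last: Optional[str] = None
    while heap:
        value, idx = heap[0]
        # Equal strings from different lists surface consecutively,
        # so remembering the last output is enough to drop duplicates.
        if value != last:
            yield value
            last = value
        nxt = next(iters[idx], None)
        if nxt is None:
            heapq.heappop(heap)                  # this list is exhausted
        else:
            heapq.heapreplace(heap, (nxt, idx))  # advance within this list
```

With T = 3 input lists `["a", "c"]`, `["a", "b", "c"]` and `[]`, there are M = 5 input elements and the merged output `["a", "b", "c"]` has N = 3 elements. Every one of the M elements passes through the heap, and each heap operation re-compares full strings, which is the cost the paper's methods aim to reduce when the lists share many identical or nearly identical strings.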

Subject Classification

ACM Subject Classification
  • Theory of computation → Theory and algorithms for application domains
Keywords
  • heap
  • trie
  • longest common prefix

References

  1. A. Amir, G. Franceschini, R. Grossi, T. Kopelowitz, M. Lewenstein, and N. Lewenstein. Managing unbounded-length keys in comparison-driven data structures with application to online indexing. SIAM J. on Computing, 43:1396-1416, 2014.
  2. T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein. Introduction to Algorithms. MIT Press, 2009.
  3. E. Fredkin. Trie memory. Comm. of the ACM, 3:490-499, 1960.
  4. Donald E. Knuth. The Art of Computer Programming Vol. 3. Addison-Wesley, 1998.
  5. M. Kokot, M. Dlugosz, and S. Deorowicz. KMC 3: Counting and manipulating k-mer statistics. Bioinformatics, 33:2759-2761, 2017.
  6. G. Marcais and C. Kingsford. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics, 27:764-770, 2011.
  7. Eugene Myers. https://github.com/thegenemyers/FASTK, 2020.
  8. Arang Rhie, Shane McCarthy, Olivier Frederigo, (others), Eugene Myers, Richard Durbin, Adam Phillippy, and Erich Jarvis. Towards complete and error-free genome assemblies of all vertebrate species. Nature, 592:737-746, 2021.
  9. M. Thorup. On RAM priority queues. SIAM J. on Computing, 30:86-109, 2000.
  10. P. van Emde Boas, R. Kaas, and E. Zijlstra. Design and implementation of an efficient priority queue. Mathematical Systems Theory, 10:99-127, 1977.