Locality-Sensitive Bucketing Functions for the Edit Distance

Authors: Ke Chen, Mingfu Shao



Author Details

Ke Chen
  • Department of Computer Science and Engineering, School of Electronic Engineering and Computer Science, The Pennsylvania State University, University Park, PA, United States
Mingfu Shao
  • Department of Computer Science and Engineering, School of Electronic Engineering and Computer Science, The Pennsylvania State University, University Park, PA, United States
  • Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, United States

Cite As

Ke Chen and Mingfu Shao. Locality-Sensitive Bucketing Functions for the Edit Distance. In 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 242, pp. 22:1-22:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)
https://doi.org/10.4230/LIPIcs.WABI.2022.22

Abstract

Many bioinformatics applications involve bucketing a set of sequences, where each sequence may be assigned to multiple buckets. To achieve both high sensitivity and high precision, a bucketing method should assign similar sequences to the same bucket and dissimilar sequences to distinct buckets. Existing k-mer-based bucketing methods are efficient on sequencing data with a low error rate, but their sensitivity drops sharply on data with a high error rate. Locality-sensitive hashing (LSH) schemes can mitigate this issue by tolerating the edits between similar sequences, but state-of-the-art methods still have large gaps. Here we generalize the LSH function by allowing it to hash one sequence into multiple buckets. Formally, a bucketing function, which maps a sequence (of fixed length) to a subset of buckets, is defined to be (d₁, d₂)-sensitive if any two sequences within an edit distance of d₁ are mapped to at least one shared bucket, and any two sequences with an edit distance of at least d₂ are mapped to disjoint subsets of buckets. We construct locality-sensitive bucketing (LSB) functions for a variety of values of (d₁, d₂) and analyze their efficiency with respect to the total number of buckets needed as well as the number of buckets that a specific sequence is mapped to. We also prove lower bounds on these two parameters in different settings and show that some of our constructed LSB functions are optimal. These results provide theoretical foundations for the practical use of LSB functions in analyzing sequences with a high error rate, and they offer insights into the hardness of designing ungapped LSH functions.
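
The (d₁, d₂)-sensitivity definition in the abstract can be checked by brute force on a small sequence space. The sketch below is illustrative only and is not taken from the paper: the helper names (`edit_distance`, `all_sequences`, `neighborhood_buckets`, `is_sensitive`) are hypothetical, and the radius-1 neighborhood bucketing is an assumed toy construction, chosen because the triangle inequality guarantees that it is (1, 3)-sensitive.

```python
from itertools import product


def edit_distance(a, b):
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def all_sequences(alphabet, n):
    """Every sequence of fixed length n over the alphabet."""
    return ["".join(p) for p in product(alphabet, repeat=n)]


def neighborhood_buckets(s, universe, radius=1):
    """Toy bucketing function (an assumption, not the paper's construction):
    buckets are labelled by sequences, and s is mapped to every sequence
    within the given edit-distance radius of s."""
    return {t for t in universe if edit_distance(s, t) <= radius}


def is_sensitive(bucketing, universe, d1, d2):
    """Check the definition of (d1, d2)-sensitivity over all pairs."""
    for i, s in enumerate(universe):
        for t in universe[i:]:
            d = edit_distance(s, t)
            shared = bucketing(s) & bucketing(t)
            if d <= d1 and not shared:
                return False  # close pair without a shared bucket
            if d >= d2 and shared:
                return False  # far pair with a shared bucket
    return True


# Small example space: binary sequences of length 3.
universe = all_sequences("01", 3)
f = lambda s: neighborhood_buckets(s, universe)

print(is_sensitive(f, universe, 1, 3))
print(is_sensitive(f, universe, 1, 2))
```

On this space the toy function passes the (1, 3) check but fails the (1, 2) check: "000" and "011" are at edit distance 2, yet both lie within distance 1 of "001" and therefore share that bucket. Closing such gaps between d₁ and d₂ is the harder setting the abstract alludes to.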

Subject Classification

ACM Subject Classification
  • Applied computing → Bioinformatics
  • Applied computing → Computational biology
Keywords
  • Locality-sensitive hashing
  • locality-sensitive bucketing
  • long reads
  • embedding
