Locality-Sensitive Bucketing Functions for the Edit Distance

Chen, Ke; Shao, Mingfu

doi:10.4230/LIPIcs.WABI.2022.22

Abstract

Many bioinformatics applications involve bucketing a set of sequences, where each sequence may be assigned to multiple buckets. To achieve both high sensitivity and high precision, a bucketing method should assign similar sequences to the same bucket and dissimilar sequences to distinct buckets. Existing k-mer-based bucketing methods are efficient on sequencing data with low error rates but lose much of their sensitivity on data with high error rates. Locality-sensitive hashing (LSH) schemes mitigate this issue by tolerating the edits in similar sequences, but state-of-the-art methods still have large gaps. Here we generalize the LSH function by allowing it to hash one sequence into multiple buckets. Formally, a bucketing function, which maps a sequence (of fixed length) to a subset of buckets, is defined to be (d₁, d₂)-sensitive if any two sequences within an edit distance of d₁ are mapped to at least one shared bucket, and any two sequences with edit distance at least d₂ are mapped to disjoint subsets of buckets. We construct locality-sensitive bucketing (LSB) functions for a variety of values of (d₁, d₂) and analyze their efficiency with respect to the total number of buckets needed as well as the number of buckets that a specific sequence is mapped to. We also prove lower bounds on these two parameters in different settings and show that some of our constructed LSB functions are optimal. These results provide theoretical foundations for the practical use of LSB functions in analyzing sequences with high error rates, and also offer insights into the hardness of designing ungapped LSH functions.
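The (d₁, d₂)-sensitivity definition can be checked by brute force for small sequence lengths. The sketch below is illustrative only and is not taken from the paper or its software: the names `edit_distance`, `delete_one_buckets`, and `is_sensitive` are hypothetical, and the toy bucketing function (map a sequence to every string obtained by deleting one character) is chosen for simplicity rather than as one of the authors' constructions.

```python
from itertools import product


def edit_distance(a, b):
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,               # delete ca
                           cur[j - 1] + 1,            # insert cb
                           prev[j - 1] + (ca != cb))) # substitute or match
        prev = cur
    return prev[-1]


def delete_one_buckets(s):
    """Toy bucketing function (an assumption for illustration):
    the buckets of s are all strings obtained by deleting one character."""
    return {s[:i] + s[i + 1:] for i in range(len(s))}


def is_sensitive(bucketing, n, alphabet, d1, d2):
    """Exhaustively test the (d1, d2)-sensitivity definition on all
    length-n sequences over the given alphabet."""
    seqs = ["".join(p) for p in product(alphabet, repeat=n)]
    buckets = {s: bucketing(s) for s in seqs}
    for s in seqs:
        for t in seqs:
            d = edit_distance(s, t)
            shared = bool(buckets[s] & buckets[t])
            if d <= d1 and not shared:
                return False  # similar pair with no common bucket
            if d >= d2 and shared:
                return False  # dissimilar pair sharing a bucket
    return True


if __name__ == "__main__":
    print(is_sensitive(delete_one_buckets, 4, "01", 1, 3))
    print(is_sensitive(delete_one_buckets, 4, "01", 2, 3))
```

On binary sequences of length 4, the toy function passes the check for (1, 3): a single substitution can be deleted from both sequences to reach a common bucket, and any shared bucket implies an edit distance of at most 2. It fails for (2, 3), since for example 0000 and 1100 are at distance 2 but share no bucket. Under this toy function each sequence lands in at most n buckets, which is the kind of per-sequence cost the abstract refers to.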

Cite As

Ke Chen and Mingfu Shao. Locality-Sensitive Bucketing Functions for the Edit Distance. In 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 242, pp. 22:1-22:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022) https://doi.org/10.4230/LIPIcs.WABI.2022.22

Author Details

Ke Chen

Department of Computer Science and Engineering, School of Electronic Engineering and Computer Science, The Pennsylvania State University, University Park, PA, United States

Mingfu Shao

Department of Computer Science and Engineering, School of Electronic Engineering and Computer Science, The Pennsylvania State University, University Park, PA, United States
Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, United States

Funding

This work is supported by the US National Science Foundation (DBI-2019797) and the US National Institutes of Health (R01HG011065).

Supplementary Materials

Software (Source Code) https://github.com/Shao-Group/lsbucketing

