DROPS

Document

DOI: 10.4230/LIPIcs.WABI.2017.7

Fast Spaced Seed Hashing

Authors: Samuele Girotto, Matteo Comin, and Cinzia Pizzi

Published in: LIPIcs, Volume 88, 17th International Workshop on Algorithms in Bioinformatics (WABI 2017)

Abstract

Hashing k-mers is a common operation across many bioinformatics applications, widely used for indexing, querying, and rapid similarity search. Recently, spaced seeds, a special type of pattern that accounts for errors or mutations, have come to be routinely used instead of k-mers. Spaced seeds improve sensitivity with respect to k-mers in many applications; however, hashing spaced seeds substantially increases the computational time. Hence, the ability to speed up the hashing of spaced seeds would have a major impact in the field, making spaced seed applications not only accurate but also faster and more efficient. In this paper we address the problem of efficient spaced seed hashing. The proposed algorithm exploits the similarity of adjacent spaced seed hash values in an input sequence in order to efficiently compute the next hash. We report a series of experiments on NGS reads hashing using several spaced seeds. In the experiments, our algorithm computes the hash values of spaced seeds with a speedup, with respect to the traditional approach, of between 1.6x and 5.3x, depending on the structure of the spaced seed.
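To make the cost difference mentioned in the abstract concrete, the following is a minimal sketch of the baseline only, not the paper's accelerated algorithm. It assumes a 2-bit encoding (A=0, C=1, G=2, T=3), a seed written as a string of 1s (positions that count) and 0s (positions that are ignored), and big-endian concatenation of the encoded symbols; the function names and these conventions are illustrative assumptions, not taken from the paper.

```python
# Assumed 2-bit nucleotide encoding (illustrative convention).
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_hashes(seq, k):
    """Rolling hash of contiguous k-mers: each new hash reuses the previous one."""
    mask = (1 << (2 * k)) - 1
    h = 0
    out = []
    for i, c in enumerate(seq):
        # Shift in the new symbol and drop the oldest one via the mask.
        h = ((h << 2) | ENCODE[c]) & mask
        if i >= k - 1:
            out.append(h)
    return out


def spaced_seed_hashes(seq, seed):
    """Baseline spaced seed hashing: every position is recomputed from scratch."""
    care = [j for j, c in enumerate(seed) if c == "1"]
    out = []
    for i in range(len(seq) - len(seed) + 1):
        h = 0
        for j in care:
            # Only symbols under a '1' of the seed contribute to the hash.
            h = (h << 2) | ENCODE[seq[i + j]]
        out.append(h)
    return out


print(kmer_hashes("ACGT", 2))
print(spaced_seed_hashes("ACGTAC", "101"))
```

In this sketch the k-mer hash costs a constant number of operations per position, while the spaced seed hash costs one operation per 1 in the seed at every position, because the gaps prevent a simple shift-and-mask update. That gap is the overhead the abstract refers to.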

Cite as

Samuele Girotto, Matteo Comin, and Cinzia Pizzi. Fast Spaced Seed Hashing. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 7:1-7:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

@InProceedings{girotto_et_al:LIPIcs.WABI.2017.7,
  author =	{Girotto, Samuele and Comin, Matteo and Pizzi, Cinzia},
  title =	{{Fast Spaced Seed Hashing}},
  booktitle =	{17th International Workshop on Algorithms in Bioinformatics (WABI 2017)},
  pages =	{7:1--7:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-050-7},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{88},
  editor =	{Schwartz, Russell and Reinert, Knut},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2017.7},
  URN =		{urn:nbn:de:0030-drops-76501},
  doi =		{10.4230/LIPIcs.WABI.2017.7},
  annote =	{Keywords: k-mers, spaced seeds, efficient hashing}
}

Document

DOI: 10.4230/DagSemProc.10231.4

Efficient computation of statistics for words with mismatches

Authors: Cinzia Pizzi

Published in: Dagstuhl Seminar Proceedings, Volume 10231, Structure Discovery in Biology: Motifs, Networks & Phylogenies (2010)

Abstract

Since the early stages of bioinformatics, substrings have played a crucial role in the search for and discovery of significant biological signals. Despite the advent of a large number of different approaches and models to accomplish these tasks, substrings continue to be widely used to determine statistical distributions and compositions of biological sequences at various levels of detail. Here we overview efficient algorithms that were recently proposed to compute the actual and the expected frequency for words with k mismatches, when it is assumed that the words of interest occur at least once exactly in the sequence under analysis. Efficiency means these algorithms are polynomial in k, rather than exponential as with an enumerative approach, and independent of the length of the query word. These algorithms are all based on a common incremental approach: a preprocessing step that makes it possible to efficiently answer queries related to any word occurring in the text. The same approach can be used with a sliding-window scan of the sequence to compute the same statistics for words of fixed length, even more efficiently. The efficient computation of both expected and actual frequency of substrings, combined with a study of the monotonicity of popular scores such as z-scores, makes it possible to build tables of feasible size in reasonable time, and can therefore be used in practical applications.

Cite as

Cinzia Pizzi. Efficient computation of statistics for words with mismatches. In Structure Discovery in Biology: Motifs, Networks & Phylogenies. Dagstuhl Seminar Proceedings, Volume 10231, pp. 1-22, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2010)

@InProceedings{pizzi:DagSemProc.10231.4,
  author =	{Pizzi, Cinzia},
  title =	{{Efficient computation of statistics for words with mismatches}},
  booktitle =	{Structure Discovery in Biology: Motifs, Networks \& Phylogenies},
  pages =	{1--22},
  series =	{Dagstuhl Seminar Proceedings (DagSemProc)},
  ISSN =	{1862-4405},
  year =	{2010},
  volume =	{10231},
  editor =	{Apostolico, Alberto and Dress, Andreas and Parida, Laxmi},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/DagSemProc.10231.4},
  URN =		{urn:nbn:de:0030-drops-27384},
  doi =		{10.4230/DagSemProc.10231.4},
  annote =	{Keywords: Statistics on words, mismatches, dynamic programming, biological sequences.}
}

Document

DOI: 10.4230/DagSemProc.06201.5

On the Monotonicity of the String Correction Factor for Words with Mismatches

Authors: Alberto Apostolico and Cinzia Pizzi

Published in: Dagstuhl Seminar Proceedings, Volume 6201, Combinatorial and Algorithmic Foundations of Pattern and Association Discovery (2006)

Abstract

The string correction factor is the term by which the probability of a word $w$ needs to be multiplied in order to account for character changes or "errors" occurring in at most $k$ arbitrary positions in that word. The behavior of this factor, as a function of $k$ and of the word length, has implications for the number of candidates that need to be considered and weighted when looking for subwords of a sequence that present unusually recurrent replicas within some bounded number of mismatches. Specifically, it is seen that over intervals of mono- or bi-tonicity for the correction factor, only some of the candidates need be considered. This mitigates the computation and leads to tables of over-represented words that are more compact to represent and inspect. In recent work, expectation and score monotonicity has been established for a number of cases of interest, under i.i.d. probabilistic assumptions. The present paper reviews the cases of bi-tonic behavior for the correction factor, concentrating on the instance in which the question is still open.
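As a worked illustration of the definition above, the sketch below computes a correction factor under one reading of it, assuming i.i.d. symbols with known probabilities: replacing the symbol at a position multiplies the word probability by (1 - p) / p for that position, so the factor is the sum of products of these ratios over all sets of at most k positions. The function name, the dictionary of symbol probabilities, and this formulation are assumptions made for illustration and are not taken from the paper.

```python
def correction_factor(word, k, probs):
    """Factor C such that P(word) * C is the probability of observing
    the word with at most k mismatches, under an assumed i.i.d. model."""
    # e[j] accumulates the sum, over all sets of exactly j positions,
    # of the product of the per-position ratios (1 - p) / p.
    e = [1.0] + [0.0] * k
    for ch in word:
        r = (1.0 - probs[ch]) / probs[ch]
        # Update from high j to low j so each position is used at most once.
        for j in range(k, 0, -1):
            e[j] += e[j - 1] * r
    return sum(e)


uniform = {c: 0.25 for c in "ACGT"}
print(correction_factor("AC", 1, uniform))
print(correction_factor("AC", 2, uniform))
```

With uniform DNA probabilities every ratio equals 3, so for a word of length 2 the factor is 1 + 2*3 = 7 for k = 1 and 1 + 6 + 9 = 16 for k = 2. The latter equals 1 / P(w), as expected when every position may change.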

Cite as

Alberto Apostolico and Cinzia Pizzi. On the Monotonicity of the String Correction Factor for Words with Mismatches. In Combinatorial and Algorithmic Foundations of Pattern and Association Discovery. Dagstuhl Seminar Proceedings, Volume 6201, pp. 1-9, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2006)

@InProceedings{apostolico_et_al:DagSemProc.06201.5,
  author =	{Apostolico, Alberto and Pizzi, Cinzia},
  title =	{{On the Monotonicity of the String Correction Factor for Words with Mismatches}},
  booktitle =	{Combinatorial and Algorithmic Foundations of Pattern and Association Discovery},
  pages =	{1--9},
  series =	{Dagstuhl Seminar Proceedings (DagSemProc)},
  ISSN =	{1862-4405},
  year =	{2006},
  volume =	{6201},
  editor =	{Ahlswede, Rudolf and Apostolico, Alberto and Levenshtein, Vladimir I.},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/DagSemProc.06201.5},
  URN =		{urn:nbn:de:0030-drops-7899},
  doi =		{10.4230/DagSemProc.06201.5},
  annote =	{Keywords: Pattern discovery, Motif, Over-represented word, Monotone score, Correction Factor}
}

3 Search Results for "Pizzi, Cinzia"
