Analysis of the Period Recovery Error Bound

Authors Amihood Amir, Itai Boneh, Michael Itzhaki, Eitan Kondratovsky



PDF
Thumbnail PDF

File

LIPIcs.ESA.2020.5.pdf
  • Filesize: 0.78 MB
  • 21 pages

Document Identifiers

Author Details

Amihood Amir
  • Department of Computer Science, Bar-Ilan University, Ramat Gan, Israel
Itai Boneh
  • Department of Computer Science, Bar-Ilan University, Ramat Gan, Israel
Michael Itzhaki
  • Department of Computer Science, Bar-Ilan University, Ramat Gan, Israel
Eitan Kondratovsky
  • Department of Computer Science, Bar-Ilan University, Ramat Gan, Israel

Cite AsGet BibTex

Amihood Amir, Itai Boneh, Michael Itzhaki, and Eitan Kondratovsky. Analysis of the Period Recovery Error Bound. In 28th Annual European Symposium on Algorithms (ESA 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 173, pp. 5:1-5:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)
https://doi.org/10.4230/LIPIcs.ESA.2020.5

Abstract

The recovery problem is the problem whose input is a corrupted text T that was originally periodic, and where one wishes to recover its original period. The algorithm’s input is T without any information about either the period’s length or the period itself. An algorithm that solves this problem is called a recovery algorithm. In order to make recovery possible, there must be some assumption that not "too many" errors corrupted the initial periodic string. This is called the error bound. In previous recovery algorithms, it was shown that a given error bound of n/((2+ε)p) can lead to O(log_{1+ε} n) period candidates, that are guaranteed to include the original period, where p is the length of the original period (unknown by the algorithm) and ε > 0 is an arbitrary constant. This paper provides the first analysis of the relationship between the error bound and the number of candidates, as well as identification of the error parameters that still guarantee recovery. We improve the previously known upper error bound on the number of corruptions, n/((2+ε)p), that outputs O(log_{1+ε} n) period candidates. We show how to (1) remove ε from the bound, (2) relax the error bound to allow more errors while keeping the candidates set of size O(log n). It turns out that this relaxation on the previously known upper bound is quite challenging. To achieve this result we provide what, to our knowledge, is the first known non-trivial lower bound on the Hamming distance between two periodic strings. This proof leads to an error bound, that produces a family of period candidates of size 2log₃ n. We show that this result is tight and further provide a compact representation of the period candidates. We call this representation the canonic period seed. In addition to providing less restrictive error bounds that guarantee a smaller candidate set, we also provide a hierarchy of more restrictive upper error bounds that asymptotically reduces the size of the potential period candidate set.

Subject Classification

ACM Subject Classification
  • Theory of computation → Pattern matching
  • Theory of computation → Sorting and searching
Keywords
  • Period Recovery
  • Period Recovery Hierarchy
  • Hamming Distance

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. S. Aluru, A. Apostolico, and S. V. Thankachan. Efficient alignment free sequence comparison with bounded mismatches. In Proc. 19th Research in Computational Molecular Biology Conference, RECOMB, pages 1-12, 2015. Google Scholar
  2. A. Amir, M. Amit, G.M.Landau, and D. Sokol. Period recovery of strings over the hamming and edit distances. Theortetical Computer Science, 710:2-18, 2018. Google Scholar
  3. A. Amir, E. Eisenberg, A. Levy, E. Porat, and N. Shapira. Cycle detection and correction. ACM Trans. Alg., 9(1):13, 2012. Google Scholar
  4. A. Amir and D. Keselman. Maximum agreement subtree in a set of evolutionary trees - metrics and efficient algorithms. Proc. FOCS 94, pages 758-769, 1994. Google Scholar
  5. A. Amir, A. Levy, M. Lewenstein, R. Lubin, and B. Porat. Can we recover the cover? In Proc. 28st Annual Symposium on Combinatorial Pattern Matching (CPM), LIPICS, 2017. Google Scholar
  6. A. Amir, M. Lewenstein, and E. Porat. Approximate subset matching with "don't care"s. In Proc. 12th ACM-SIAM Symp. on Discrete Algorithms (SODA), pages 305-306, 2001. Google Scholar
  7. R.L. Cann, M. Stoneking, and A.C. Wilson. Mitochondrial DNA and human evolution. Nature, 325(6099):31-36, 1987. Google Scholar
  8. M. Farach, S. Kannan, and T. Warnow. A robust model for finding optimal evolutionary trees. Proc. 25th Annual ACM Symposium on the Theory of Computing, pages 137-145, 1993. Google Scholar
  9. M. Farach, T. M. Przytycka, and M. Thorup. Computing the agreement of trees with bounded degrees. Proc. 3rd European Symposium on Algorithms, pages 381-393, 1995. Google Scholar
  10. N. J. Fine and H. S. Wilf. Uniqueness theorems for periodic functions. Proc. Amer. Math. Soc., 16:109-114, 1965. Google Scholar
  11. S. Har-Peled and S. Mahabadi. Proximity in the age of distraction: Robust approximate nearest neighbor search. In Proc. 28th ACM-SIAM Symposium on Discrete Algorithms, SODA, pages 1-15, 2017. Google Scholar
  12. S. Heydrich and A. Wiese. Faster approximation schemes for the two-dimensional knapsack problem. In Proc. 28th ACM-SIAM Symposium on Discrete Algorithms, SODA, pages 79-98, 2017. Google Scholar
  13. C. Kalaitzis. An improved approximation guarantee for the maximum budgeted allocation problem. In Proc. 27th ACM-SIAM Symposium on Discrete Algorithms, SODA, pages 1048-1066, 2016. Google Scholar
  14. T. Kociumaka, J. Radoszewski, W. Rytter, J. Straszyński, T. Waleń, and W. Zuba. Faster recovery of approximate periods over edit distance. In Proc. 25th International Symposium on String Processing and Information Retrieval (SPIRE), LNCS, pages 233-240. Springer, 2018. Google Scholar
  15. L.A.B. Kowada, D. Doerr, S. Dantas, and J. Stoye. New genome similarity measures based on conserved gene adjacencies. In Proc. 20th Research in Computational Molecular Biology Conference, RECOMB, pages 204-224, 2016. Google Scholar
  16. A. Ojewole, J.D. Jou, V.G. Fowler, and B.R. Donald. Bbk^* (branch and bound over k^*): A provable and efficient ensemble-based algorithm to optimize stability and binding affinity over large sequence spaces. In Proc. 21st Research in Computational Molecular Biology Conference, RECOMB, pages 157-172, 2017. Google Scholar
  17. A. Sobih, A. I. Tomescu, and V. Mäkinen. Metaflow: Metagenomic profiling based on whole-genome coverage analysis with min-cost flows. In Proc. 20th Research in Computational Molecular Biology Conference, RECOMB, pages 111-121, 2016. Google Scholar
  18. M. Steel and T. Warnow. Kaikoura tree theorems: Computing the maximum agreement subtree. Information Processing Letters, 48(2):77-82, 1993. Google Scholar
  19. M. A. Steel and D. Penny. Distributions of tree comparison metrics - some new results. Syst. Biol., 42:126-141, 1993. Google Scholar
  20. E. Ukkonen. A linear-time algorithm for finding approximate shortest common superstrings. Algorithmica, 5:313-323, 1990. Google Scholar
  21. J. C. Venter and Celera Genomics Corporation. The sequence of the human genome. Science, (291):1304-1351, 2001. Google Scholar
  22. C. Wulff-Nilsen. Approximate distance oracles for planar graphs with improved query time-space tradeoff. In Proc. 27th ACM-SIAM Symposium on Discrete Algorithms, SODA, pages 351-362, 2016. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail