Dictionary Matching in Elastic-Degenerate Texts with Applications in Searching VCF Files On-line

Authors Solon P. Pissis, Ahmad Retha




Author Details

Solon P. Pissis
  • Department of Informatics, King’s College London, London, UK
Ahmad Retha
  • Department of Informatics, King’s College London, London, UK

Cite As

Solon P. Pissis and Ahmad Retha. Dictionary Matching in Elastic-Degenerate Texts with Applications in Searching VCF Files On-line. In 17th International Symposium on Experimental Algorithms (SEA 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 103, pp. 16:1-16:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)
https://doi.org/10.4230/LIPIcs.SEA.2018.16

Abstract

An elastic-degenerate string is a sequence of n sets of strings of total length N. It was introduced to represent multiple sequence alignments of closely related sequences in a compact form. For a standard pattern of length m, pattern matching in an elastic-degenerate text can be solved on-line in time O(nm^2+N) with pre-processing time and space O(m) (Grossi et al., CPM 2017). A fast bit-vector algorithm requiring time O(N * ceil[m/w]) with pre-processing time and space O(m * ceil[m/w]), where w is the size of the computer word, was also presented. In this paper we consider the same problem for a set of patterns of total length M. A straightforward generalization of the existing bit-vector algorithm would require time O(N * ceil[M/w]) with pre-processing time and space O(M * ceil[M/w]), which is prohibitive in practice. We present a new on-line O(N * ceil[M/w])-time algorithm with pre-processing time and space O(M). We present experimental results on both synthetic and real data that demonstrate the performance of the algorithm. We further demonstrate a real application of our algorithm in a pipeline for the discovery and verification of minimal absent words (MAWs) in the human genome, showing that a significant number of previously discovered MAWs are in fact false positives when a population's variants are considered.
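To make the setting concrete, the following is a minimal sketch of bit-vector (Shift-And style) matching of a single pattern over an elastic-degenerate text. It is an illustration under stated assumptions, not the algorithm of the paper: the text is modelled as a Python list of sets of strings, the function names `ed_shift_and` and `ed_dictionary_naive` are hypothetical, and Python's unbounded integers stand in for the ceil[m/w] machine words. The dictionary wrapper simply runs the single-pattern search once per pattern.

```python
from typing import Dict, Iterable, List, Sequence, Set


def ed_shift_and(text: Sequence[Set[str]], pattern: str) -> List[int]:
    """Return the indices of segments in which an occurrence of `pattern` ends.

    `text` is an elastic-degenerate string: a sequence of sets of strings.
    Illustrative sketch only; names and representation are assumptions.
    """
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")

    # masks[c] has bit j set iff pattern[j] == c.
    masks: Dict[str, int] = {}
    for j, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << j)
    accept = 1 << (m - 1)

    state_in = 0  # pattern prefixes alive when entering the current segment
    hits: List[int] = []
    for i, segment in enumerate(text):
        state_out = 0
        found = False
        for alternative in segment:
            # Every alternative starts from the same incoming state.
            state = state_in
            for c in alternative:
                state = ((state << 1) | 1) & masks.get(c, 0)
                if state & accept:
                    found = True
            # The outgoing state is the union over all alternatives.
            state_out |= state
        if found:
            hits.append(i)
        state_in = state_out
    return hits


def ed_dictionary_naive(
    text: Sequence[Set[str]], patterns: Iterable[str]
) -> Dict[str, List[int]]:
    """Naive dictionary matching: one independent search per pattern."""
    return {p: ed_shift_and(text, p) for p in patterns}


if __name__ == "__main__":
    # n = 3 sets; total length N = 1 + (1 + 2) + 1 = 5.
    example = [{"A"}, {"C", "CG"}, {"T"}]
    print(ed_dictionary_naive(example, ["ACT", "ACGT", "AGT", "CG"]))
```

Each text letter costs one shift, one OR and one AND on an m-bit vector, which is where a bound of the form O(N * ceil[m/w]) comes from for a single pattern. Running the search once per pattern, as the wrapper does, is only a baseline for comparison.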

Subject Classification

ACM Subject Classification
  • Theory of computation → Pattern matching
Keywords
  • on-line algorithms
  • algorithms on strings
  • dictionary matching
  • elastic-degenerate string
  • Variant Call Format

References

  1. Stephen Alstrup, Jens P. Secher, and Maz Spork. Optimal on-line decremental connectivity in trees. Inf. Process. Lett., 64(4):161-164, 1997. URL: http://dx.doi.org/10.1016/S0020-0190(97)00170-1.
  2. Uwe Baier, Timo Beller, and Enno Ohlebusch. Graphical pan-genome analysis with compressed suffix trees and the Burrows-Wheeler transform. Bioinformatics, 32(4):497-504, 2016.
  3. Giulia Bernardini, Nadia Pisanti, Solon P. Pissis, and Giovanna Rosone. Pattern matching on elastic-degenerate text with errors. In Gabriele Fici, Marinella Sciortino, and Rossano Venturini, editors, String Processing and Information Retrieval - 24th International Symposium, SPIRE 2017, Palermo, Italy, September 26-29, 2017, Proceedings, volume 10508 of Lecture Notes in Computer Science, pages 74-90. Springer, 2017. URL: http://dx.doi.org/10.1007/978-3-319-67428-5_7.
  4. M. Crochemore, C. Hancart, and T. Lecroq. Algorithms on Strings. Cambridge University Press, 2007.
  5. Martin Farach. Optimal suffix tree construction with large alphabets. In 38th Annual Symposium on Foundations of Computer Science, FOCS '97, Miami Beach, Florida, USA, October 19-22, 1997, pages 137-143. IEEE Computer Society, 1997. URL: http://dx.doi.org/10.1109/SFCS.1997.646102.
  6. Travis Gagie, Danny Hermelin, Gad M. Landau, and Oren Weimann. Binary jumbled pattern matching on trees and tree-like structures. Algorithmica, 73(3):571-588, 2015. URL: http://dx.doi.org/10.1007/s00453-014-9957-6.
  7. Roberto Grossi, Costas S. Iliopoulos, Chang Liu, Nadia Pisanti, Solon P. Pissis, Ahmad Retha, Giovanna Rosone, Fatima Vayani, and Luca Versari. On-Line Pattern Matching on Similar Texts. In Juha Kärkkäinen, Jakub Radoszewski, and Wojciech Rytter, editors, 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017), volume 78 of Leibniz International Proceedings in Informatics (LIPIcs), pages 9:1-9:14, Dagstuhl, Germany, 2017. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik. URL: http://dx.doi.org/10.4230/LIPIcs.CPM.2017.9.
  8. Guillaume Holley, Roland Wittler, and Jens Stoye. Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage. Algorithms for Molecular Biology, 11:3, 2016.
  9. Lin Huang, Victoria Popic, and Serafim Batzoglou. Short read alignment with populations of genomes. Bioinformatics, 29(13):361-370, 2013.
  10. T. Hubbard, D. Barker, E. Birney, G. Cameron, Y. Chen, L. Clark, T. Cox, J. Cuff, V. Curwen, T. Down, R. Durbin, E. Eyras, J. Gilbert, M. Hammond, L. Huminiecki, A. Kasprzyk, H. Lehvaslaiho, P. Lijnzaad, C. Melsopp, E. Mongin, R. Pettett, M. Pocock, S. Potter, A. Rust, E. Schmidt, S. Searle, G. Slater, J. Smith, W. Spooner, A. Stabenau, J. Stalker, E. Stupka, A. Ureta-Vidal, I. Vastrik, and M. Clamp. The Ensembl genome database project. Nucleic Acids Research, 30(1):38-41, 2002. URL: http://dx.doi.org/10.1093/nar/30.1.38.
  11. Alice Héliou, Solon P. Pissis, and Simon J. Puglisi. emMAW: computing minimal absent words in external memory. Bioinformatics, 33(17):2746-2749, 2017. URL: http://dx.doi.org/10.1093/bioinformatics/btx209.
  12. Costas S. Iliopoulos, Ritu Kundu, and Solon P. Pissis. Efficient pattern matching in elastic-degenerate texts. In Frank Drewes, Carlos Martín-Vide, and Bianca Truthe, editors, Language and Automata Theory and Applications - 11th International Conference, LATA 2017, Umeå, Sweden, March 6-9, 2017, Proceedings, volume 10168 of Lecture Notes in Computer Science, pages 131-142, 2017. URL: http://dx.doi.org/10.1007/978-3-319-53733-7_9.
  13. Paul Julian Kersey, James E. Allen, Irina Armean, Sanjay Boddu, Bruce J. Bolt, Denise Carvalho-Silva, Mikkel Christensen, Paul Davis, Lee J. Falin, Christoph Grabmueller, Jay C. Humphrey, Arnaud Kerhornou, Julia Khobova, Naveen K. Aranganathan, Nicholas Langridge, Ernesto Lowy, Mark D. McDowall, Uma Maheswari, Michael Nuhn, Chuang Kee Ong, Bert Overduin, Michael Paulini, Helder Pedro, Emily Perry, Giulietta Spudich, Electra Tapanari, Brandon Walts, Gareth Williams, Marcela K. Tello-Ruiz, Joshua C. Stein, Sharon Wei, Doreen Ware, Daniel M. Bolser, Kevin L. Howe, Eugene Kulesha, Daniel Lawson, Gareth Maslen, and Daniel M. Staines. Ensembl genomes 2016: more genomes, more complexity. Nucleic Acids Research, 44(Database-Issue):574-580, 2016.
  14. Sorina Maciuca, Carlos del Ojo Elias, Gil McVean, and Zamin Iqbal. A natural encoding of genetic variation in a Burrows-Wheeler transform to enable mapping and genome inference. In Martin C. Frith and Christian Nørgaard Storm Pedersen, editors, Algorithms in Bioinformatics - 16th International Workshop, WABI 2016, Aarhus, Denmark, August 22-24, 2016. Proceedings, volume 9838 of Lecture Notes in Computer Science, pages 222-233. Springer, 2016. URL: http://dx.doi.org/10.1007/978-3-319-43681-4_18.
  15. Joong Chae Na, Hyunjoon Kim, Heejin Park, Thierry Lecroq, Martine Léonard, Laurent Mouchard, and Kunsoo Park. FM-index of alignment: A compressed index for similar strings. Theor. Comput. Sci., 638:159-170, 2016.
  16. Gonzalo Navarro and Mathieu Raffinot. Flexible Pattern Matching in Strings: Practical On-line Search Algorithms for Texts and Biological Sequences. Cambridge University Press, 2002.
  17. Ngan Nguyen, Glenn Hickey, Daniel R. Zerbino, Brian J. Raney, Dent Earl, Joel Armstrong, W. James Kent, David Haussler, and Benedict Paten. Building a pan-genome reference for a population. Journal of Computational Biology, 22(5):387-401, 2015.
  18. Nadia Ben Nsira, Mourad Elloumi, and Thierry Lecroq. On-line string matching in highly similar DNA sequences. Mathematics in Computer Science, 11(2):113-126, 2017. URL: http://dx.doi.org/10.1007/s11786-016-0280-2.
  19. Nadia Ben Nsira, Thierry Lecroq, and Mourad Elloumi. A fast Boyer-Moore type pattern matching algorithm for highly similar sequences. IJDMB, 13(3):266-288, 2015. URL: http://dx.doi.org/10.1504/IJDMB.2015.072101.
  20. Siavash Sheikhizadeh, M. Eric Schranz, Mehmet Akdel, Dick de Ridder, and Sandra Smit. Pantools: representation, storage and exploration of pan-genomic data. Bioinformatics, 32(17):487-493, 2016.
  21. Raquel M. Silva, Diogo Pratas, Luísa Castro, Armando J. Pinho, and Paulo J. S. G. Ferreira. Three minimal sequences found in Ebola virus genomes and absent from human DNA. Bioinformatics, 31(15):2421-2425, 2015. URL: http://dx.doi.org/10.1093/bioinformatics/btv189.
  22. Jouni Sirén. Indexing variation graphs. In Sándor P. Fekete and Vijaya Ramachandran, editors, Proceedings of the Nineteenth Workshop on Algorithm Engineering and Experiments, ALENEX 2017, Barcelona, Spain, Hotel Porta Fira, January 17-18, 2017, pages 13-27. SIAM, 2017. URL: http://dx.doi.org/10.1137/1.9781611974768.2.
  23. The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature, 526(7571):68-74, 2015.
  24. The Computational Pan-Genomics Consortium. Computational pan-genomics: status, promises and challenges. Briefings in Bioinformatics, pages 1-18, 2016.