Detecting Mutations by eBWT

Authors: Nicola Prezza, Nadia Pisanti, Marinella Sciortino, Giovanna Rosone

Author Details

Nicola Prezza
  • Dipartimento di Informatica, University of Pisa, Italy
Nadia Pisanti
  • Dipartimento di Informatica, University of Pisa, Italy, and ERABLE Team, INRIA, Lyon, France
Marinella Sciortino
  • Dipartimento di Matematica e Informatica, University of Palermo, Italy
Giovanna Rosone
  • Dipartimento di Informatica, University of Pisa, Italy

Cite As

Nicola Prezza, Nadia Pisanti, Marinella Sciortino, and Giovanna Rosone. Detecting Mutations by eBWT. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 3:1-3:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)


Abstract

In this paper we develop a theory describing how the extended Burrows-Wheeler Transform (eBWT) of a collection of DNA fragments tends to cluster together the copies of nucleotides sequenced from a genome G. Our theory accurately predicts how many copies of any nucleotide are expected inside each such cluster, and how an elegant and precise LCP-array-based procedure can locate these clusters in the eBWT. Our findings are very general and can be applied to a wide range of different problems. In this paper, we consider the case of alignment-free and reference-free SNP discovery in multiple collections of reads. We note that, in accordance with our theoretical results, SNPs are clustered in the eBWT of the read collection, and we develop a tool that finds SNPs with a simple scan of the eBWT and LCP arrays. Preliminary results show that our method requires much less coverage than state-of-the-art tools while drastically improving precision and sensitivity.
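To make the idea of locating clusters with a scan of the eBWT and LCP arrays more concrete, here is a minimal toy sketch in Python. It is not the authors' tool and does not reproduce the paper's procedure or statistics: it builds the two arrays naively by sorting all suffixes, ignores reverse complements and sequencing errors, and the function names (`ebwt_and_lcp`, `candidate_clusters`) and the parameters `k` and `min_count` are assumptions made only for this illustration.

```python
from collections import Counter

def ebwt_and_lcp(reads):
    """Naive construction: sort every suffix of every read (toy sizes only).

    Each read gets its own end-marker (0, i), which sorts before any
    letter (1, c) and never matches another read's end-marker.
    """
    suffixes = []
    for i, read in enumerate(reads):
        keyed = [(1, c) for c in read] + [(0, i)]
        for j in range(len(keyed)):
            # Symbol preceding the suffix; "$" stands for the end-marker.
            prev = read[j - 1] if j > 0 else "$"
            suffixes.append((keyed[j:], prev))
    suffixes.sort(key=lambda t: t[0])
    bwt = [prev for _, prev in suffixes]
    lcp = [0] * len(suffixes)
    for i in range(1, len(suffixes)):
        a, b = suffixes[i - 1][0], suffixes[i][0]
        n = 0
        while n < min(len(a), len(b)) and a[n] == b[n]:
            n += 1
        lcp[i] = n  # longest common prefix of adjacent sorted suffixes
    return bwt, lcp

def candidate_clusters(bwt, lcp, k, min_count=2):
    """Scan once; report maximal ranges whose suffixes share >= k symbols
    and whose preceding symbols contain >= 2 letters seen min_count times."""
    clusters = []
    start = 0
    n = len(bwt)
    for i in range(1, n + 1):
        if i == n or lcp[i] < k:
            # [start, i - 1] is a maximal range with LCP >= k inside it.
            if i - start > 1:
                counts = Counter(c for c in bwt[start:i] if c != "$")
                frequent = {c: m for c, m in counts.items() if m >= min_count}
                if len(frequent) >= 2:
                    clusters.append((start, i - 1, frequent))
            start = i
    return clusters

# Hypothetical reads: two samples differ by one base (G vs C) before "TTCAG".
reads = ["AGTTCAG", "AGTTCAG", "ACTTCAG", "ACTTCAG"]
bwt, lcp = ebwt_and_lcp(reads)
clusters = candidate_clusters(bwt, lcp, k=5)
print(clusters)
```

In this toy input the only reported range is the one whose suffixes all begin with `TTCAG`, where the preceding symbols split between `G` and `C`. A real implementation would build the arrays with a lightweight construction algorithm rather than by explicit sorting.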

Subject Classification

ACM Subject Classification
  • Applied computing → Bioinformatics
  • Mathematics of computing → Combinatorial algorithms

Keywords
  • BWT
  • LCP Array
  • SNPs
  • Reference-free
  • Assembly-free



