Detecting Mutations by eBWT

Prezza, Nicola; Pisanti, Nadia; Sciortino, Marinella; Rosone, Giovanna

doi:10.4230/LIPIcs.WABI.2018.3

Abstract

In this paper we develop a theory describing how the extended Burrows-Wheeler Transform (eBWT) of a collection of DNA fragments tends to cluster together the copies of nucleotides sequenced from a genome G. Our theory accurately predicts how many copies of any nucleotide are expected inside each such cluster, and how an elegant and precise LCP-array-based procedure can locate these clusters in the eBWT.
Our findings are very general and can be applied to a wide range of problems. In this paper, we consider the case of alignment-free and reference-free SNP discovery in multiple collections of reads. We note that, in accordance with our theoretical results, SNPs are clustered in the eBWT of the reads collection, and we develop a tool that finds SNPs with a simple scan of the eBWT and LCP arrays. Preliminary results show that our method requires much less coverage than state-of-the-art tools while drastically improving precision and sensitivity.
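
The clustering idea can be illustrated with a toy sketch. This is not the authors' tool, and it makes several assumptions that the abstract does not state: each read is terminated by an end marker `$`, the eBWT and LCP arrays are built naively by sorting all suffixes, and a "cluster" is simplified to a maximal run of adjacent suffixes sharing a prefix of length at least `k`. The names `ebwt_and_lcp`, `candidate_clusters`, `k` and `min_count` are hypothetical. A cluster whose eBWT characters include two or more distinct nucleotides is reported as a candidate variant site.

```python
from collections import Counter

END = "$"  # assumed end-of-read marker; it sorts before A, C, G, T


def common_prefix_len(a, b):
    """Length of the common prefix of a and b, never matching across END."""
    n = 0
    for x, y in zip(a, b):
        if x != y or x == END:
            break
        n += 1
    return n


def ebwt_and_lcp(reads):
    """Naively build the sorted suffixes, eBWT and LCP array of a read set.

    Quadratic-time toy construction for illustration only.
    """
    rows = []
    for idx, read in enumerate(reads):
        s = read + END
        for j in range(len(s)):
            # s[j - 1] is the character preceding the suffix (END when j == 0).
            rows.append((s[j:], idx, s[j - 1]))
    # Ties between identical suffixes are broken by read index.
    rows.sort(key=lambda r: (r[0], r[1]))
    suffixes = [r[0] for r in rows]
    bwt = [r[2] for r in rows]
    lcp = [0] + [
        common_prefix_len(suffixes[i - 1], suffixes[i])
        for i in range(1, len(rows))
    ]
    return suffixes, bwt, lcp


def candidate_clusters(reads, k, min_count=1):
    """Scan eBWT and LCP once; report runs with >= 2 distinct nucleotides."""
    suffixes, bwt, lcp = ebwt_and_lcp(reads)
    found = []
    start = 0
    for i in range(1, len(bwt) + 1):
        # A run ends where the LCP drops below k, or at the end of the array.
        if i == len(bwt) or lcp[i] < k:
            if i - start > 1:
                counts = Counter(c for c in bwt[start:i] if c != END)
                frequent = {c: n for c, n in counts.items() if n >= min_count}
                if len(frequent) >= 2:
                    found.append((suffixes[start][:k], frequent))
            start = i
    return found


# Two toy reads that differ in a single position (T vs C before "ACA").
print(candidate_clusters(["GATTACA", "GATCACA"], k=3))
```

In this toy input the two reads share the right context `ACA` after the differing position, so the suffixes starting with `ACA` become adjacent after sorting and their preceding characters, `T` and `C`, end up in the same cluster. On real data, sequencing errors would also produce mixed clusters, which is why the sketch exposes a `min_count` threshold. The threshold value and the run-based cluster definition are simplifications and are not taken from the paper.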

