abSNP: RNA-Seq SNP Calling in Repetitive Regions via Abundance Estimation

Mao, Shunfu; Mohajer, Soheil; Ramachandran, Kannan; Tse, David; Kannan, Sreeram

doi:10.4230/LIPIcs.WABI.2017.15

Abstract

Variant calling, and in particular calling SNPs (Single Nucleotide Polymorphisms), is a fundamental task in genomics. While existing packages offer excellent performance when calling SNPs covered by uniquely mapped reads, they struggle at loci where reads are multiply mapped and are unable to make reliable calls there. Variants in multiply mapped loci can arise, for example, in long segmental duplications, and can play an important role in evolution and disease.

In this paper, we develop a new SNP caller named abSNP, which offers three innovations. (a) abSNP calls SNPs from RNA-Seq data. Since RNA-Seq data is primarily sampled from gene regions, this method is inexpensive. (b) abSNP makes calls in repetitive gene regions by carefully exploiting the quality scores of multiply mapped reads. (c) abSNP exploits a specific feature of RNA-Seq data, namely the varying abundance of different genes, to identify which repetitive copy a particular read is sampled from.

We demonstrate that the proposed method offers significant performance gains on repetitive regions in simulated data. In particular, the algorithm is able to achieve near-perfect sensitivity on high-coverage SNPs, even when multiply mapped.
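
The intuition behind points (b) and (c) can be illustrated with a small sketch. It is not the abSNP implementation; it only shows, under simplifying assumptions, how base quality scores and copy abundances could be combined to weigh the candidate origins of a multiply mapped read. The function names, the independent-error model, the uniform substitution assumption, and the toy sequences are all hypothetical.

```python
def phred_to_error(q):
    """Convert a Phred quality score to a base error probability."""
    return 10 ** (-q / 10)


def read_likelihood(read, quals, ref):
    """P(read | read was sampled from ref).

    Assumes independent base errors and that an erroneous base is
    equally likely to be any of the three other nucleotides.
    """
    p = 1.0
    for base, q, ref_base in zip(read, quals, ref):
        e = phred_to_error(q)
        p *= (1 - e) if base == ref_base else e / 3
    return p


def origin_posterior(read, quals, candidates):
    """Posterior over candidate copies for one multiply mapped read.

    candidates maps a copy name to (reference_sequence, abundance).
    The abundance acts as the prior weight of each copy.
    """
    weights = {
        name: abundance * read_likelihood(read, quals, ref)
        for name, (ref, abundance) in candidates.items()
    }
    total = sum(weights.values())
    if total == 0:
        raise ValueError("all candidate copies have zero weight")
    return {name: w / total for name, w in weights.items()}


# Toy example: two identical copies with different (assumed) abundances.
copies = {"copyA": ("ACGT", 9.0), "copyB": ("ACGT", 1.0)}
post = origin_posterior("ACGT", [30, 30, 30, 30], copies)
print(post)
```

When the two copies are identical over the read, the sequence alone cannot separate them, and the posterior simply follows the abundance ratio. When the copies differ at a base, a high quality score at that position shifts the weight toward the matching copy.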

