Detecting Transcriptomic Structural Variants in Heterogeneous Contexts via the Multiple Compatible Arrangements Problem

Qiu, Yutong; Ma, Cong; Xie, Han; Kingsford, Carl

doi:10.4230/LIPIcs.WABI.2019.18

File

LIPIcs.WABI.2019.18.pdf

Filesize: 0.75 MB
19 pages

Document Identifiers

DOI: 10.4230/LIPIcs.WABI.2019.18
URN: urn:nbn:de:0030-drops-110483

Author Details

Yutong Qiu

Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA

Cong Ma

Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA

Han Xie

Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA

Carl Kingsford

Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA

Acknowledgements

The results shown here are in part based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562. Specifically, it used the Bridges system, which is supported by NSF award number ACI-1445606, at the Pittsburgh Supercomputing Center (PSC) [Nystrom et al., 2015]. C. K. is co-founder of Ocean Genomics, Inc.

Cite As

Yutong Qiu, Cong Ma, Han Xie, and Carl Kingsford. Detecting Transcriptomic Structural Variants in Heterogeneous Contexts via the Multiple Compatible Arrangements Problem. In 19th International Workshop on Algorithms in Bioinformatics (WABI 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 143, pp. 18:1-18:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)
https://doi.org/10.4230/LIPIcs.WABI.2019.18

Abstract

Transcriptomic structural variants (TSVs), large-scale transcriptome sequence changes due to structural variation, are common, especially in cancer. Detecting TSVs is a challenging computational problem. Sample heterogeneity (including differences between alleles in diploid organisms) is a critical confounding factor when identifying TSVs. To improve TSV detection in heterogeneous RNA-seq samples, we introduce the Multiple Compatible Arrangements Problem (MCAP), which seeks k genome rearrangements to maximize the number of reads that are concordant with at least one rearrangement. This directly models the situation of a heterogeneous or diploid sample. We prove that MCAP is NP-hard and provide a 1/4-approximation algorithm for k=1 and a 3/4-approximation algorithm for the diploid case (k=2) assuming an oracle for k=1. Combining these, we obtain a 3/16-approximation algorithm for MCAP when k=2 (without an oracle). We also present an integer linear programming formulation for general k. We characterize the graph structures that require k>1 to satisfy all edges and show that such structures are prevalent in cancer samples. We evaluate our algorithms on 381 TCGA samples and 2 cancer cell lines and show improved performance compared to the state-of-the-art TSV-calling tool, SQUID.
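
To make the MCAP objective concrete, the following is a minimal brute-force sketch on a toy instance. It is not the authors' approximation algorithm or ILP. The data encoding, the function names (`concordant`, `all_arrangements`, `best_weight`), and the simplified concordance rule (the earlier segment must expose its trailing end and the later segment its leading end) are assumptions made for illustration only; the paper gives the precise definitions.

```python
from itertools import combinations, permutations, product

# Hypothetical toy encoding (an assumption, not the paper's data structures):
#   arrangement: tuple of (segment, orientation), orientation is +1 or -1
#   edge: (seg_u, end_u, seg_v, end_v, weight), ends are "h" (head) or "t" (tail),
#         weight stands in for the number of supporting reads; seg_u != seg_v


def concordant(edge, arrangement):
    """Simplified concordance rule (assumed): the two connected ends face each other."""
    u, end_u, v, end_v, _ = edge
    pos = {seg: i for i, (seg, _) in enumerate(arrangement)}
    ori = dict(arrangement)
    if pos[u] > pos[v]:  # make u the segment that appears first
        u, end_u, v, end_v = v, end_v, u, end_u
    trailing = "t" if ori[u] == 1 else "h"  # end of u that faces later segments
    leading = "h" if ori[v] == 1 else "t"   # end of v that faces earlier segments
    return end_u == trailing and end_v == leading


def all_arrangements(segments):
    """Enumerate every ordering and orientation of the segments (n! * 2^n)."""
    for order in permutations(segments):
        for signs in product((1, -1), repeat=len(order)):
            yield tuple(zip(order, signs))


def best_weight(edges, segments, k):
    """Exhaustively find the maximum edge weight concordant with >= 1 of k arrangements."""
    arrangements = list(all_arrangements(segments))
    best = 0
    for chosen in combinations(arrangements, k):
        total = sum(e[4] for e in edges
                    if any(concordant(e, a) for a in chosen))
        best = max(best, total)
    return best


# Two edges that no single arrangement satisfies at once under the assumed rule.
segments = ["a", "b"]
edges = [("a", "t", "b", "h", 5), ("b", "t", "a", "h", 3)]
print(best_weight(edges, segments, 1))  # one arrangement
print(best_weight(edges, segments, 2))  # two arrangements (diploid case)
```

Under the assumed rule, the two toy edges conflict in any single arrangement, so allowing a second arrangement recovers the weight of both. The exhaustive search is exponential in the number of segments and is meant only to illustrate the objective on tiny inputs.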

Subject Classification

ACM Subject Classification

Applied computing → Computational transcriptomics

Keywords

transcriptomic structural variation
integer linear programming
heterogeneity

References

Dvir Aran, Marina Sirota, and Atul J Butte. Systematic pan-cancer analysis of tumour purity. Nature Communications, 6:8971, 2015.
Ken Chen, John W Wallis, Michael D McLellan, David E Larson, Joelle M Kalicki, Craig S Pohl, Sean D McGrath, Michael C Wendl, Qunyuan Zhang, Devin P Locke, et al. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nature Methods, 6(9):677, 2009.
Nadia M Davidson, Ian J Majewski, and Alicia Oshlack. JAFFA: High sensitivity transcriptome-focused fusion gene detection. Genome Medicine, 7(1):43, 2015.
Michael WN Deininger, John M Goldman, and Junia V Melo. The molecular biology of chronic myeloid leukemia. Blood, 96(10):3343-3356, 2000.
Jesse R Dixon, Jie Xu, Vishnu Dileep, Ye Zhan, Fan Song, et al. Integrative detection and analysis of structural variation in cancer genomes. Nature Genetics, 50(10):1388, 2018.
Adi F Gazdar, Venkatesh Kurvari, Arvind Virmani, Lauren Gollahon, Masahiro Sakaguchi, et al. Characterization of paired tumor and non-tumor cell lines established from patients with breast cancer. International Journal of Cancer, 78(6):766-774, 1998.
Aric Hagberg, Pieter Swart, and Daniel S Chult. Exploring network structure, dynamics, and function using NetworkX. Technical report, Los Alamos National Lab.(LANL), Los Alamos, NM (United States), 2008.
Steffen Heber, Max Alekseyev, Sing-Hoi Sze, Haixu Tang, and Pavel A Pevzner. Splicing graphs and EST assembly problem. Bioinformatics, 18(suppl_1):S181-S188, 2002.
Fereydoun Hormozdiari, Iman Hajirasouliha, Phuong Dao, Faraz Hach, Deniz Yorukoglu, Can Alkan, Evan E Eichler, and S Cenk Sahinalp. Next-generation VariationHunter: combinatorial algorithms for transposon insertion discovery. Bioinformatics, 26(12):i350-i357, 2010.
Zhiqin Huang, David TW Jones, Yonghe Wu, Peter Lichter, and Marc Zapatka. confFuse: high-confidence fusion gene detection across tumor entities. Frontiers in Genetics, 8:137, 2017.
Wenlong Jia, Kunlong Qiu, Minghui He, Pengfei Song, Quan Zhou, et al. SOAPfuse: an algorithm for identifying fusion transcripts from paired-end RNA-Seq data. Genome Biology, 14(2):R12, 2013.
Ryan M Layer, Colby Chiang, Aaron R Quinlan, and Ira M Hall. LUMPY: a probabilistic framework for structural variant discovery. Genome Biology, 15(6):R84, 2014.
Silvia Liu, Wei-Hsiang Tsai, Ying Ding, Rui Chen, Zhou Fang, et al. Comprehensive evaluation of fusion transcript detection algorithms and a meta-caller to combine top performing methods in paired-end RNA-seq data. Nucleic Acids Research, 44(5):e47-e47, 2015.
Cong Ma, Mingfu Shao, and Carl Kingsford. SQUID: transcriptomic structural variation detection from RNA-seq. Genome Biology, 19(1):52, 2018.
Andrew McPherson, Fereydoun Hormozdiari, Abdalnasser Zayed, Ryan Giuliany, Gavin Ha, et al. deFuse: an algorithm for gene fusion discovery in tumor RNA-Seq data. PLoS Computational Biology, 7(5):e1001138, 2011.
Daniel Nicorici, Mihaela Satalan, Henrik Edgren, Sara Kangaspeska, Astrid Murumagi, Olli Kallioniemi, Sami Virtanen, and Olavi Kilkku. FusionCatcher: a tool for finding somatic fusion genes in paired-end RNA-sequencing data. bioRxiv, page 011650, 2014.
Nicholas A Nystrom, Michael J Levine, Ralph Z Roskies, and J Scott. Bridges: a uniquely flexible HPC resource for new communities and data analytics. In Proceedings of the 2015 XSEDE Conference: Scientific Advancements Enabled by Enhanced Cyberinfrastructure, page 30. ACM, 2015.
Tobias Rausch, Thomas Zichner, Andreas Schlattl, Adrian M Stütz, Vladimir Benes, and Jan O Korbel. DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics, 28(18):i333-i339, 2012.
Robert Sedgewick. Algorithms in C, Part 5: Graph Algorithms. Addison-Wesley Professional, third edition, 2001.
Fritz J Sedlazeck, Philipp Rescheneder, Moritz Smolka, Han Fang, Maria Nattestad, Arndt von Haeseler, and Michael C Schatz. Accurate detection of complex structural variations using single-molecule sequencing. Nature Methods, 15(6):461-468, 2018.
Wandaliz Torres-García, Siyuan Zheng, Andrey Sivachenko, Rahulsimham Vegesna, Qianghu Wang, Rong Yao, Michael F Berger, John N Weinstein, Gad Getz, and Roel GW Verhaak. PRADA: pipeline for RNA sequencing data analysis. Bioinformatics, 30(15):2224-2226, 2014.
Xiaoke Wang, Renata Q Zamolyi, Hongying Zhang, Vera L Pannain, Fabiola Medeiros, Michele Erickson-Johnson, Robert B Jenkins, and Andre M Oliveira. Fusion of HMGA1 to the LPP/TPRG1 intergenic region in a lipoma identified by mapping paraffin-embedded tissues. Cancer Genetics and Cytogenetics, 196(1):64-67, 2010.