Detecting Transcriptomic Structural Variants in Heterogeneous Contexts via the Multiple Compatible Arrangements Problem

Authors Yutong Qiu, Cong Ma, Han Xie, Carl Kingsford


  • Filesize: 0.75 MB
  • 19 pages

Author Details

Yutong Qiu
  • Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
Cong Ma
  • Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
Han Xie
  • Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
Carl Kingsford
  • Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA


The results shown here are in part based upon data generated by the TCGA Research Network. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562. Specifically, it used the Bridges system, which is supported by NSF award number ACI-1445606, at the Pittsburgh Supercomputing Center (PSC) [Nystrom et al., 2015]. C. K. is co-founder of Ocean Genomics, Inc.

Cite As

Yutong Qiu, Cong Ma, Han Xie, and Carl Kingsford. Detecting Transcriptomic Structural Variants in Heterogeneous Contexts via the Multiple Compatible Arrangements Problem. In 19th International Workshop on Algorithms in Bioinformatics (WABI 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 143, pp. 18:1-18:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)


Transcriptomic structural variants (TSVs), large-scale transcriptome sequence changes due to structural variation, are common, especially in cancer. Detecting TSVs is a challenging computational problem. Sample heterogeneity (including differences between alleles in diploid organisms) is a critical confounding factor when identifying TSVs. To improve TSV detection in heterogeneous RNA-seq samples, we introduce the Multiple Compatible Arrangements Problem (MCAP), which seeks k genome rearrangements to maximize the number of reads that are concordant with at least one rearrangement. This directly models the situation of a heterogeneous or diploid sample. We prove that MCAP is NP-hard and provide a 1/4-approximation algorithm for k=1 and a 3/4-approximation algorithm for the diploid case (k=2) assuming an oracle for k=1. Combining these, we obtain a 3/16-approximation algorithm for MCAP when k=2 (without an oracle). We also present an integer linear programming formulation for general k. We characterize the graph structures that require k>1 to satisfy all edges and show that such structures are prevalent in cancer samples. We evaluate our algorithms on 381 TCGA samples and 2 cancer cell lines and show improved performance compared to the state-of-the-art TSV-calling tool, SQUID.
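The MCAP objective can be illustrated with a small brute-force sketch. This is not the approximation algorithm or the ILP from the paper; it simply enumerates every arrangement of a toy instance. The model here is a simplifying assumption made for illustration: each segment has a head ("h") and a tail ("t"), an arrangement is an ordering of segments with an orientation ("+" or "-") for each, and a weighted edge between two segment ends counts as concordant when the earlier segment contributes its right-facing end and the later segment its left-facing end. The function names and the toy data are hypothetical.

```python
from itertools import combinations, permutations, product


def all_arrangements(segments):
    """Yield every ordering of the segments with every choice of orientation."""
    for order in permutations(segments):
        for orients in product("+-", repeat=len(order)):
            yield tuple(zip(order, orients))


def is_concordant(edge, arrangement):
    """Assumed rule: the earlier segment must expose its right-facing end
    and the later segment its left-facing end. Assumes u != v."""
    (u, u_end), (v, v_end) = edge
    place = {seg: (i, o) for i, (seg, o) in enumerate(arrangement)}
    (iu, ou), (iv, ov) = place[u], place[v]
    if iu > iv:  # make (u_end, ou) describe the earlier segment
        (u_end, ou), (v_end, ov) = (v_end, ov), (u_end, ou)
    right_end = "t" if ou == "+" else "h"  # end facing right in this orientation
    left_end = "h" if ov == "+" else "t"   # end facing left in this orientation
    return u_end == right_end and v_end == left_end


def mcap_brute_force(segments, weighted_edges, k):
    """Pick k arrangements maximizing the total weight of edges that are
    concordant with at least one of them. Exponential; toy inputs only."""
    arrangements = list(all_arrangements(segments))
    covered = [
        frozenset(i for i, (e, _) in enumerate(weighted_edges) if is_concordant(e, a))
        for a in arrangements
    ]
    best_score, best_choice = -1, ()
    for choice in combinations(range(len(arrangements)), k):
        satisfied = frozenset().union(*(covered[i] for i in choice))
        score = sum(weighted_edges[i][1] for i in satisfied)
        if score > best_score:
            best_score, best_choice = score, choice
    return best_score, [arrangements[i] for i in best_choice]


# Hypothetical toy instance: the first two edges demand opposite orders of A and B,
# so no single arrangement can satisfy both.
segments = ["A", "B", "C"]
weighted_edges = [
    ((("A", "t"), ("B", "h")), 5),
    ((("B", "t"), ("A", "h")), 3),
    ((("B", "t"), ("C", "h")), 2),
]
score_k1, _ = mcap_brute_force(segments, weighted_edges, 1)
score_k2, _ = mcap_brute_force(segments, weighted_edges, 2)
```

In this toy instance a single arrangement can account for at most 7 of the 10 units of read weight, while two arrangements account for all 10, which is the kind of conflicting structure that motivates k>1.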

Subject Classification

ACM Subject Classification
  • Applied computing → Computational transcriptomics

Keywords
  • transcriptomic structural variation
  • integer linear programming
  • heterogeneity




References

  1. Dvir Aran, Marina Sirota, and Atul J Butte. Systematic pan-cancer analysis of tumour purity. Nature Communications, 6:8971, 2015.
  2. Ken Chen, John W Wallis, Michael D McLellan, David E Larson, Joelle M Kalicki, Craig S Pohl, Sean D McGrath, Michael C Wendl, Qunyuan Zhang, Devin P Locke, et al. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nature Methods, 6(9):677, 2009.
  3. Nadia M Davidson, Ian J Majewski, and Alicia Oshlack. JAFFA: High sensitivity transcriptome-focused fusion gene detection. Genome Medicine, 7(1):43, 2015.
  4. Michael WN Deininger, John M Goldman, and Junia V Melo. The molecular biology of chronic myeloid leukemia. Blood, 96(10):3343-3356, 2000.
  5. Jesse R Dixon, Jie Xu, Vishnu Dileep, Ye Zhan, Fan Song, et al. Integrative detection and analysis of structural variation in cancer genomes. Nature Genetics, 50(10):1388, 2018.
  6. Adi F Gazdar, Venkatesh Kurvari, Arvind Virmani, Lauren Gollahon, Masahiro Sakaguchi, et al. Characterization of paired tumor and non-tumor cell lines established from patients with breast cancer. International Journal of Cancer, 78(6):766-774, 1998.
  7. Aric Hagberg, Pieter Swart, and Daniel S Chult. Exploring network structure, dynamics, and function using NetworkX. Technical report, Los Alamos National Lab.(LANL), Los Alamos, NM (United States), 2008.
  8. Steffen Heber, Max Alekseyev, Sing-Hoi Sze, Haixu Tang, and Pavel A Pevzner. Splicing graphs and EST assembly problem. Bioinformatics, 18(suppl_1):S181-S188, 2002.
  9. Fereydoun Hormozdiari, Iman Hajirasouliha, Phuong Dao, Faraz Hach, Deniz Yorukoglu, Can Alkan, Evan E Eichler, and S Cenk Sahinalp. Next-generation VariationHunter: combinatorial algorithms for transposon insertion discovery. Bioinformatics, 26(12):i350-i357, 2010.
  10. Zhiqin Huang, David TW Jones, Yonghe Wu, Peter Lichter, and Marc Zapatka. confFuse: high-confidence fusion gene detection across tumor entities. Frontiers in Genetics, 8:137, 2017.
  11. Wenlong Jia, Kunlong Qiu, Minghui He, Pengfei Song, Quan Zhou, et al. SOAPfuse: an algorithm for identifying fusion transcripts from paired-end RNA-Seq data. Genome Biology, 14(2):R12, 2013.
  12. Ryan M Layer, Colby Chiang, Aaron R Quinlan, and Ira M Hall. LUMPY: a probabilistic framework for structural variant discovery. Genome Biology, 15(6):R84, 2014.
  13. Silvia Liu, Wei-Hsiang Tsai, Ying Ding, Rui Chen, Zhou Fang, et al. Comprehensive evaluation of fusion transcript detection algorithms and a meta-caller to combine top performing methods in paired-end RNA-seq data. Nucleic Acids Research, 44(5):e47-e47, 2015.
  14. Cong Ma, Mingfu Shao, and Carl Kingsford. SQUID: transcriptomic structural variation detection from RNA-seq. Genome Biology, 19(1):52, 2018.
  15. Andrew McPherson, Fereydoun Hormozdiari, Abdalnasser Zayed, Ryan Giuliany, Gavin Ha, et al. deFuse: an algorithm for gene fusion discovery in tumor RNA-Seq data. PLoS Computational Biology, 7(5):e1001138, 2011.
  16. Daniel Nicorici, Mihaela Satalan, Henrik Edgren, Sara Kangaspeska, Astrid Murumagi, Olli Kallioniemi, Sami Virtanen, and Olavi Kilkku. FusionCatcher - a tool for finding somatic fusion genes in paired-end RNA-sequencing data. BioRxiv, page 011650, 2014.
  17. Nicholas A Nystrom, Michael J Levine, Ralph Z Roskies, and J Scott. Bridges: a uniquely flexible HPC resource for new communities and data analytics. In Proceedings of the 2015 XSEDE Conference: Scientific Advancements Enabled by Enhanced Cyberinfrastructure, page 30. ACM, 2015.
  18. Tobias Rausch, Thomas Zichner, Andreas Schlattl, Adrian M Stütz, Vladimir Benes, and Jan O Korbel. DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics, 28(18):i333-i339, 2012.
  19. Robert Sedgewick. Algorithms in C, Part 5: Graph Algorithms, Third Edition. Addison-Wesley Professional, third edition, 2001.
  20. Fritz J Sedlazeck, Philipp Rescheneder, Moritz Smolka, Han Fang, Maria Nattestad, Arndt von Haeseler, and Michael C Schatz. Accurate detection of complex structural variations using single-molecule sequencing. Nature Methods, 15(6):461-468, 2018.
  21. Wandaliz Torres-García, Siyuan Zheng, Andrey Sivachenko, Rahulsimham Vegesna, Qianghu Wang, Rong Yao, Michael F Berger, John N Weinstein, Gad Getz, and Roel GW Verhaak. PRADA: pipeline for RNA sequencing data analysis. Bioinformatics, 30(15):2224-2226, 2014.
  22. Xiaoke Wang, Renata Q Zamolyi, Hongying Zhang, Vera L Pannain, Fabiola Medeiros, Michele Erickson-Johnson, Robert B Jenkins, and Andre M Oliveira. Fusion of HMGA1 to the LPP/TPRG1 intergenic region in a lipoma identified by mapping paraffin-embedded tissues. Cancer Genetics and Cytogenetics, 196(1):64-67, 2010.