BISER: Fast Characterization of Segmental Duplication Structure in Multiple Genome Assemblies

Authors: Hamza Išerić, Can Alkan, Faraz Hach, Ibrahim Numanagić




File

LIPIcs.WABI.2021.15.pdf
  • Filesize: 0.94 MB
  • 18 pages

Author Details

Hamza Išerić
  • Department of Computer Science, University of Victoria, Canada
Can Alkan
  • Department of Computer Engineering, Bilkent University, Ankara, Turkey
Faraz Hach
  • Vancouver Prostate Centre, Canada
Ibrahim Numanagić
  • Department of Computer Science, University of Victoria, Canada

Acknowledgements

We thank Haris Smajlović for his invaluable comments and suggestions during the manuscript preparation.

Cite As

Hamza Išerić, Can Alkan, Faraz Hach, and Ibrahim Numanagić. BISER: Fast Characterization of Segmental Duplication Structure in Multiple Genome Assemblies. In 21st International Workshop on Algorithms in Bioinformatics (WABI 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 201, pp. 15:1-15:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)
https://doi.org/10.4230/LIPIcs.WABI.2021.15

Abstract

The increasing availability of high-quality genome assemblies has raised interest in the characterization of genomic architecture. Major architectural parts, such as common repeats and segmental duplications (SDs), increase genome plasticity, which stimulates further evolution by changing the genomic structure. However, optimal computation of SDs through standard local alignment algorithms is impractical due to the size of most genomes. A cross-genome evolutionary analysis of SDs is even harder, as one needs to characterize SDs in multiple genomes and find relations between those SDs and unique segments in other genomes. Thus, there is a need for fast and accurate algorithms that characterize SD structure in multiple genome assemblies, to better understand the evolutionary forces that shaped the genomes of today. Here we introduce a new tool, BISER, to quickly detect SDs in multiple genomes and identify elementary SDs and core duplicons that drive the formation of such SDs. BISER improves on earlier tools by (i) scaling the detection of SDs with low homology (75%) to multiple genomes while introducing further 8-24x speed-ups over the existing tools, and by (ii) characterizing elementary SDs and detecting core duplicons to help trace the evolutionary history of duplications as far back as 90 million years.
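
The claim that exhaustive local alignment is impractical at genome scale can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative only and is not part of BISER: it assumes a quadratic dynamic-programming table (as in Smith-Waterman style alignment), a genome of roughly 3.1 billion bases, and a hypothetical throughput of 10^10 table cells per second. The names `dp_cells`, `estimated_years`, `GENOME_LENGTH`, and `ASSUMED_RATE` are introduced here for illustration.

```python
# Rough cost of aligning a genome against itself with a quadratic
# dynamic-programming table. All constants below are assumptions.

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def dp_cells(n: int, m: int) -> int:
    """Number of cells in an n-by-m dynamic-programming table."""
    return n * m


def estimated_years(cells: int, cells_per_second: float) -> float:
    """Wall-clock years needed to fill `cells` cells at a fixed rate."""
    return cells / cells_per_second / SECONDS_PER_YEAR


GENOME_LENGTH = 3_100_000_000  # assumed: a roughly human-sized genome
ASSUMED_RATE = 1e10            # assumed: table cells filled per second

cells = dp_cells(GENOME_LENGTH, GENOME_LENGTH)
years = estimated_years(cells, ASSUMED_RATE)
print(f"{cells:.2e} cells, about {years:.1f} years at {ASSUMED_RATE:.0e} cells/s")
```

Under these assumptions a single genome already needs decades of compute, and the cost grows further when several genomes are compared with each other, which is why faster approximate methods are needed.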

Subject Classification

ACM Subject Classification
  • Applied computing → Bioinformatics
Keywords
  • genome analysis
  • fast alignment
  • segmental duplications
  • core duplicons
  • sequence decomposition
