Faster Pan-Genome Construction for Efficient Differentiation of Naturally Occurring and Engineered Plasmids with Plaster

Wang, Qi; Elworth, R. A. Leo; Liu, Tian Rui; Treangen, Todd J.

doi:10.4230/LIPIcs.WABI.2019.19


Document Identifiers

DOI: 10.4230/LIPIcs.WABI.2019.19
URN: urn:nbn:de:0030-drops-110492

Author Details

Qi Wang

Systems, Synthetic, and Physical Biology (SSPB) Graduate Program, Rice University, Houston, TX 77005, USA

R. A. Leo Elworth

Department of Computer Science, Rice University, Houston, TX 77005, USA

Tian Rui Liu

Department of Computer Science, Rice University, Houston, TX 77005, USA

Todd J. Treangen

Department of Computer Science, Rice University, Houston, TX 77005, USA

Acknowledgements

The authors would like to thank Dr. Caleb Bashor for critical discussion and feedback, and Dr. Joanne Kamens from Addgene for providing full access to the synthetic plasmids utilized in this study. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, ARO, or the US Government.

Cite As

Qi Wang, R. A. Leo Elworth, Tian Rui Liu, and Todd J. Treangen. Faster Pan-Genome Construction for Efficient Differentiation of Naturally Occurring and Engineered Plasmids with Plaster. In 19th International Workshop on Algorithms in Bioinformatics (WABI 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 143, pp. 19:1-19:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)
https://doi.org/10.4230/LIPIcs.WABI.2019.19

Abstract

As sequence databases grow, characterizing diversity across extremely large collections of genomes requires efficient methods that avoid costly all-vs-all comparisons [Marschall et al., 2018]. In addition to exponential increases in the number of natural genomes being sequenced, improved techniques for creating human-engineered sequences are ushering in a new wave of synthetic genome sequence databases that grow alongside naturally occurring genome databases. In this paper, we analyze the full diversity of available sequenced natural and synthetic plasmid genome sequences. This diversity can be represented by a pan-genome, a data structure that captures all presently available nucleotide sequences. In our case, we construct a single linear pan-genome nucleotide sequence that captures this diversity. To process such a large number of sequences, we introduce the plaster algorithmic pipeline. Using plaster, we construct the full synthetic plasmid pan-genome from 51,047 synthetic plasmid sequences as well as a natural pan-genome from 6,642 natural plasmid sequences. We demonstrate the efficacy of plaster by comparing its speed against another pan-genome construction method and by showing that nearly all plasmids align well to their corresponding pan-genome. Finally, we explore the use of pan-genome sequence alignment to distinguish between naturally occurring and synthetic plasmids. We believe this approach will lead to new techniques for rapid characterization of engineered plasmids. Applications of this work include detecting genome editing, tracking an unknown plasmid back to its lab of origin, and identifying naturally occurring sequences that may be of use to the synthetic biology community. The source code for plaster, and for fully reconstructing the natural and synthetic plasmid pan-genomes, is publicly available at https://gitlab.com/qiwangrice/plaster.git.
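
The idea of a linear pan-genome built without all-vs-all comparison can be illustrated with a toy sketch. The code below is not the plaster pipeline, whose details the abstract does not give. It assumes a simplified stand-in in which exact k-mer matching replaces sequence alignment, and the function names (`kmers_of`, `novel_segments`, `build_linear_pangenome`, `coverage`) and the parameter `k` are hypothetical. Each input sequence is compared only against what has been seen so far, and only its novel stretches are appended to the growing linear sequence.

```python
def kmers_of(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def novel_segments(seq, known, k):
    """Yield maximal runs of seq not covered by any k-mer in `known`."""
    covered = [False] * len(seq)
    for i in range(len(seq) - k + 1):
        if seq[i:i + k] in known:
            for j in range(i, i + k):
                covered[j] = True
    start = None
    for i, is_covered in enumerate(covered):
        if not is_covered and start is None:
            start = i                      # a novel run begins here
        elif is_covered and start is not None:
            yield seq[start:i]             # the novel run ended at i
            start = None
    if start is not None:
        yield seq[start:]


def build_linear_pangenome(seqs, k=4):
    """Greedily concatenate the novel segments of each sequence in order.

    Toy assumption: exact k-mer matches stand in for alignment.
    """
    pan_parts = []
    known = set()  # k-mers of every sequence processed so far
    for seq in seqs:
        pan_parts.extend(novel_segments(seq, known, k))
        known |= kmers_of(seq, k)
    return "".join(pan_parts)


def coverage(seq, pan, k=4):
    """Fraction of seq covered by k-mers that occur in the pan-genome."""
    if not seq:
        return 0.0
    novel = sum(len(s) for s in novel_segments(seq, kmers_of(pan, k), k))
    return 1 - novel / len(seq)


pan = build_linear_pangenome(["ACGTACGT", "ACGTTTTT"], k=4)
print(pan)                             # ACGTACGTTTTT
print(coverage("ACGTTTTT", pan, k=4))  # 1.0
print(coverage("GGGGGGGG", pan, k=4))  # 0.0
```

Under the same toy assumption, comparing a query's coverage against a natural and a synthetic pan-genome gives a crude analogue of the alignment-based differentiation described above. A real pipeline would use an aligner and tolerate mismatches.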

Subject Classification

ACM Subject Classification

Applied computing → Bioinformatics
Applied computing → Molecular sequence analysis
Applied computing → Computational genomics

Keywords

comparative genomics
sequence alignment
pan-genome
engineered plasmids

References

Jonathan E Allen, Shea N Gardner, and Tom R Slezak. DNA signatures for detecting genetic engineering in bacteria. Genome biology, 9(3):R56, 2008.
Lauren Brooks, Mo Kaze, and Mark Sistrom. A Curated, Comprehensive Database of Plasmid Sequences. Microbiol Resour Announc, 8(1):e01325-18, 2019.
Hans Bügl, John P Danner, Robert J Molinari, John T Mulligan, Han-Oh Park, Bas Reichert, David A Roth, Ralf Wagner, Bruce Budowle, Robert M Scripp, et al. DNA synthesis and biological security. Nature biotechnology, 25(6):627, 2007.
Jean Cury, Pedro H Oliveira, Fernando de la Cruz, and Eduardo PC Rocha. Host Range and Genetic Plasticity Explain the Coexistence of Integrative and Extrachromosomal Mobile Genetic Elements. Molecular biology and evolution, 35(9):2230-2239, 2018.
Robert C Edgar. Search and clustering orders of magnitude faster than BLAST. Bioinformatics, 26(19):2460-2461, 2010.
Mark Eppinger, Talima Pearson, Sara SK Koenig, Ofori Pearson, Nathan Hicks, Sonia Agrawal, Fatemeh Sanjar, Kevin Galens, Sean Daugherty, Jonathan Crabtree, et al. Genomic epidemiology of the Haitian cholera outbreak: a single introduction followed by rapid, extensive, and continued spread characterized the onset of the epidemic. MBio, 5(6):e01721-14, 2014.
Corinna Ernst and Sven Rahmann. PanCake: a data structure for pangenomes. In German Conference on Bioinformatics 2013. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2013.
Derrick E Fouts, Lauren Brinkac, Erin Beck, Jason Inman, and Granger Sutton. PanOCT: automated clustering of orthologs using conserved gene neighborhood for pan-genomic analysis of bacterial strains and closely related species. Nucleic acids research, 40(22):e172-e172, 2012.
Valentina Galata, Tobias Fehlmann, Christina Backes, and Andreas Keller. PLSDB: a resource of complete bacterial plasmids. Nucleic acids research, 47(D1):D195-D202, 2018.
Chris D Greenman, Erin D Pleasance, Scott Newman, Fengtang Yang, Beiyuan Fu, Serena Nik-Zainal, David Jones, King Wai Lau, Nigel Carter, Paul AW Edwards, et al. Estimation of rearrangement phylogeny for cancer genomes. Genome research, 22(2):346-361, 2012.
Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. A survey of methods for explaining black box models. ACM computing surveys (CSUR), 51(5):93, 2018.
Finbarr Hayes. The function and organization of plasmids. In E. coli Plasmid Vectors, pages 1-17. Springer, 2003.
Melanie Herscovitch, Eric Perkins, Andy Baltus, and Melina Fan. Addgene provides an open forum for plasmid sharing. Nature biotechnology, 30(4):316, 2012.
Christine Jandrasits, Piotr W Dabrowski, Stephan Fuchs, and Bernhard Y Renard. seq-seq-pan: Building a computational pan-genome data structure on whole genome alignment. BMC genomics, 19(1):47, 2018.
Wlodek Mandecki, Mark A Hayden, Mary Ann Shallcross, and Elizabeth Stotland. A totally synthetic plasmid for general cloning, gene expression and mutagenesis in Escherichia coli. Gene, 94(1):103-107, 1990.
Guillaume Marçais, Arthur L Delcher, Adam M Phillippy, Rachel Coston, Steven L Salzberg, and Aleksey Zimin. MUMmer4: a fast and versatile genome alignment system. PLoS computational biology, 14(1):e1005944, 2018.
Tobias Marschall, Manja Marz, Thomas Abeel, Louis Dijkstra, Bas E Dutilh, Ali Ghaffaari, Paul Kersey, Wigard P Kloosterman, Veli Mäkinen, Adam M Novak, et al. Computational pan-genomics: status, promises and challenges. Briefings in Bioinformatics, 19(1):118-135, 2018.
Elizabeth Anne McMillan, Sushim K Gupta, Laura Williams, Thomas Jové, Lari M Hiott, Tiffanie A Woodley, John B Barrett, Charlene Renee Jackson, Jamie L Waslienko, Mustafa Simmons, et al. Antimicrobial Resistance Genes, Cassettes, and Plasmids present in Salmonella enterica associated with US Food Animals. Frontiers in microbiology, 10:832, 2019.
Alec AK Nielsen and Christopher A Voigt. Deep learning to predict the lab-of-origin of engineered DNA. Nature communications, 9(1):3135, 2018.
Teresa Nogueira, Daniel J Rankin, Marie Touchon, François Taddei, Sam P Brown, and Eduardo PC Rocha. Horizontal gene transfer of the secretome drives the evolution of bacterial cooperation and virulence. Current Biology, 19(20):1683-1691, 2009.
Tuukka Norri, Bastien Cazaux, Dmitry Kosolobov, Veli Mäkinen, et al. Minimum Segmentation for Pan-genomic Founder Reconstruction in Linear Time. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Schloss Dagstuhl Leibniz Center for Informatics, 2018.
Ryan S Noyce, Seth Lederman, and David H Evans. Construction of an infectious horsepox virus vaccine from chemically synthesized DNA fragments. PloS one, 13(1):e0188453, 2018.
Nuala A O'Leary, Mathew W Wright, J Rodney Brister, Stacy Ciufo, Diana Haddad, Rich McVeigh, Bhanu Rajput, Barbara Robbertse, Brian Smith-White, Danso Ako-Adjei, et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic acids research, 44(D1):D733-D745, 2015.
Andrew J Page, Carla A Cummins, Martin Hunt, Vanessa K Wong, Sandra Reuter, Matthew TG Holden, Maria Fookes, Daniel Falush, Jacqueline A Keane, and Julian Parkhill. Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics, 31(22):3691-3693, 2015.
Adam M Phillippy, Michael C Schatz, and Mihai Pop. Genome assembly forensics: finding the elusive mis-assembly. Genome biology, 9(3):R55, 2008.
Torsten Seemann. Prokka: rapid prokaryotic genome annotation. Bioinformatics, 30(14):2068-2069, 2014.
Hervé Tettelin, Vega Masignani, Michael J Cieslewicz, Claudio Donati, Duccio Medini, Naomi L Ward, Samuel V Angiuoli, Jonathan Crabtree, Amanda L Jones, A Scott Durkin, et al. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial “pan-genome”. Proceedings of the National Academy of Sciences, 102(39):13950-13955, 2005.
Harry A Thorpe, Sion C Bayliss, Laurence D Hurst, and Edward J Feil. Comparative analyses of selection operating on nontranslated intergenic regions of diverse bacterial species. Genetics, 206(1):363-376, 2017.
Harry A Thorpe, Sion C Bayliss, Samuel K Sheppard, and Edward J Feil. Piggy: a rapid, large-scale pan-genome analysis tool for intergenic regions in bacteria. Gigascience, 7(4):giy015, 2018.
Esko Ukkonen. Finding founder sequences from a set of recombinants. In International Workshop on Algorithms in Bioinformatics, pages 277-286. Springer, 2002.
George Vernikos, Duccio Medini, David R Riley, and Herve Tettelin. Ten years of pan-genome analyses. Current opinion in microbiology, 23:148-154, 2015.
Barry L Wanner. Molecular cloning of Mu d (bla lacZ) transcriptional and translational fusions. Journal of bacteriology, 169(5):2026-2030, 1987.
Tom A Williams, Peter G Foster, Cymon J Cox, and T Martin Embley. An archaeal origin of eukaryotes supports only two primary domains of life. Nature, 504(7479):231, 2013.
Derrick E Wood, Henry Lin, Ami Levy-Moonshine, Rajiswari Swaminathan, Yi-Chien Chang, Brian P Anton, Lais Osmani, Martin Steffen, Simon Kasif, and Steven L Salzberg. Thousands of missed genes found in bacterial genomes and their analysis with COMBREX. Biology direct, 7(1):37, 2012.
Yongbing Zhao, Jiayan Wu, Junhui Yang, Shixiang Sun, Jingfa Xiao, and Jun Yu. PGAP: pan-genomes analysis pipeline. Bioinformatics, 28(3):416-418, 2011.