Faster Pan-Genome Construction for Efficient Differentiation of Naturally Occurring and Engineered Plasmids with Plaster

Authors: Qi Wang, R. A. Leo Elworth, Tian Rui Liu, Todd J. Treangen

Author Details

Qi Wang
  • Systems, Synthetic, and Physical Biology (SSPB) Graduate Program, Rice University, Houston, TX 77005, USA
R. A. Leo Elworth
  • Department of Computer Science, Rice University, Houston, TX 77005, USA
Tian Rui Liu
  • Department of Computer Science, Rice University, Houston, TX 77005, USA
Todd J. Treangen
  • Department of Computer Science, Rice University, Houston, TX 77005, USA


The authors would like to thank Dr. Caleb Bashor for critical discussion and feedback, and Dr. Joanne Kamens from Addgene for providing full access to the synthetic plasmids utilized in this study. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, ARO, or the US Government.

Cite As

Qi Wang, R. A. Leo Elworth, Tian Rui Liu, and Todd J. Treangen. Faster Pan-Genome Construction for Efficient Differentiation of Naturally Occurring and Engineered Plasmids with Plaster. In 19th International Workshop on Algorithms in Bioinformatics (WABI 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 143, pp. 19:1-19:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)


As sequence databases grow, characterizing diversity across extremely large collections of genomes requires efficient methods that avoid costly all-vs-all comparisons [Marschall et al., 2018]. In addition to exponential increases in the number of natural genomes being sequenced, improved techniques for creating human-engineered sequences are ushering in a new wave of synthetic genome sequence databases that grow alongside naturally occurring genome databases. In this paper, we analyze the full diversity of available sequenced natural and synthetic plasmid genome sequences. This diversity can be represented by a pan-genome, a data structure that captures all presently available nucleotide sequences. In our case, we construct a single linear pan-genome nucleotide sequence that captures this diversity. To process such a large number of sequences, we introduce the plaster algorithmic pipeline. Using plaster, we are able to construct the full synthetic plasmid pan-genome from 51,047 synthetic plasmid sequences as well as a natural pan-genome from 6,642 natural plasmid sequences. We demonstrate the efficacy of plaster by comparing its speed against another pan-genome construction method and by showing that nearly all plasmids align well to their corresponding pan-genome. Finally, we explore the use of pan-genome sequence alignment to distinguish between naturally occurring and synthetic plasmids. We believe this approach will lead to new techniques for rapid characterization of engineered plasmids. Applications for this work include detection of genome editing, tracking an unknown plasmid back to its lab of origin, and identifying naturally occurring sequences that may be of use to the synthetic biology community. The source code for fully reconstructing the natural and synthetic plasmid pan-genomes, as well as for plaster, is publicly available and can be downloaded at
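
The abstract does not spell out how plaster works internally, so the following is only a toy sketch of the general idea of building a linear pan-genome incrementally rather than through all-vs-all comparison. It assumes exact k-mer matching as a stand-in for real sequence alignment, and every name in it (`kmer_index`, `coverage_mask`, `novel_segments`, `build_pan_genome`, `covered_fraction`, the parameters `k` and `min_len`) is hypothetical and not taken from the plaster source code.

```python
from typing import Iterable, List, Set


def kmer_index(segments: Iterable[str], k: int) -> Set[str]:
    """Collect every k-mer that occurs in any of the given segments."""
    return {s[i:i + k] for s in segments for i in range(len(s) - k + 1)}


def coverage_mask(seq: str, index: Set[str], k: int) -> List[bool]:
    """Mark each base of seq that lies inside a k-mer already in the index."""
    mask = [False] * len(seq)
    for i in range(len(seq) - k + 1):
        if seq[i:i + k] in index:
            mask[i:i + k] = [True] * k
    return mask


def novel_segments(seq: str, index: Set[str], k: int) -> List[str]:
    """Return maximal runs of bases not covered by any indexed k-mer."""
    segments: List[str] = []
    start = None
    for pos, covered in enumerate(coverage_mask(seq, index, k)):
        if not covered and start is None:
            start = pos
        elif covered and start is not None:
            segments.append(seq[start:pos])
            start = None
    if start is not None:
        segments.append(seq[start:])
    return segments


def build_pan_genome(sequences: Iterable[str], k: int = 4,
                     min_len: int = 4) -> List[str]:
    """Greedy construction: each sequence is compared once against the
    growing pan-genome, and only its novel stretches are appended."""
    pan: List[str] = []
    index: Set[str] = set()
    for seq in sequences:
        for seg in novel_segments(seq, index, k):
            if len(seg) >= min_len:  # ignore very short leftovers
                pan.append(seg)
                index |= kmer_index([seg], k)
    return pan


def covered_fraction(seq: str, index: Set[str], k: int) -> float:
    """Fraction of bases in seq covered by k-mers of a pan-genome index."""
    if not seq:
        return 0.0
    return sum(coverage_mask(seq, index, k)) / len(seq)


# Toy usage: two made-up "plasmids" sharing a common prefix.
segments = build_pan_genome(["ACGTTGCA", "ACGTTGCAGGGGCCCC"])
linear_pan_genome = "".join(segments)
print(segments, linear_pan_genome)
```

In this simplification, each input is compared only against the pan-genome built so far, which is what avoids pairwise comparison of all inputs. A query could then be scored with `covered_fraction` against a natural index and a synthetic index, loosely mirroring the alignment-based differentiation the abstract describes; a real pipeline would use an aligner that tolerates mismatches and indels.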

Subject Classification

ACM Subject Classification
  • Applied computing → Bioinformatics
  • Applied computing → Molecular sequence analysis
  • Applied computing → Computational genomics

Keywords
  • comparative genomics
  • sequence alignment
  • pan-genome
  • engineered plasmids


References


  1. Jonathan E Allen, Shea N Gardner, and Tom R Slezak. DNA signatures for detecting genetic engineering in bacteria. Genome biology, 9(3):R56, 2008.
  2. Lauren Brooks, Mo Kaze, and Mark Sistrom. A Curated, Comprehensive Database of Plasmid Sequences. Microbiol Resour Announc, 8(1):e01325-18, 2019.
  3. Hans Bügl, John P Danner, Robert J Molinari, John T Mulligan, Han-Oh Park, Bas Reichert, David A Roth, Ralf Wagner, Bruce Budowle, Robert M Scripp, et al. DNA synthesis and biological security. Nature biotechnology, 25(6):627, 2007.
  4. Jean Cury, Pedro H Oliveira, Fernando de la Cruz, and Eduardo PC Rocha. Host Range and Genetic Plasticity Explain the Coexistence of Integrative and Extrachromosomal Mobile Genetic Elements. Molecular biology and evolution, 35(9):2230-2239, 2018.
  5. Robert C Edgar. Search and clustering orders of magnitude faster than BLAST. Bioinformatics, 26(19):2460-2461, 2010.
  6. Mark Eppinger, Talima Pearson, Sara SK Koenig, Ofori Pearson, Nathan Hicks, Sonia Agrawal, Fatemeh Sanjar, Kevin Galens, Sean Daugherty, Jonathan Crabtree, et al. Genomic epidemiology of the Haitian cholera outbreak: a single introduction followed by rapid, extensive, and continued spread characterized the onset of the epidemic. MBio, 5(6):e01721-14, 2014.
  7. Corinna Ernst and Sven Rahmann. PanCake: a data structure for pangenomes. In German Conference on Bioinformatics 2013. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2013.
  8. Derrick E Fouts, Lauren Brinkac, Erin Beck, Jason Inman, and Granger Sutton. PanOCT: automated clustering of orthologs using conserved gene neighborhood for pan-genomic analysis of bacterial strains and closely related species. Nucleic acids research, 40(22):e172-e172, 2012.
  9. Valentina Galata, Tobias Fehlmann, Christina Backes, and Andreas Keller. PLSDB: a resource of complete bacterial plasmids. Nucleic acids research, 47(D1):D195-D202, 2018.
  10. Chris D Greenman, Erin D Pleasance, Scott Newman, Fengtang Yang, Beiyuan Fu, Serena Nik-Zainal, David Jones, King Wai Lau, Nigel Carter, Paul AW Edwards, et al. Estimation of rearrangement phylogeny for cancer genomes. Genome research, 22(2):346-361, 2012.
  11. Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. A survey of methods for explaining black box models. ACM computing surveys (CSUR), 51(5):93, 2018.
  12. Finbarr Hayes. The function and organization of plasmids. In E. coli Plasmid Vectors, pages 1-17. Springer, 2003.
  13. Melanie Herscovitch, Eric Perkins, Andy Baltus, and Melina Fan. Addgene provides an open forum for plasmid sharing. Nature biotechnology, 30(4):316, 2012.
  14. Christine Jandrasits, Piotr W Dabrowski, Stephan Fuchs, and Bernhard Y Renard. seq-seq-pan: Building a computational pan-genome data structure on whole genome alignment. BMC genomics, 19(1):47, 2018.
  15. Wlodek Mandecki, Mark A Hayden, Mary Ann Shallcross, and Elizabeth Stotland. A totally synthetic plasmid for general cloning, gene expression and mutagenesis in Escherichia coli. Gene, 94(1):103-107, 1990.
  16. Guillaume Marçais, Arthur L Delcher, Adam M Phillippy, Rachel Coston, Steven L Salzberg, and Aleksey Zimin. MUMmer4: a fast and versatile genome alignment system. PLoS computational biology, 14(1):e1005944, 2018.
  17. Tobias Marschall, Manja Marz, Thomas Abeel, Louis Dijkstra, Bas E Dutilh, Ali Ghaffaari, Paul Kersey, Wigard P Kloosterman, Veli Mäkinen, Adam M Novak, et al. Computational pan-genomics: status, promises and challenges. Briefings in Bioinformatics, 19(1):118-135, 2018.
  18. Elizabeth Anne McMillan, Sushim K Gupta, Laura Williams, Thomas Jové, Lari M Hiott, Tiffanie A Woodley, John B Barrett, Charlene Renee Jackson, Jamie L Waslienko, Mustafa Simmons, et al. Antimicrobial Resistance Genes, Cassettes, and Plasmids present in Salmonella enterica associated with US Food Animals. Frontiers in microbiology, 10:832, 2019.
  19. Alec AK Nielsen and Christopher A Voigt. Deep learning to predict the lab-of-origin of engineered DNA. Nature communications, 9(1):3135, 2018.
  20. Teresa Nogueira, Daniel J Rankin, Marie Touchon, François Taddei, Sam P Brown, and Eduardo PC Rocha. Horizontal gene transfer of the secretome drives the evolution of bacterial cooperation and virulence. Current Biology, 19(20):1683-1691, 2009.
  21. Tuukka Norri, Bastien Cazaux, Dmitry Kosolobov, Veli Mäkinen, et al. Minimum Segmentation for Pan-genomic Founder Reconstruction in Linear Time. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Schloss Dagstuhl Leibniz Center for Informatics, 2018.
  22. Ryan S Noyce, Seth Lederman, and David H Evans. Construction of an infectious horsepox virus vaccine from chemically synthesized DNA fragments. PloS one, 13(1):e0188453, 2018.
  23. Nuala A O'Leary, Mathew W Wright, J Rodney Brister, Stacy Ciufo, Diana Haddad, Rich McVeigh, Bhanu Rajput, Barbara Robbertse, Brian Smith-White, Danso Ako-Adjei, et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic acids research, 44(D1):D733-D745, 2015.
  24. Andrew J Page, Carla A Cummins, Martin Hunt, Vanessa K Wong, Sandra Reuter, Matthew TG Holden, Maria Fookes, Daniel Falush, Jacqueline A Keane, and Julian Parkhill. Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics, 31(22):3691-3693, 2015.
  25. Adam M Phillippy, Michael C Schatz, and Mihai Pop. Genome assembly forensics: finding the elusive mis-assembly. Genome biology, 9(3):R55, 2008.
  26. Torsten Seemann. Prokka: rapid prokaryotic genome annotation. Bioinformatics, 30(14):2068-2069, 2014.
  27. Hervé Tettelin, Vega Masignani, Michael J Cieslewicz, Claudio Donati, Duccio Medini, Naomi L Ward, Samuel V Angiuoli, Jonathan Crabtree, Amanda L Jones, A Scott Durkin, et al. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial “pan-genome”. Proceedings of the National Academy of Sciences, 102(39):13950-13955, 2005.
  28. Harry A Thorpe, Sion C Bayliss, Laurence D Hurst, and Edward J Feil. Comparative analyses of selection operating on nontranslated intergenic regions of diverse bacterial species. Genetics, 206(1):363-376, 2017.
  29. Harry A Thorpe, Sion C Bayliss, Samuel K Sheppard, and Edward J Feil. Piggy: a rapid, large-scale pan-genome analysis tool for intergenic regions in bacteria. Gigascience, 7(4):giy015, 2018.
  30. Esko Ukkonen. Finding founder sequences from a set of recombinants. In International Workshop on Algorithms in Bioinformatics, pages 277-286. Springer, 2002.
  31. George Vernikos, Duccio Medini, David R Riley, and Herve Tettelin. Ten years of pan-genome analyses. Current opinion in microbiology, 23:148-154, 2015.
  32. Barry L Wanner. Molecular cloning of Mu d (bla lacZ) transcriptional and translational fusions. Journal of bacteriology, 169(5):2026-2030, 1987.
  33. Tom A Williams, Peter G Foster, Cymon J Cox, and T Martin Embley. An archaeal origin of eukaryotes supports only two primary domains of life. Nature, 504(7479):231, 2013.
  34. Derrick E Wood, Henry Lin, Ami Levy-Moonshine, Rajiswari Swaminathan, Yi-Chien Chang, Brian P Anton, Lais Osmani, Martin Steffen, Simon Kasif, and Steven L Salzberg. Thousands of missed genes found in bacterial genomes and their analysis with COMBREX. Biology direct, 7(1):37, 2012.
  35. Yongbing Zhao, Jiayan Wu, Junhui Yang, Shixiang Sun, Jingfa Xiao, and Jun Yu. PGAP: pan-genomes analysis pipeline. Bioinformatics, 28(3):416-418, 2011.