Seed-driven Learning of Position Probability Matrices from Large Sequence Sets

Authors: Jarkko Toivonen, Jussi Taipale, Esko Ukkonen

Cite As

Jarkko Toivonen, Jussi Taipale, and Esko Ukkonen. Seed-driven Learning of Position Probability Matrices from Large Sequence Sets. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 25:1-25:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)


Abstract

We formulate and analyze SeedHam, a novel seed-driven algorithm for learning position probability matrices (PPMs). To learn a PPM of length l, the algorithm takes the most frequent l-mer of the training data as a seed and then restricts learning to a small Hamming neighbourhood of that seed. SeedHam is intended for PPM learning from large sequence sets (up to hundreds of Mbases) that contain enriched motif instances. We also introduce a robust variant of the method that decreases contamination from artefact instances of the motif and thereby allows larger Hamming neighbourhoods to be used. To solve the motif orientation problem in two-stranded DNA, we introduce a novel seed-finding rule based on an analysis of the palindromic structure of sequences. We report test experiments that illustrate the relative strengths of the different variants of our methods and show that our algorithms are fast and give stable and accurate results.

Availability and implementation: A C++ implementation of the method is available from Contact:

Keywords
  • motif finding
  • transcription factor binding site
  • sequence analysis
  • Hamming distance
  • seed
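
The seed-and-neighbourhood idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' C++ implementation: the function names, the `radius` and `pseudocount` parameters, and the tie-breaking rule are assumptions made for illustration, and the sketch leaves out the robust variant, reverse-strand handling, and the palindrome-based seed rule. It only shows the basic step of picking the most frequent l-mer as a seed and estimating column-wise base frequencies from all l-mers within a given Hamming distance of it.

```python
from collections import Counter

ALPHABET = "ACGT"


def most_frequent_lmer(sequences, l):
    """Return the most frequent l-mer over all sequences (hypothetical helper).

    l-mers containing characters outside ACGT are skipped. Ties are broken
    lexicographically so the result is deterministic (an assumption of this
    sketch, not something stated in the source).
    """
    counts = Counter(
        seq[i:i + l]
        for seq in sequences
        for i in range(len(seq) - l + 1)
        if set(seq[i:i + l]) <= set(ALPHABET)
    )
    if not counts:
        raise ValueError("no valid l-mers of the requested length")
    return min(counts, key=lambda kmer: (-counts[kmer], kmer))


def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def learn_ppm(sequences, l, radius, pseudocount=0.0):
    """Estimate a PPM from l-mers within `radius` mismatches of the seed.

    Returns (seed, ppm), where ppm is a list of l dicts mapping each base
    to its relative frequency in that column.
    """
    seed = most_frequent_lmer(sequences, l)
    counts = [{base: pseudocount for base in ALPHABET} for _ in range(l)]
    for seq in sequences:
        for i in range(len(seq) - l + 1):
            site = seq[i:i + l]
            if set(site) <= set(ALPHABET) and hamming(site, seed) <= radius:
                for j, base in enumerate(site):
                    counts[j][base] += 1
    ppm = []
    for column in counts:
        total = sum(column.values())  # at least 1: the seed itself occurs
        ppm.append({base: column[base] / total for base in ALPHABET})
    return seed, ppm
```

Calling `learn_ppm(["ACGTACGT", "TTACGTAA", "GGACGAAC"], l=4, radius=1)` selects `ACGT` as the seed and builds the matrix from its three exact occurrences plus the single one-mismatch neighbour `ACGA`. A larger `radius` admits more sites but also more unrelated l-mers, which is the kind of contamination the robust variant in the abstract is said to address.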



