Summarizing Diverging String Sequences, with Applications to Chain-Letter Petitions

Authors Patty Commins, David Liben-Nowell, Tina Liu, Kiran Tomlinson




File

LIPIcs.CPM.2020.11.pdf
  • Filesize: 0.64 MB
  • 15 pages

Author Details

Patty Commins
  • Department of Computer Science, Carleton College, Northfield, MN, USA
  • Department of Mathematics, University of Minnesota, Minneapolis, MN, USA
David Liben-Nowell
  • Department of Computer Science, Carleton College, Northfield, MN, USA
Tina Liu
  • Department of Computer Science, Carleton College, Northfield, MN, USA
  • Surescripts, Minneapolis, MN, USA
Kiran Tomlinson
  • Department of Computer Science, Carleton College, Northfield, MN, USA
  • Department of Computer Science, Cornell University, Ithaca, NY, USA

Acknowledgements

We thank Jon Kleinberg for extensive discussions, and Anna Johnson, Hailey Jones, Dave Musicant, Layla Oesper, Anna Rafferty, and Ethan Somes for helpful discussions during preliminary or late stages of this project.

Cite As

Patty Commins, David Liben-Nowell, Tina Liu, and Kiran Tomlinson. Summarizing Diverging String Sequences, with Applications to Chain-Letter Petitions. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 11:1-11:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)
https://doi.org/10.4230/LIPIcs.CPM.2020.11

Abstract

Algorithms to find optimal alignments among strings, or to find a parsimonious summary of a collection of strings, are well studied in a variety of contexts, addressing a wide range of interesting applications. In this paper, we consider chain letters, which contain a growing sequence of signatories added as the letter propagates. The unusual constellation of features exhibited by chain letters (one-ended growth, divergence, and mutation) makes their propagation, and thus the corresponding reconstruction problem, both distinctive and rich. Here, inspired by these chain letters, we formally define the problem of computing an optimal summary of a set of diverging string sequences. From a collection of these sequences of names, with each sequence noisily corresponding to a branch of the unknown tree T representing the letter’s true dissemination, can we efficiently and accurately reconstruct a tree T' ≈ T? We give efficient exact algorithms for this summarization problem when the number of sequences is small; for larger sets of sequences, we prove hardness and provide an efficient heuristic algorithm. We evaluate this heuristic on synthetic datasets chosen to emulate real chain letters, showing that our algorithm is competitive with or better than previous approaches, and that it also comes close to finding the true trees in these synthetic datasets.
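To make the problem setting concrete, the following is a minimal sketch of a naive baseline and is not the algorithm from the paper. It merges signatory sequences into a prefix tree (trie) by exact name match, and it computes a unit-cost edit distance between two sequences of names. The function names, the example names, and the unit-cost model are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Unit-cost Levenshtein distance between two sequences of names."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x from a
                curr[j - 1] + 1,         # insert y into a
                prev[j - 1] + (x != y),  # match (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]


def build_trie(sequences: List[List[str]]) -> Dict[str, dict]:
    """Merge sequences that share an exact prefix into one nested-dict tree."""
    root: Dict[str, dict] = {}
    for seq in sequences:
        node = root
        for name in seq:
            node = node.setdefault(name, {})
    return root


def count_nodes(trie: Dict[str, dict]) -> int:
    """Number of name nodes in the trie (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in trie.values())


# Hypothetical observed copies of one letter; the last copy contains a
# mutated name ("Bbo" instead of "Bob").
letters = [
    ["Ann", "Bob", "Cara", "Dev"],
    ["Ann", "Bob", "Cara", "Eli"],
    ["Ann", "Bob", "Fay"],
    ["Ann", "Bbo", "Fay", "Gus"],
]

trie = build_trie(letters)
print(count_nodes(trie))                       # 9
print(edit_distance(letters[2], letters[3]))   # 2
```

In this toy input, the single mutated name makes exact-match merging create a separate branch below "Ann", so the trie has 9 nodes even though the last two sequences differ by only two edits. This gap between exact prefix merging and noisy observations is the kind of difficulty the summarization problem has to handle.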

Subject Classification

ACM Subject Classification
  • Mathematics of computing → Combinatorial algorithms
  • Applied computing → Law, social and behavioral sciences
Keywords
  • edit distance
  • tree reconstruction
  • information propagation
  • chain letters

