Summarizing Diverging String Sequences, with Applications to Chain-Letter Petitions

Commins, Patty; Liben-Nowell, David; Liu, Tina; Tomlinson, Kiran

doi:10.4230/LIPIcs.CPM.2020.11

Abstract

Algorithms to find optimal alignments among strings, or to find a parsimonious summary of a collection of strings, are well studied in a variety of contexts and address a wide range of applications. In this paper, we consider chain letters, which contain a growing sequence of signatories added as the letter propagates. The unusual constellation of features exhibited by chain letters (one-ended growth, divergence, and mutation) makes their propagation, and thus the corresponding reconstruction problem, both distinctive and rich. Inspired by these chain letters, we formally define the problem of computing an optimal summary of a set of diverging string sequences: given a collection of these sequences of names, each noisily corresponding to a branch of the unknown tree T representing the letter’s true dissemination, can we efficiently and accurately reconstruct a tree T' ≈ T? We give efficient exact algorithms for this summarization problem when the number of sequences is small; for larger sets of sequences, we prove hardness and provide an efficient heuristic algorithm. We evaluate this heuristic on synthetic data sets chosen to emulate real chain letters, showing that our algorithm is competitive with or better than previous approaches, and that it also comes close to finding the true trees in these synthetic data sets.
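To make the setting concrete, the following minimal sketch merges a few signatory lists into a prefix tree using exact name matching. It is only an illustration of the input and of why noise matters, not the exact or heuristic algorithms described in the abstract; the names `build_trie`, `count_nodes`, and the sample petitions are hypothetical.

```python
from typing import Dict, List

# Nested dict: each key is a signatory, each value the subtree of later signatories.
Tree = Dict[str, "Tree"]


def build_trie(sequences: List[List[str]]) -> Tree:
    """Merge signatory lists into a prefix tree (exact matching, no noise handling)."""
    root: Tree = {}
    for seq in sequences:
        node = root
        for name in seq:
            # Reuse the existing child for this name, or create a new branch.
            node = node.setdefault(name, {})
    return root


def count_nodes(tree: Tree) -> int:
    """Count all signatory nodes in the tree."""
    return sum(1 + count_nodes(child) for child in tree.values())


# Hypothetical, noise-free copies of a petition that diverge after shared prefixes.
petitions = [
    ["Ana", "Ben", "Cy"],
    ["Ana", "Ben", "Dee", "Eli"],
    ["Ana", "Fay"],
]
tree = build_trie(petitions)

# A single mutated name ("Benn" for "Ben") creates a spurious branch
# under exact matching, duplicating the signatories that follow it.
noisy_tree = build_trie(petitions + [["Ana", "Benn", "Cy"]])
```

With exact matching, the clean copies share their common prefixes, while the one mutated copy adds a spurious branch. This is the kind of noise a summarization method for diverging sequences has to absorb.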

