Grammars for Document Spanners

Author Liat Peterfreund



PDF
Thumbnail PDF

File

LIPIcs.ICDT.2021.7.pdf
  • Filesize: 0.82 MB
  • 18 pages

Document Identifiers

Author Details

Liat Peterfreund
  • DI ENS, ENS, CNRS, PSL University, Paris, France
  • Inria, Paris, France

Acknowledgements

I would like to thank the anonymous reviewers for their extremely valuable comments, and in particular those that enabled me to extend the enumeration algorithm to a broader class of extraction grammars. I am grateful to Arnaud Durand, Michael Kaminski, Benny Kimelfeld, and Leonid Libkin for useful discussions, and to Dominik D. Freydenberger for references.

Cite As Get BibTex

Liat Peterfreund. Grammars for Document Spanners. In 24th International Conference on Database Theory (ICDT 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 186, pp. 7:1-7:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021) https://doi.org/10.4230/LIPIcs.ICDT.2021.7

Abstract

We propose a new grammar-based language for defining information-extractors from documents (text) that is built upon the well-studied framework of document spanners for extracting structured data from text. While previously studied formalisms for document spanners are mainly based on regular expressions, we use an extension of context-free grammars, called {extraction grammars}, to define the new class of context-free spanners. Extraction grammars are simply context-free grammars extended with variables that capture interval positions of the document, namely spans. While regular expressions are efficient for tokenizing and tagging, context-free grammars are also efficient for capturing structural properties. Indeed, we show that context-free spanners are strictly more expressive than their regular counterparts. We reason about the expressive power of our new class and present a pushdown-automata model that captures it. We show that extraction grammars can be evaluated with polynomial data complexity. Nevertheless, as the degree of the polynomial depends on the query, we present an enumeration algorithm for unambiguous extraction grammars that, after quintic preprocessing, outputs the results sequentially, without repetitions, with a constant delay between every two consecutive ones.

Subject Classification

ACM Subject Classification
  • Information systems → Information extraction
  • Information systems → Relational database model
  • Information systems → Data model extensions
Keywords
  • Information Extraction
  • Document Spanners
  • Context-Free Grammars
  • Constant-Delay Enumeration
  • Regular Expressions
  • Pushdown Automata

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Jitendra Ajmera, Hyung-Il Ahn, Meena Nagarajan, Ashish Verma, Danish Contractor, Stephen Dill, and Matthew Denesuk. A CRM system for social media: challenges and experiences. In WWW, pages 49-58. ACM, 2013. Google Scholar
  2. Antoine Amarilli, Pierre Bourhis, Stefan Mengel, and Matthias Niewerth. Constant-delay enumeration for nondeterministic document spanners. In ICDT, pages 22:1-22:19, 2019. Google Scholar
  3. Antoine Amarilli, Pierre Bourhis, Stefan Mengel, and Matthias Niewerth. Enumeration on trees with tractable combined complexity and efficient updates. In PODS, pages 89-103, 2019. Google Scholar
  4. Edward Benson, Aria Haghighi, and Regina Barzilay. Event discovery in social media feeds. In ACL, pages 389-398. The Association for Computer Linguistics, 2011. Google Scholar
  5. J. Richard Büchi and Lawrence H. Landweber. Definability in the monadic second-order theory of successor. J. Symb. Log., 34(2):166-170, 1969. Google Scholar
  6. Laura Chiticariu, Rajasekar Krishnamurthy, Yunyao Li, Sriram Raghavan, Frederick Reiss, and Shivakumar Vaithyanathan. SystemT: An algebraic approach to declarative information extraction. In ACL, pages 128-137, 2010. Google Scholar
  7. Johannes Doleschal, Benny Kimelfeld, Wim Martens, Yoav Nahshon, and Frank Neven. Split-correctness in information extraction. In PODS, pages 149-163, 2019. Google Scholar
  8. Johannes Doleschal, Benny Kimelfeld, Wim Martens, and Liat Peterfreund. Weight annotation in information extraction. In ICDT, volume 155 of LIPIcs, pages 8:1-8:18. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2020. Google Scholar
  9. Pál Dömösi. Unusual algorithms for lexicographical enumeration. Acta Cybernetica, 14(3):461-468, 2000. Google Scholar
  10. Jay Earley. An efficient context-free parsing algorithm. Communications of the ACM, 13(2):94-102, 1970. Google Scholar
  11. Ronald Fagin, Benny Kimelfeld, Frederick Reiss, and Stijn Vansummeren. Document spanners: A formal approach to information extraction. J. ACM, 62(2):12, 2015. Google Scholar
  12. Ronald Fagin, Benny Kimelfeld, Frederick Reiss, and Stijn Vansummeren. Declarative cleaning of inconsistencies in information extraction. ACM Trans. Database Syst., 41(1):6:1-6:44, 2016. Google Scholar
  13. Fernando Florenzano, Cristian Riveros, Martín Ugarte, Stijn Vansummeren, and Domagoj Vrgoc. Constant delay algorithms for regular document spanners. In PODS, pages 165-177. ACM, 2018. Google Scholar
  14. Dominik D. Freydenberger. A logic for document spanners. In ICDT, volume 68 of LIPIcs, pages 13:1-13:18. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2017. Google Scholar
  15. Dominik D. Freydenberger. A logic for document spanners. Theory Comput. Syst., 63(7):1679-1754, 2019. Google Scholar
  16. Dominik D. Freydenberger and Mario Holldack. Document spanners: From expressive power to decision problems. In ICDT, volume 48 of LIPIcs, pages 17:1-17:17. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2016. Google Scholar
  17. Dominik D. Freydenberger and Mario Holldack. Document spanners: From expressive power to decision problems. Theory Comput. Syst., 62(4):854-898, 2018. Google Scholar
  18. Dominik D. Freydenberger, Benny Kimelfeld, and Liat Peterfreund. Joining extractions of regular expressions. In PODS, pages 137-149, 2018. Google Scholar
  19. Dominik D. Freydenberger and Liat Peterfreund. Finite models and the theory of concatenation. CoRR, abs/1912.06110, 2019. URL: http://arxiv.org/abs/1912.06110.
  20. Dominik D. Freydenberger and Sam M. Thompson. Dynamic complexity of document spanners. In ICDT, volume 155, pages 11:1-11:21, 2020. Google Scholar
  21. John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman. Introduction to automata theory, languages, and computation - international edition (2. ed). Addison-Wesley, 2003. Google Scholar
  22. John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman. Introduction to automata theory, languages, and computation, 3rd Edition. Pearson international edition. Addison-Wesley, 2007. Google Scholar
  23. Donald E. Knuth. Semantics of context-free languages. Mathematical Systems Theory, 2(2):127-145, 1968. Google Scholar
  24. Donald E. Knuth. Correction: Semantics of context-free languages. Mathematical Systems Theory, 5(1):95-96, 1971. Google Scholar
  25. Clemens Lautemann, Thomas Schwentick, and Denis Thérien. Logics for context-free languages. In CSL, volume 933 of Lecture Notes in Computer Science, pages 205-216. Springer, 1994. Google Scholar
  26. Yaoyong Li, Kalina Bontcheva, and Hamish Cunningham. SVM based learning system for information extraction. In Deterministic and Statistical Methods in Machine Learning, First International Workshop, Sheffield, UK, September 7-10, 2004, Revised Lectures, pages 319-339, 2004. Google Scholar
  27. Yunyao Li, Frederick Reiss, and Laura Chiticariu. SystemT: A declarative information extraction system. In ACL, pages 109-114. ACL, 2011. Google Scholar
  28. Erkki Mäkinen. On lexicographic enumeration of regular and context-free languages. Acta Cybernetica, 13(1):55-61, 1997. Google Scholar
  29. Francisco Maturana, Cristian Riveros, and Domagoj Vrgoč. Document spanners for extracting incomplete information: Expressiveness and complexity. In PODS, pages 125-136, 2018. Google Scholar
  30. Andrea Moro, Marco Tettamanti, Daniela Perani, Caterina Donati, Stefano F Cappa, and Ferruccio Fazio. Syntax and the brain: disentangling grammar by selective anomalies. Neuroimage, 13(1):110-118, 2001. Google Scholar
  31. Takashi Nagashima. A formal deductive system for CFG. Hitotsubashi journal of arts and sciences, 28(1):39-43, 1987. Google Scholar
  32. Yoav Nahshon, Liat Peterfreund, and Stijn Vansummeren. Incorporating information extraction in the relational database model. In WebDB, page 6. ACM, 2016. Google Scholar
  33. Liat Peterfreund, Dominik D. Freydenberger, Benny Kimelfeld, and Markus Kröll. Complexity bounds for relational algebra over document spanners. In PODS, pages 320-334. ACM, 2019. Google Scholar
  34. Liat Peterfreund, Balder ten Cate, Ronald Fagin, and Benny Kimelfeld. Recursive programs for document spanners. In ICDT, volume 127 of LIPIcs, pages 13:1-13:18. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2019. Google Scholar
  35. Christopher De Sa, Alexander Ratner, Christopher Ré, Jaeho Shin, Feiran Wang, Sen Wu, and Ce Zhang. Deepdive: Declarative knowledge base construction. SIGMOD Record, 45(1):60-67, 2016. Google Scholar
  36. Markus L. Schmid. Characterising REGEX languages by regular languages equipped with factor-referencing. Inf. Comput., 249:1-17, 2016. Google Scholar
  37. Rania A Abul Seoud, Abou-Bakr M Youssef, and Yasser M Kadah. Extraction of protein interaction information from unstructured text using a link grammar parser. In 2007 International Conference on Computer Engineering & Systems, pages 70-75. IEEE, 2007. Google Scholar
  38. Warren Shen, AnHai Doan, Jeffrey F. Naughton, and Raghu Ramakrishnan. Declarative information extraction using Datalog with embedded extraction predicates. In VLDB, pages 1033-1044, 2007. Google Scholar
  39. Jason M Smith and David Stotts. SPQR: Flexible automated design pattern extraction from source code. In 18th IEEE International Conference on Automated Software Engineering, pages 215-224. IEEE, 2003. Google Scholar
  40. Leslie G. Valiant. General context-free recognition in less than cubic time. J. Comput. Syst. Sci., 10(2):308-315, 1975. URL: https://doi.org/10.1016/S0022-0000(75)80046-8.
  41. Huang Wen-Ji. Enumerating sentences of context free language based on first one in order. Journal of Computer Research and Development, 41(1):9-14, 2004. Google Scholar
  42. Virginia Vassilevska Williams. Multiplying matrices faster than coppersmith-winograd. In STOC, pages 887-898. ACM, 2012. Google Scholar
  43. Hua Xu, Shane P. Stenner, Son Doan, Kevin B. Johnson, Lemuel R. Waitman, and Joshua C. Denny. MedEx: a medication information extraction system for clinical narratives. JAMIA, 17(1):19-24, 2010. URL: https://doi.org/10.1197/jamia.M3378.
  44. Akane Yakushiji, Yuka Tateisi, Yusuke Miyao, and Jun-ichi Tsujii. Event extraction from biomedical papers using a full parser. In Biocomputing 2001, pages 408-419. World Scientific, 2000. Google Scholar
  45. Huaiyu Zhu, Sriram Raghavan, Shivakumar Vaithyanathan, and Alexander Löser. Navigating the intranet with high precision. In WWW, pages 491-500. ACM, 2007. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail