Grammars for Document Spanners

Peterfreund, Liat

doi:10.4230/LIPIcs.ICDT.2021.7

Abstract

We propose a new grammar-based language for defining information-extractors from documents (text) that is built upon the well-studied framework of document spanners for extracting structured data from text. While previously studied formalisms for document spanners are mainly based on regular expressions, we use an extension of context-free grammars, called {extraction grammars}, to define the new class of context-free spanners. Extraction grammars are simply context-free grammars extended with variables that capture interval positions of the document, namely spans. While regular expressions are efficient for tokenizing and tagging, context-free grammars are also efficient for capturing structural properties. Indeed, we show that context-free spanners are strictly more expressive than their regular counterparts. We reason about the expressive power of our new class and present a pushdown-automata model that captures it. We show that extraction grammars can be evaluated with polynomial data complexity. Nevertheless, as the degree of the polynomial depends on the query, we present an enumeration algorithm for unambiguous extraction grammars that, after quintic preprocessing, outputs the results sequentially, without repetitions, with a constant delay between every two consecutive ones.

Jitendra Ajmera, Hyung-Il Ahn, Meena Nagarajan, Ashish Verma, Danish Contractor, Stephen Dill, and Matthew Denesuk. A CRM system for social media: challenges and experiences. In WWW, pages 49-58. ACM, 2013.
Antoine Amarilli, Pierre Bourhis, Stefan Mengel, and Matthias Niewerth. Constant-delay enumeration for nondeterministic document spanners. In ICDT, pages 22:1-22:19, 2019.
Antoine Amarilli, Pierre Bourhis, Stefan Mengel, and Matthias Niewerth. Enumeration on trees with tractable combined complexity and efficient updates. In PODS, pages 89-103, 2019.
Edward Benson, Aria Haghighi, and Regina Barzilay. Event discovery in social media feeds. In ACL, pages 389-398. The Association for Computer Linguistics, 2011.
J. Richard Büchi and Lawrence H. Landweber. Definability in the monadic second-order theory of successor. J. Symb. Log., 34(2):166-170, 1969.
Laura Chiticariu, Rajasekar Krishnamurthy, Yunyao Li, Sriram Raghavan, Frederick Reiss, and Shivakumar Vaithyanathan. SystemT: An algebraic approach to declarative information extraction. In ACL, pages 128-137, 2010.
Johannes Doleschal, Benny Kimelfeld, Wim Martens, Yoav Nahshon, and Frank Neven. Split-correctness in information extraction. In PODS, pages 149-163, 2019.
Johannes Doleschal, Benny Kimelfeld, Wim Martens, and Liat Peterfreund. Weight annotation in information extraction. In ICDT, volume 155 of LIPIcs, pages 8:1-8:18. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2020.
Pál Dömösi. Unusual algorithms for lexicographical enumeration. Acta Cybernetica, 14(3):461-468, 2000.
Jay Earley. An efficient context-free parsing algorithm. Communications of the ACM, 13(2):94-102, 1970.
Ronald Fagin, Benny Kimelfeld, Frederick Reiss, and Stijn Vansummeren. Document spanners: A formal approach to information extraction. J. ACM, 62(2):12, 2015.
Ronald Fagin, Benny Kimelfeld, Frederick Reiss, and Stijn Vansummeren. Declarative cleaning of inconsistencies in information extraction. ACM Trans. Database Syst., 41(1):6:1-6:44, 2016.
Fernando Florenzano, Cristian Riveros, Martín Ugarte, Stijn Vansummeren, and Domagoj Vrgoc. Constant delay algorithms for regular document spanners. In PODS, pages 165-177. ACM, 2018.
Dominik D. Freydenberger. A logic for document spanners. In ICDT, volume 68 of LIPIcs, pages 13:1-13:18. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2017.
Dominik D. Freydenberger. A logic for document spanners. Theory Comput. Syst., 63(7):1679-1754, 2019.
Dominik D. Freydenberger and Mario Holldack. Document spanners: From expressive power to decision problems. In ICDT, volume 48 of LIPIcs, pages 17:1-17:17. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2016.
Dominik D. Freydenberger and Mario Holldack. Document spanners: From expressive power to decision problems. Theory Comput. Syst., 62(4):854-898, 2018.
Dominik D. Freydenberger, Benny Kimelfeld, and Liat Peterfreund. Joining extractions of regular expressions. In PODS, pages 137-149, 2018.
Dominik D. Freydenberger and Liat Peterfreund. Finite models and the theory of concatenation. CoRR, abs/1912.06110, 2019. URL: http://arxiv.org/abs/1912.06110.
Dominik D. Freydenberger and Sam M. Thompson. Dynamic complexity of document spanners. In ICDT, volume 155, pages 11:1-11:21, 2020.
John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman. Introduction to automata theory, languages, and computation - international edition (2. ed). Addison-Wesley, 2003.
John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman. Introduction to automata theory, languages, and computation, 3rd Edition. Pearson international edition. Addison-Wesley, 2007.
Donald E. Knuth. Semantics of context-free languages. Mathematical Systems Theory, 2(2):127-145, 1968.
Donald E. Knuth. Correction: Semantics of context-free languages. Mathematical Systems Theory, 5(1):95-96, 1971.
Clemens Lautemann, Thomas Schwentick, and Denis Thérien. Logics for context-free languages. In CSL, volume 933 of Lecture Notes in Computer Science, pages 205-216. Springer, 1994.
Yaoyong Li, Kalina Bontcheva, and Hamish Cunningham. SVM based learning system for information extraction. In Deterministic and Statistical Methods in Machine Learning, First International Workshop, Sheffield, UK, September 7-10, 2004, Revised Lectures, pages 319-339, 2004.
Yunyao Li, Frederick Reiss, and Laura Chiticariu. SystemT: A declarative information extraction system. In ACL, pages 109-114. ACL, 2011.
Erkki Mäkinen. On lexicographic enumeration of regular and context-free languages. Acta Cybernetica, 13(1):55-61, 1997.
Francisco Maturana, Cristian Riveros, and Domagoj Vrgoč. Document spanners for extracting incomplete information: Expressiveness and complexity. In PODS, pages 125-136, 2018.
Andrea Moro, Marco Tettamanti, Daniela Perani, Caterina Donati, Stefano F Cappa, and Ferruccio Fazio. Syntax and the brain: disentangling grammar by selective anomalies. Neuroimage, 13(1):110-118, 2001.
Takashi Nagashima. A formal deductive system for CFG. Hitotsubashi journal of arts and sciences, 28(1):39-43, 1987.
Yoav Nahshon, Liat Peterfreund, and Stijn Vansummeren. Incorporating information extraction in the relational database model. In WebDB, page 6. ACM, 2016.
Liat Peterfreund, Dominik D. Freydenberger, Benny Kimelfeld, and Markus Kröll. Complexity bounds for relational algebra over document spanners. In PODS, pages 320-334. ACM, 2019.
Liat Peterfreund, Balder ten Cate, Ronald Fagin, and Benny Kimelfeld. Recursive programs for document spanners. In ICDT, volume 127 of LIPIcs, pages 13:1-13:18. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2019.
Christopher De Sa, Alexander Ratner, Christopher Ré, Jaeho Shin, Feiran Wang, Sen Wu, and Ce Zhang. Deepdive: Declarative knowledge base construction. SIGMOD Record, 45(1):60-67, 2016.
Markus L. Schmid. Characterising REGEX languages by regular languages equipped with factor-referencing. Inf. Comput., 249:1-17, 2016.
Rania A Abul Seoud, Abou-Bakr M Youssef, and Yasser M Kadah. Extraction of protein interaction information from unstructured text using a link grammar parser. In 2007 International Conference on Computer Engineering & Systems, pages 70-75. IEEE, 2007.
Warren Shen, AnHai Doan, Jeffrey F. Naughton, and Raghu Ramakrishnan. Declarative information extraction using Datalog with embedded extraction predicates. In VLDB, pages 1033-1044, 2007.
Jason M Smith and David Stotts. SPQR: Flexible automated design pattern extraction from source code. In 18th IEEE International Conference on Automated Software Engineering, pages 215-224. IEEE, 2003.
Leslie G. Valiant. General context-free recognition in less than cubic time. J. Comput. Syst. Sci., 10(2):308-315, 1975. URL: https://doi.org/10.1016/S0022-0000(75)80046-8.
Huang Wen-Ji. Enumerating sentences of context free language based on first one in order. Journal of Computer Research and Development, 41(1):9-14, 2004.
Virginia Vassilevska Williams. Multiplying matrices faster than coppersmith-winograd. In STOC, pages 887-898. ACM, 2012.
Hua Xu, Shane P. Stenner, Son Doan, Kevin B. Johnson, Lemuel R. Waitman, and Joshua C. Denny. MedEx: a medication information extraction system for clinical narratives. JAMIA, 17(1):19-24, 2010. URL: https://doi.org/10.1197/jamia.M3378.
Akane Yakushiji, Yuka Tateisi, Yusuke Miyao, and Jun-ichi Tsujii. Event extraction from biomedical papers using a full parser. In Biocomputing 2001, pages 408-419. World Scientific, 2000.
Huaiyu Zhu, Sriram Raghavan, Shivakumar Vaithyanathan, and Alexander Löser. Navigating the intranet with high precision. In WWW, pages 491-500. ACM, 2007.

Grammars for Document Spanners

Author Liat Peterfreund

File

Document Identifiers

Author Details

Acknowledgements

Cite AsGet BibTex

Abstract

Subject Classification

ACM Subject Classification

Keywords

Metrics

References

Thanks for your feedback!

Could not send message

Grammars for Document Spanners

Author Liat Peterfreund

File

Document Identifiers

Author Details

Funding

Acknowledgements

Cite AsGet BibTex

Abstract

Subject Classification

ACM Subject Classification

Keywords

Metrics

Related Versions

References

Thanks for your feedback!

Could not send message