Recursive Programs for Document Spanners

Liat Peterfreund, Balder ten Cate, Ronald Fagin, Benny Kimelfeld

doi:10.4230/LIPIcs.ICDT.2019.13

Abstract

A document spanner models a program for Information Extraction (IE) as a function that takes as input a text document (string over a finite alphabet) and produces a relation of spans (intervals in the document) over a predefined schema. A well-studied language for expressing spanners is that of the regular spanners: relational algebra over regex formulas, which are regular expressions with capture variables. Equivalently, the regular spanners are the ones expressible in non-recursive Datalog over regex formulas (which extract relations that constitute the extensional database). This paper explores the expressive power of recursive Datalog over regex formulas. We show that such programs can express precisely the document spanners computable in polynomial time. We compare this expressiveness to known formalisms such as the closure of regex formulas under the relational algebra and string equality. Finally, we extend our study to a recently proposed framework that generalizes both the relational model and the document spanners.
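
To make the abstract's notions concrete, the following is a minimal sketch, not taken from the paper, of a regex-formula extractor and a recursive Datalog-style rule evaluated by naive fixpoint iteration. It assumes the usual convention that a span [i, j> is 1-indexed with an exclusive end. The helper names (`all_spans`, `extract`, `wrap`, `balanced`) and the example program are hypothetical and chosen only for illustration. The program derives every span whose content has the form a^n b^n, a non-regular property, so it hints at why recursion adds expressive power over the regular spanners.

```python
import re


def all_spans(doc):
    """All spans [i, j> of doc, 1-indexed, with 1 <= i <= j <= len(doc) + 1."""
    n = len(doc)
    return [(i, j) for i in range(1, n + 2) for j in range(i, n + 2)]


def content(doc, span):
    """The substring of doc that a span covers."""
    i, j = span
    return doc[i - 1 : j - 1]


def extract(doc, gamma):
    """Sketch of the regex formula  Sigma* x{gamma} Sigma*  for a variable-free
    regex gamma: the set of all spans whose content fully matches gamma."""
    pattern = re.compile(gamma)
    return {s for s in all_spans(doc) if pattern.fullmatch(content(doc, s))}


def wrap(doc):
    """Sketch of the regex formula  Sigma* x{ a y{Sigma*} b } Sigma*:
    pairs (x, y) where x starts with 'a', ends with 'b', and y is the inside."""
    pairs = set()
    for i, j in all_spans(doc):
        if j - i >= 2 and doc[i - 1] == "a" and doc[j - 2] == "b":
            pairs.add(((i, j), (i + 1, j - 1)))
    return pairs


def balanced(doc):
    """Naive fixpoint evaluation of the hypothetical recursive program
         Bal(x) <- Empty(x)
         Bal(x) <- Wrap(x, y), Bal(y)
    where Empty and Wrap are the extracted (extensional) relations."""
    bal = extract(doc, "")  # base case: all empty spans
    edb = wrap(doc)
    while True:
        new = {x for (x, y) in edb if y in bal} - bal
        if not new:
            return bal
        bal |= new


if __name__ == "__main__":
    doc = "aabb"
    nonempty = sorted(s for s in balanced(doc) if s[0] != s[1])
    print([(s, content(doc, s)) for s in nonempty])
    # prints [((1, 5), 'aabb'), ((2, 4), 'ab')]
```

The sketch enumerates all O(n^2) spans of the document and iterates until no new facts are derived, so it runs in time polynomial in the document length. It makes no attempt at efficiency.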

