Recursive Programs for Document Spanners

Authors: Liat Peterfreund, Balder ten Cate, Ronald Fagin, Benny Kimelfeld

Author Details

Liat Peterfreund
  • Technion, Haifa 32000, Israel
Balder ten Cate
  • Google, Inc., Mountain View, CA 94043, USA
Ronald Fagin
  • IBM Research - Almaden, San Jose, CA 95120, USA
Benny Kimelfeld
  • Technion, Haifa 32000, Israel

Cite As

Liat Peterfreund, Balder ten Cate, Ronald Fagin, and Benny Kimelfeld. Recursive Programs for Document Spanners. In 22nd International Conference on Database Theory (ICDT 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 127, pp. 13:1-13:18, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2019)
https://doi.org/10.4230/LIPIcs.ICDT.2019.13

Abstract

A document spanner models a program for Information Extraction (IE) as a function that takes as input a text document (string over a finite alphabet) and produces a relation of spans (intervals in the document) over a predefined schema. A well-studied language for expressing spanners is that of the regular spanners: relational algebra over regex formulas, which are regular expressions with capture variables. Equivalently, the regular spanners are the ones expressible in non-recursive Datalog over regex formulas (which extract relations that constitute the extensional database). This paper explores the expressive power of recursive Datalog over regex formulas. We show that such programs can express precisely the document spanners computable in polynomial time. We compare this expressiveness to known formalisms such as the closure of regex formulas under the relational algebra and string equality. Finally, we extend our study to a recently proposed framework that generalizes both the relational model and the document spanners.
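
To make the setting concrete, here is a minimal sketch (not taken from the paper) of a recursive program over span relations, evaluated by naive fixpoint iteration. It is an illustration under stated assumptions: spans are modelled as half-open index pairs `(i, j)`, the two extensional relations `short` and `wrap` are hypothetical names that stand in for what regex formulas would extract, and the chosen program recognises palindromic spans, a property no regular spanner can express. The program being evaluated is roughly `Pal(x) :- Short(x)` and `Pal(x) :- Wrap(x, z), Pal(z)`.

```python
def all_spans(doc):
    """All spans (i, j) with 0 <= i <= j <= len(doc); doc[i:j] is the covered text."""
    n = len(doc)
    return [(i, j) for i in range(n + 1) for j in range(i, n + 1)]


def short(doc):
    """Extensional relation Short(x): spans of length 0 or 1.

    Assumption: stands in for a regex formula that captures the empty
    string or a single letter in variable x.
    """
    return {(i, j) for (i, j) in all_spans(doc) if j - i <= 1}


def wrap(doc):
    """Extensional relation Wrap(x, z): x covers 'a z a' for some letter a.

    Assumption: stands in for a finite union of regex formulas, one per
    letter a of the alphabet, each capturing x and the inner span z.
    """
    return {
        ((i, j), (i + 1, j - 1))
        for (i, j) in all_spans(doc)
        if j - i >= 2 and doc[i] == doc[j - 1]
    }


def palindromes(doc):
    """Naive fixpoint evaluation of the recursive program for Pal."""
    edb = wrap(doc)
    pal = set(short(doc))  # base rule: Pal(x) :- Short(x)
    while True:
        # recursive rule: Pal(x) :- Wrap(x, z), Pal(z)
        new = {x for (x, z) in edb if z in pal} - pal
        if not new:
            return pal
        pal |= new


if __name__ == "__main__":
    doc = "abba"
    long_pals = sorted(s for s in palindromes(doc) if s[1] - s[0] >= 2)
    print(long_pals)  # [(0, 4), (1, 3)]
```

A document of length n has O(n^2) spans, so every relation of fixed arity holds polynomially many tuples and the loop above stops after polynomially many rounds. This is the standard reason Datalog evaluation runs in polynomial time in the size of the input, which is the easy direction of the characterisation stated in the abstract.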

Subject Classification

ACM Subject Classification
  • Theory of computation → Complexity theory and logic
  • Information systems → Relational database model
  • Information systems → Data model extensions
Keywords
  • Information Extraction
  • Document Spanners
  • Polynomial Time
  • Recursion
  • Regular Expressions
  • Datalog
