A Logic for Document Spanners

Author Dominik D. Freydenberger






Cite As

Dominik D. Freydenberger. A Logic for Document Spanners. In 20th International Conference on Database Theory (ICDT 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 68, pp. 13:1-13:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017) https://doi.org/10.4230/LIPIcs.ICDT.2017.13

Abstract

Document spanners are a formal framework for information extraction that was introduced by [Fagin, Kimelfeld, Reiss, and Vansummeren, J.ACM, 2015]. One of the central models in this framework is the class of core spanners, which are based on regular expressions with variables that are then extended with an algebra. As shown by [Freydenberger and Holldack, ICDT, 2016], there is a connection between core spanners and EC^{reg}, the existential theory of concatenation with regular constraints. The present paper further develops this connection by defining SpLog, a fragment of EC^{reg} that has the same expressive power as core spanners. The equivalence goes beyond expressive power: we show the existence of polynomial-time conversions between this fragment and core spanners. This even holds for variants of core spanners that are based on automata instead of regular expressions. Applications of this approach include an alternative way of defining relations for spanners, insights into the relative succinctness of various classes of spanner representations, and a pumping lemma for core spanners.
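
To make the notion of a spanner concrete, the following is a minimal brute-force sketch of the semantics, not the paper's construction or its polynomial-time conversions. It assumes the usual convention that a span [i, j> of a document d satisfies 1 <= i <= j <= |d| + 1 and denotes the substring from position i to j - 1. The function names (`spans`, `content`, `extract`, `select_equal`) are hypothetical and chosen only for this illustration; Python's `re` module stands in for a regex formula with a single variable, and `select_equal` mimics string-equality selection, one of the algebra operators of core spanners.

```python
import re
from itertools import product


def spans(doc):
    """Enumerate all spans [i, j> of doc, with 1 <= i <= j <= len(doc) + 1."""
    n = len(doc)
    return [(i, j) for i in range(1, n + 2) for j in range(i, n + 2)]


def content(doc, span):
    """Return the substring of doc that the span [i, j> refers to."""
    i, j = span
    return doc[i - 1 : j - 1]


def extract(doc, pattern):
    """Unary spanner: every span whose content fully matches `pattern`.

    This plays the role of a regex formula with one variable x that may
    occur anywhere in the document (an assumption made for illustration).
    """
    rx = re.compile(pattern)
    return [s for s in spans(doc) if rx.fullmatch(content(doc, s))]


def select_equal(doc, relation):
    """String-equality selection: keep pairs of spans with equal content."""
    return [(x, y) for (x, y) in relation if content(doc, x) == content(doc, y)]


if __name__ == "__main__":
    doc = "abab"
    xs = extract(doc, r"ab")
    print(xs)
    print(select_equal(doc, product(xs, xs)))
```

Note that two different spans can have the same content: in the document `abab`, the spans [1, 3> and [3, 5> both refer to `ab`. String-equality selection compares contents rather than positions, which is what distinguishes core spanners from purely regular ones.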

Keywords
  • information extraction
  • document spanners
  • word equations
  • regex
  • descriptional complexity

References

  1. P. Barceló and P. Muñoz. Graph logics with rational relations: the role of word combinatorics. In Proc. CSL-LICS 2014, 2014.
  2. J.-C. Birget. Intersection and union of regular languages and state complexity. Inform. Process. Lett., 43(4):185-190, 1992.
  3. B. Carle and P. Narendran. On extended regular expressions. In Proc. LATA 2009, 2009.
  4. L. Ciobanu, V. Diekert, and M. Elder. Solution sets for equations over free groups are EDT0L languages. In Proc. ICALP 2015, 2015.
  5. Elena Czeizler. The non-parametrizability of the word equation xyz=zvx: A short proof. Theor. Comput. Sci., 345(2-3):296-303, 2005.
  6. V. Diekert. Makanin’s Algorithm. In M. Lothaire, editor, Algebraic Combinatorics on Words, chapter 12. Cambridge University Press, 2002.
  7. V. Diekert. More than 1700 years of word equations. In Proc. CAI 2015, 2015.
  8. V. Diekert, A. Jeż, and W. Plandowski. Finding all solutions of equations in free groups and monoids with involution. In Proc. CSR 2014, 2014.
  9. V. G. Durnev. Undecidability of the positive ∀∃³-theory of a free semigroup. Sib. Math. J., 36(5):917-929, 1995.
  10. A. Ehrenfeucht and G. Rozenberg. A pumping theorem for EDT0L languages. Technical report, Tech. Rep. CU-CS-047-74, University of Colorado, 1974.
  11. R. Fagin, B. Kimelfeld, F. Reiss, and S. Vansummeren. Document spanners: A formal approach to information extraction. J. ACM, 62(2):12, 2015.
  12. D. D. Freydenberger and M. Holldack. Document spanners: From expressive power to decision problems. In Proc. ICDT 2016, 2016.
  13. M. R. Garey and D. S. Johnson. Computers and Intractability. W. H. Freeman and Company, 1979.
  14. S. Ginsburg and E. H. Spanier. Bounded regular sets. Proc. AMS, 17(5):1043-1049, 1966.
  15. J. E. Hopcroft and J. D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, 1979.
  16. L. Ilie. Subwords and power-free words are not expressible by word equations. Fundam. Inform., 38(1-2):109-118, 1999.
  17. L. Ilie and W. Plandowski. Two-variable word equations. ITA, 34(6):467-501, 2000.
  18. J. Karhumäki, F. Mignosi, and W. Plandowski. The expressibility of languages and relations by word equations. J. ACM, 47(3):483-505, 2000.
  19. J. Karhumäki, W. Plandowski, and W. Rytter. Generalized factorizations of words and their algorithmic properties. Theor. Comput. Sci., 218(1):123-133, 1999.
  20. J. Karhumäki and A. Saarela. An analysis and a reproof of Hmelevskii’s theorem. In Proc. DLT 2008, 2008.
  21. G. Navarro. A guided tour to approximate string matching. ACM Comput. Surv., 33(1):31-88, 2001.
  22. D. Reidenbach and M. L. Schmid. Patterns with bounded treewidth. Inform. Comput., 239:87-99, 2014.
  23. M. L. Schmid. Characterising REGEX languages by regular languages equipped with factor-referencing. Inform. Comput., 249:1-17, 2016.