A Purely Regular Approach to Non-Regular Core Spanners

Authors Markus L. Schmid , Nicole Schweikardt



PDF
Thumbnail PDF

File

LIPIcs.ICDT.2021.4.pdf
  • Filesize: 0.8 MB
  • 19 pages

Document Identifiers

Author Details

Markus L. Schmid
  • Humboldt-Universität zu Berlin, Germany
Nicole Schweikardt
  • Humboldt-Universität zu Berlin, Germany

Acknowledgements

We wish to thank the anonymous reviewers for their valuable feedback that improved the readability of this paper.

Cite AsGet BibTex

Markus L. Schmid and Nicole Schweikardt. A Purely Regular Approach to Non-Regular Core Spanners. In 24th International Conference on Database Theory (ICDT 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 186, pp. 4:1-4:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)
https://doi.org/10.4230/LIPIcs.ICDT.2021.4

Abstract

The regular spanners (characterised by vset-automata) are closed under the algebraic operations of union, join and projection, and have desirable algorithmic properties. The core spanners (introduced by Fagin, Kimelfeld, Reiss, and Vansummeren (PODS 2013, JACM 2015) as a formalisation of the core functionality of the query language AQL used in IBM’s SystemT) additionally need string equality selections and it has been shown by Freydenberger and Holldack (ICDT 2016, Theory of Computing Systems 2018) that this leads to high complexity and even undecidability of the typical problems in static analysis and query evaluation. We propose an alternative approach to core spanners: by incorporating the string-equality selections directly into the regular language that represents the underlying regular spanner (instead of treating it as an algebraic operation on the table extracted by the regular spanner), we obtain a fragment of core spanners that, while having slightly weaker expressive power than the full class of core spanners, arguably still covers the intuitive applications of string equality selections for information extraction and has much better upper complexity bounds of the typical problems in static analysis and query evaluation.

Subject Classification

ACM Subject Classification
  • Information systems → Information retrieval
  • Theory of computation → Automata extensions
  • Theory of computation → Regular languages
  • Theory of computation → Design and analysis of algorithms
  • Theory of computation → Database query languages (principles)
Keywords
  • Document spanners
  • regular expressions with backreferences

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Antoine Amarilli, Pierre Bourhis, Stefan Mengel, and Matthias Niewerth. Constant-delay enumeration for nondeterministic document spanners. In 22nd International Conference on Database Theory, ICDT 2019, March 26-28, 2019, Lisbon, Portugal, pages 22:1-22:19, 2019. Google Scholar
  2. Arturs Backurs and Piotr Indyk. Which regular expression patterns are hard to match? In IEEE 57th Annual Symposium on Foundations of Computer Science, FOCS 2016, 9-11 October 2016, Hyatt Regency, New Brunswick, New Jersey, USA, pages 457-466, 2016. URL: https://doi.org/10.1109/FOCS.2016.56.
  3. Johannes Doleschal, Benny Kimelfeld, Wim Martens, Yoav Nahshon, and Frank Neven. Split-correctness in information extraction. In Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019, pages 149-163, 2019. Google Scholar
  4. Ronald Fagin, Benny Kimelfeld, Frederick Reiss, and Stijn Vansummeren. Document spanners: A formal approach to information extraction. J. ACM, 62(2):12:1-12:51, 2015. Google Scholar
  5. Henning Fernau, Markus L. Schmid, and Yngve Villanger. On the parameterised complexity of string morphism problems. Theory of Computing Systems (ToCS), 59(1):24-51, 2016. Google Scholar
  6. Fernando Florenzano, Cristian Riveros, Martín Ugarte, Stijn Vansummeren, and Domagoj Vrgoc. Constant delay algorithms for regular document spanners. In Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Houston, TX, USA, June 10-15, 2018, pages 165-177, 2018. Google Scholar
  7. Dominik D. Freydenberger. A logic for document spanners. Theory Comput. Syst., 63(7):1679-1754, 2019. URL: https://doi.org/10.1007/s00224-018-9874-1.
  8. Dominik D. Freydenberger and Mario Holldack. Document spanners: From expressive power to decision problems. Theory Comput. Syst., 62(4):854-898, 2018. URL: https://doi.org/10.1007/s00224-017-9770-0.
  9. Dominik D. Freydenberger, Benny Kimelfeld, and Liat Peterfreund. Joining extractions of regular expressions. In Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Houston, TX, USA, June 10-15, 2018, pages 137-149, 2018. Google Scholar
  10. Dominik D. Freydenberger and Sam M. Thompson. Dynamic complexity of document spanners. In 23rd International Conference on Database Theory, ICDT 2020, March 30 - April 2, 2020, Copenhagen, Denmark, pages 11:1-11:21, 2020. URL: https://doi.org/10.4230/LIPIcs.ICDT.2020.11.
  11. Florin Manea and Markus L. Schmid. Matching patterns with variables. In Combinatorics on Words - 12th International Conference, WORDS 2019, Loughborough, UK, September 9-13, 2019, Proceedings, pages 1-27, 2019. URL: https://doi.org/10.1007/978-3-030-28796-2_1.
  12. Francisco Maturana, Cristian Riveros, and Domagoj Vrgoc. Document spanners for extracting incomplete information: Expressiveness and complexity. In Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Houston, TX, USA, June 10-15, 2018, pages 125-136, 2018. Google Scholar
  13. Liat Peterfreund. The Complexity of Relational Queries over Extractions from Text. PhD thesis, Computer science department, Technion, 2019. Google Scholar
  14. Liat Peterfreund. Grammars for document spanners. In 24th International Conference on Database Theory, ICDT 2021, March 23-26, 2021, Nicosia, Cyprus, 2021. To appear. Extended version available at URL: https://arxiv.org/abs/2003.06880.
  15. Liat Peterfreund, Dominik D. Freydenberger, Benny Kimelfeld, and Markus Kröll. Complexity bounds for relational algebra over document spanners. In Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019., pages 320-334, 2019. Google Scholar
  16. Liat Peterfreund, Balder ten Cate, Ronald Fagin, and Benny Kimelfeld. Recursive programs for document spanners. In 22nd International Conference on Database Theory, ICDT 2019, March 26-28, 2019, Lisbon, Portugal, pages 13:1-13:18, 2019. Google Scholar
  17. Markus L. Schmid. Characterising REGEX languages by regular languages equipped with factor-referencing. Information and Computation, 249:1-17, 2016. Google Scholar
  18. Markus L. Schmid and Nicole Schweikardt. A purely regular approach to non-regular core spanners. CoRR, abs/2010.13442, 2020. URL: http://arxiv.org/abs/2010.13442.
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail