A Framework for Extraction and Transformation of Documents

Riveros, Cristian; Schmid, Markus L.; Schweikardt, Nicole

doi:10.4230/LIPIcs.ICDT.2025.18

Abstract

We present a theoretical framework for the extraction and transformation of text documents as a two-phase process: The first phase uses document spanners to extract information from the input document. The second phase transforms the extracted information into a suitable output.
To support several reasonable extract-transform scenarios, we propose for the first phase an extension of document spanners from span-tuples to so-called multispan-tuples, where variables are mapped to sets of spans instead of only single spans. We focus on multispanners described by regex formulas, and we prove that these have the same desirable properties as standard regular spanners. To formalize the second phase, we consider transformations that map every document-tuple pair, where each tuple comes from the (multi)span-relation extracted in the first phase, to a new output document. The specification of the two phases is what we call an extract-transform (ET) program, which covers practically relevant extract-transform tasks.
Our main technical goal in this paper is to identify a broad class of ET programs that can be evaluated efficiently. We focus on regular ET programs, where the extraction phase is given by a regex multispanner and the transformation phase is given by a regular string-to-string function. We show that for any regular ET program, given an input document, we can enumerate all final output documents with output-linear delay after linear preprocessing. As a by-product, we characterize the expressive power of regular ET programs and show that they have desirable properties, such as being closed under composition.
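
To make the two-phase idea concrete, the following is a minimal Python sketch of an extract-transform pipeline, not the formalism or the enumeration algorithm of the paper. The record layout, the variable names `name` and `numbers`, and the functions `extract`, `transform`, and `run_et_program` are all hypothetical. Spans are represented as half-open Python index pairs, and the transformation is an ordinary Python function standing in for a regular string-to-string function.

```python
import re
from typing import Dict, FrozenSet, Iterator, Tuple

Span = Tuple[int, int]                        # half-open interval [start, end) into the document
MultispanTuple = Dict[str, FrozenSet[Span]]   # each variable maps to a SET of spans

# Hypothetical record layout (an assumption): "<name>: <number>, <number>, ..." per line.
RECORD = re.compile(r"^(?P<name>[A-Za-z]+):(?P<numbers>[^\n]*)$", re.MULTILINE)
NUMBER = re.compile(r"\d+")


def extract(doc: str) -> Iterator[MultispanTuple]:
    """Phase 1: yield one multispan-tuple per record line of the document."""
    for rec in RECORD.finditer(doc):
        start, end = rec.span("numbers")
        # Collect every number inside the record; positions are relative to doc.
        numbers = frozenset(m.span() for m in NUMBER.finditer(doc, start, end))
        yield {"name": frozenset({rec.span("name")}), "numbers": numbers}


def transform(doc: str, t: MultispanTuple) -> str:
    """Phase 2: map a (document, tuple) pair to one output document."""
    (name_span,) = t["name"]                  # this variable holds exactly one span
    name = doc[name_span[0]:name_span[1]]
    nums = [doc[i:j] for (i, j) in sorted(t["numbers"])]
    return f"{name} -> {' | '.join(nums)}"


def run_et_program(doc: str) -> Iterator[str]:
    """Lazily produce one output document per extracted tuple."""
    for t in extract(doc):
        yield transform(doc, t)


doc = "alice: 123, 456\nbob: 789\n"
tuples = list(extract(doc))
outputs = list(run_et_program(doc))
for line in outputs:
    print(line)
```

The variable `numbers` is mapped to a set of spans, which is what a single span-tuple could not express for a record with an unbounded list of entries. The generator only mimics the shape of enumeration; it makes no claim about the delay or preprocessing bounds stated in the abstract.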

