A Framework for Extraction and Transformation of Documents
Abstract
We present a theoretical framework for the extraction and transformation of text documents as a two-phase process: The first phase uses document spanners to extract information from the input document. The second phase transforms the extracted information into a suitable output.
To support several reasonable extract-transform scenarios, we propose for the first phase an extension of document spanners from span-tuples to so-called multispan-tuples, where variables are mapped to sets of spans instead of only single spans. We focus on multispanners described by regex formulas, and we prove that these have the same desirable properties as standard regular spanners. To formalize the second phase, we consider transformations that map every pair (document, tuple), where each tuple comes from the (multi)span-relation extracted in the first phase, into a new output document. The specification of the two phases is what we call an extract-transform (ET) program, which covers practically relevant extract-transform tasks.
In this paper, our main technical goal is to identify a broad class of ET programs that can be evaluated efficiently. We specifically focus on the scenario of regular ET programs: the extraction phase is given by a regex multispanner and the transformation phase is given by a regular string-to-string function. We show that for any regular ET program, given an input document, we can enumerate all final output documents with output-linear delay after linear preprocessing. As a side effect, we characterize the expressive power of regular ET programs and also show that they have desirable properties, like being closed under composition.
Keywords and phrases:
Information extraction, Document spanners, Transducers, Query evaluation
Funding:
Cristian Riveros: Supported by ANID Fondecyt Regular project 1230935, by ANID – Millennium Science Initiative Program – Code ICN17_002, and by the Alexander von Humboldt Foundation. This work was developed during a research fellowship at the Humboldt University, funded by the Alexander von Humboldt Foundation.
1 Introduction
Information extraction (IE) of text documents found its theoretical foundations in the framework of document spanners [18]. Introduced by Fagin, Kimelfeld, Reiss, and Vansummeren one decade ago [17], this framework sets the grounds for rule-based IE through the notion of spanners and how to combine them by using relational operators. Moreover, it has inspired a large body of research on languages [21, 34], expressiveness [20, 29, 36, 14, 39], evaluation [19, 8, 14, 11, 22, 35, 24], provenance [15, 13], and compressed evaluation [40, 42, 31] for IE. Initially conceived to understand SystemT – IBM’s rule-based IE system – it has recently found promising implementations [37]. See [41, 7] for surveys of the area.
Although IE is a crucial task for data management, document extraction is usually followed by a transformation into new data objects. Indeed, data transformation is essential for communicating middleware systems, where data does not conform to the required format of a third-party system and needs to be converted. This forms the main task of the so-called ETL technologies [44] (for Extract-Transform-Load), which are crucial parts of today’s data management workflows. Even the search-and-replace operation over text, a ubiquitous task in all text systems, can be conceived as an extraction (i.e., search) followed by a transformation (i.e., replace). Hence, although document spanners provide the formal ground of rule-based IE, they are incomplete without understanding the subsequent transformation processes.
In this paper, we study the extraction and transformation of text documents through the lens of document spanners. We formulate extract-transform programs (ET programs for short) as a two-phase process: first, an extraction phase extracts some information from the input document, and then a transformation phase maps the document and the extracted information into the desired output. Depending on the particular application context, the output may be a relational database, graph data, a single document, or a collection of documents. More specifically, the first phase is based on information extraction with document spanners, i. e., it extracts span-tuples from a document. The second phase is deliberately left underspecified at this point. In principle, any class of functions could be used here, depending on the expressive power and algorithmic properties that we aim for, which, in turn, depend on the specific application context.
Let us now explain and motivate our framework of ET programs with some examples. Consider the following document D over alphabet Σ that lists famous English-speaking singers in the format # [person] # [person] # … # [person] # with [person] = [last name]; [first name]; [birthplace]; [opt]; [opt]; …, where [opt] are optional attributes like, e. g., age, nickname, honorary titles etc, which follow no fixed scheme. This means that “#” and “;” are used as separators, and for convenience, we will write Γ for the set of all symbols of Σ other than “#” and “;”.
# Holiday; Billie; USA # Bush; Kate; England # Young; Neil; Canada; 78; ‘‘Godfather of Grunge’’ # King; Carole; USA; 81 # McCartney; Paul; England; Sir; CH; MBE # Mitchell; Joni; Canada; painter #
As mentioned above, the extraction phase is performed by applying a document spanner to D. For example, consider the document spanner specified by the following regex formula (cf. [18]): α := Σ* · # · x{Γ*} · ; · y{Γ*} · ; · Σ*, which uses a variable x to extract just any factor over Γ that occurs between separators “#” and “;”, and a variable y to extract the following factor over Γ between the next two occurrences of “;”. The construct x{β} creates a span pointing to a factor satisfying the subexpression β. The regex formula α specifies a spanner ⟦α⟧ that, on our example document D, produces a set of span-tuples, one per singer. Hence, every span-tuple can be interpreted as representing a pair of factors extracted from our document, i. e., one span-tuple represents (Holiday, Billie), another span-tuple represents (Bush, Kate), and so on. Consequently, this spanner extracts all the names of the singers of our input document.
As a reasonable transformation task, we now might want to produce lots of small documents, each consisting of the name of a person mentioned in D, but in the format “[first name] [last name]”. In the given example, this results in the collection of the following documents: Billie Holiday, Kate Bush, Neil Young, Carole King, Paul McCartney, Joni Mitchell. Formally, this transformation is done by a function T that maps each pair (D, t) with t ∈ ⟦α⟧(D) to the string obtained by concatenating the value of t(y), a blank, and the value of t(x). Note that the order of first and last names has to be swapped.
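To make the two-phase view concrete, the following is a minimal Python sketch of this first ET program, using the standard re module as a stand-in for the regex formula α. The names extract and transform, and the exact pattern, are our own illustration rather than the formal spanner semantics.

```python
import re

# A sketch of the introductory ET program: variable x captures the last name,
# variable y the first name; the transformation swaps them.
DOC = ("# Holiday; Billie; USA # Bush; Kate; England "
       "# Young; Neil; Canada; 78; ''Godfather of Grunge'' "
       "# King; Carole; USA; 81 # McCartney; Paul; England; Sir; CH; MBE "
       "# Mitchell; Joni; Canada; painter #")

def extract(doc):
    """Extraction phase: one (x, y) span pair per singer."""
    # x = factor between '#' and ';', y = factor between the next two ';'.
    pattern = re.compile(r"#\s*(?P<x>[^#;]+);\s*(?P<y>[^#;]+);")
    return [(m.span("x"), m.span("y")) for m in pattern.finditer(doc)]

def transform(doc, t):
    """Transformation phase: output '[first name] [last name]'."""
    (xs, xe), (ys, ye) = t
    return doc[ys:ye].strip() + " " + doc[xs:xe].strip()

for t in extract(DOC):
    print(transform(DOC, t))   # Billie Holiday, Kate Bush, ...
```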
Let us move on to a more complicated example. We want to define an ET program which maps D to a collection of XML documents, one per singer.
This means that for every singer with optional attributes we want to construct an XML document that contains the name of the respective singer and all the optional attributes. A natural way to approach the corresponding extraction task is to extract the name and surname in two variables as before, and to extract each element of the unbounded list [opt-1]; [opt-2]; …, [opt-k] in its own variable. This, however, goes beyond the capability of document spanners, since we would need an unbounded number of different variables.
As a remedy, in this paper we propose to extend the classical spanner framework to multispanners, which can extract in each variable a set of spans instead of only a single span. For example, assume that we apply to the document D the multispanner regex formula:
α' := Σ* · # · x{Γ*} · ; · y{Γ*} · ; · Γ* · (; · z{Γ*})* · # · Σ*.
Then for the factor # Young; Neil; Canada; 78; ‘‘Godfather of Grunge’’ #, we will extract each optional attribute as a single span of variable z, so the corresponding multispan-tuple maps x and y to the spans of “Young” and “Neil”, respectively, and z to the set containing one span for “78” and one span for “‘‘Godfather of Grunge’’”.
Thus, by defining a suitable transformation T, we get an ET program for our task.
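The following is a small Python sketch of this multispan extraction (again with re as a stand-in; all names and the pattern are our own): variable z receives a set of spans, one per optional attribute.

```python
import re

# Variable z is bound to a *set* of spans, one per optional attribute;
# x and y capture last and first name as before. Our own encoding.
RECORD = re.compile(r"#(?P<x>[^#;]+);(?P<y>[^#;]+);(?P<rest>[^#]*)")

def extract_multi(doc):
    for m in RECORD.finditer(doc):
        fields = m.group("rest").split(";")
        # the first field of `rest` is the birthplace; the remaining
        # fields are the optional attributes that z captures one by one
        z_spans, pos = [], m.start("rest")
        for i, f in enumerate(fields):
            if i >= 1 and f.strip():
                z_spans.append((pos, pos + len(f)))
            pos += len(f) + 1   # +1 for the consumed ';'
        yield {"x": [m.span("x")], "y": [m.span("y")], "z": z_spans}

doc = "# Young; Neil; Canada; 78; ''Godfather of Grunge'' #"
t = next(extract_multi(doc))
print(len(t["z"]))   # 2: one span for '78', one for the honorary title
```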
Our contributions
We start the study of ET programs from an evaluation perspective. Towards this goal, we devise a specific instantiation of our ET framework that achieves a desirable balance between the following objectives:
Robustness.
Rather than defining ad-hoc ET programs, we are interested in identifying a class of ET programs that, for the extraction and transformation phase, uses established and robust computational models that have several equivalent (and user-friendly) description mechanisms. For the extraction phase, we use a class of multispanners that can be described by a variant of the regex formulas already used for classical spanners (see [17]); see the example expressions α and α' from above. This model is particularly user-friendly and has good algorithmic properties. Our transformation functions are based on the class of regular string-to-string functions, a well-understood formalism that allows many equivalent description mechanisms.
Expressive Power.
Our instantiation of the ET-framework, called regular ET programs, can describe several common extract and transform tasks, including the two examples discussed above.
Efficient Enumeration.
The most relevant computational task is to compute all outputs of a regular ET program. Since there are potentially many such outputs (even exponential in the number of variables), an enumeration algorithm is desirable. One could treat the two phases separately: Enumerate all span-tuples extracted in the first phase (a task for which algorithms are known [8, 31]) and then apply the transformation function over each enumerated span-tuple. However, the delay of such an enumeration procedure depends on the size of the input document, since it is an argument of the transformation function. Instead, we design an enumeration algorithm with linear preprocessing that directly enumerates all output documents with output-linear delay.
Composability.
Regular ET programs satisfy a property that is in general desirable for practical applications: Composability. This means that if we take the outputs of one ET program and feed them in as inputs for another ET program, then the obtained transformation can also be described by a single ET program. In particular, this also means that we can efficiently enumerate the output documents of the transformation process described by composing several regular ET programs.
Outline
After reviewing further related work, Section 2 introduces the class of multispanners. In Section 3, we present the general setting of ET programs. Then, in Section 4, we instantiate this setting to the case of regular ET programs. In Section 5, we study the expressive power of regular ET programs, and in Section 6, we present our evaluation algorithm. In Section 7, we study the composition of such programs. Finally, we discuss future work in Section 8. Due to space limitations, proof details can be found in the online version [38].
Further related work
String transductions [9, 32, 10] – a classical model in computer science – have recently gained renewed attention with the theory of polyregular functions [2]. Although we can see an ET program as a single transduction, our work has several novel contributions compared to classical string transductions. Firstly, our framework models the process by two declarative phases (which is natural from a data management perspective), contrary to string transductions that model the task as a single process. Secondly, we are concerned with bag semantics, which are usually not considered in the context of transductions. Moreover, efficient enumeration algorithms have not been studied in this context.
On the practical side, there are systems for transforming documents into documents (e.g., [27]). For example, such systems use a combination of regular expressions with replace operators [28] or parsing followed by a transduction over the parsing tree [27]. Indeed, in practice, regular expressions support some special commands for transforming data (also called substitutions). Our study has a theoretical emphasis on information extraction. To the best of our knowledge, previous systems neither use the expressive power of document spanners nor regular functions. In particular, previous systems cannot define queries like our XML transformation above through regular expressions plus a transformation.
2 Multispanners
Let us first give some standard notations. By 2^S we denote the power set of a set S. Let [n] := {1, 2, …, n} for every n ≥ 1. For a finite alphabet Σ, let Σ+ denote the set of non-empty words over Σ, and Σ* := Σ+ ∪ {ε}, where ε is the empty word. For a word w ∈ Σ*, |w| denotes its length (in particular, |ε| = 0). A word u is a factor of w if there are v, v' ∈ Σ* with w = v · u · v'; u is a prefix or suffix of w, if v = ε or v' = ε, respectively. For every i ∈ [|w|], let w[i] denote the symbol at position i of w. We use DFAs and NFAs (deterministic and nondeterministic finite automata, resp.) as commonly defined.
Multispans and multispanners
For a document D ∈ Σ* and for every i, j ∈ [|D| + 1] with i ≤ j, [i, j⟩ is a span of D and its value, denoted by D_[i,j⟩, is the substring of D from symbol i to symbol j − 1. In particular, D_[i,i⟩ = ε (and [i, i⟩ is called an empty span) and D_[1,|D|+1⟩ = D. By Spans(D), we denote the set of spans of D, and by Spans we denote the set of all spans [i, j⟩ with i ≤ j. Two spans [i, j⟩ and [i', j'⟩ are disjoint if j ≤ i' or j' ≤ i. A multispan is a (possibly empty) set of pairwise disjoint spans. Let X be a finite set of variables. A span-tuple (over a document D and variables X) [18] is a function t: X → Spans(D). We define a multispan-tuple as a function t: X → 2^{Spans(D)} such that, for every x ∈ X, t(x) is a multispan. Note that every span-tuple can be considered as a special case of a multispan-tuple with |t(x)| = 1 for every x ∈ X. For simplicity, we usually denote multispan-tuples in tuple-notation, for which we assume an order on X. For example, if X = {x, y, z} with x < y < z, the multispan-tuple t = ({[1, 2⟩, [4, 6⟩}, {[2, 5⟩}, ∅) maps x to {[1, 2⟩, [4, 6⟩}, y to {[2, 5⟩} and z to ∅. Note that [1, 2⟩ and [4, 6⟩ are disjoint, while [2, 5⟩ and [4, 6⟩ are not, which is allowed, since they are spans of different multispans.
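As a quick illustration of these notions, here is a minimal sketch in Python, with a span [i, j⟩ encoded as a 1-based pair (i, j) (this encoding is our own, for illustration only):

```python
# Spans as 1-based pairs (i, j), matching the [i, j> notation above.
def value(doc, span):
    i, j = span
    return doc[i - 1 : j - 1]        # substring from symbol i to symbol j-1

def disjoint(s1, s2):
    (i1, j1), (i2, j2) = s1, s2
    return j1 <= i2 or j2 <= i1      # one span ends before the other starts

doc = "baabbab"
assert value(doc, (2, 4)) == "aa" and value(doc, (5, 5)) == ""   # empty span
assert disjoint((1, 2), (4, 6)) and not disjoint((2, 5), (4, 6))
```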
A multispan-relation (over a document D and variables X) is a possibly empty set of multispan-tuples over D and X. Given a finite alphabet Σ, a multispanner (over Σ and X) is a function P that maps every document D ∈ Σ* to a multispan-relation P(D) over D and X. Note that the empty relation is also a valid image of a multispanner.
Example 1.
Let P be a multispanner over alphabet Σ = {a, b} and variables X = {x, z} that maps every document D to the set of all multispan-tuples t such that t(x) = {s}, where D_s is a factor of D that starts and ends with a, and is not directly preceded or followed by another a, and t(z) is the multispan that contains a span for each maximal unary (i.e., of the form a^k or b^k) factor of D_s. For example, P(baabbab) = {t1, t2, t3} with:
t1 = ({[2, 4⟩}, {[2, 4⟩}), t2 = ({[6, 7⟩}, {[6, 7⟩}), and t3 = ({[2, 7⟩}, {[2, 4⟩, [4, 6⟩, [6, 7⟩}).
Similarly to the framework of document spanners [18], it is convenient to represent multispanners over Σ and X by formal languages over the alphabet Σ ∪ Γ_X, where Γ_X is a set of meta-symbols defined below. This allows us to represent multispanners by formal language descriptors, e. g., regular expressions.
Representing multispans by multiref-words
In this section, we adapt the concept of ref-words (commonly used for classical document spanners; see [23, 21, 14, 41]) to multispanners.
For any set X of variables, we shall use the set Γ_X := {x⊢, ⊣x | x ∈ X} as an alphabet of meta-symbols. In particular, for every x ∈ X, we interpret the pair of symbols x⊢ and ⊣x as a pair of opening and closing parentheses. A multiref-word (over alphabet Σ and variables X) is a word w ∈ (Σ ∪ Γ_X)* such that, for every x ∈ X, the subsequence of the occurrences of x⊢ and ⊣x of w is well-balanced and unnested, namely, has the form (x⊢ · ⊣x)^k for some k ≥ 0.
Intuitively, any multiref-word w over Σ and X uniquely describes a document doc(w) and a multispan-tuple tup(w) as follows. First, let doc(w) be obtained from w by erasing all symbols from Γ_X. We note that, for every x ∈ X, every matching pair x⊢ and ⊣x in w (i. e., every occurrence of x⊢ and the following occurrence of ⊣x) uniquely describes a span of doc(w): ignoring all other occurrences of symbols from Γ_X, this pair encloses a factor of doc(w). Consequently, we simply define that, for every x ∈ X, tup(w)(x) contains all spans defined by matching pairs x⊢ and ⊣x of w. The property that the subsequence of w of the occurrences of x⊢ and ⊣x has the form (x⊢ · ⊣x)^k for some k ≥ 0 implies that all spans of tup(w)(x) are pairwise disjoint.
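The following Python sketch computes doc(w) and tup(w) for a multiref-word w; the token encoding ('open'/'close' pairs for x⊢/⊣x) is our own rendering:

```python
# A multiref-word as a list of tokens: plain letters, or ('open', x) /
# ('close', x) meta-symbols standing for the brackets of variable x.
def decode(w):
    doc, open_at, tup = [], {}, {}
    for tok in w:
        if isinstance(tok, tuple):
            kind, x = tok
            if kind == "open":
                open_at[x] = len(doc) + 1          # span starts at next position
            else:
                span = (open_at.pop(x), len(doc) + 1)
                tup.setdefault(x, set()).add(span)
        else:
            doc.append(tok)
    return "".join(doc), tup

# the multiref-word of Example 2:
w = ["b", ("open", "x"), ("open", "z"), "a", "a", ("close", "z"),
     ("open", "z"), "b", "b", ("close", "z"),
     ("open", "z"), "a", ("close", "z"), ("close", "x"), "b"]
print(decode(w))  # doc = 'baabbab', x -> {(2,7)}, z -> {(2,4),(4,6),(6,7)}
```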
Example 2.
Let us consider the following multiref-word over the finite alphabet Σ = {a, b} and variables X = {x, z}:
w = b · x⊢ · z⊢ · a · a · ⊣z · z⊢ · b · b · ⊣z · z⊢ · a · ⊣z · ⊣x · b.
By definition, w represents the document doc(w) = baabbab and the multispan-tuple tup(w) = t3 (from Example 1), i. e., tup(w)(x) = {[2, 7⟩} and tup(w)(z) = {[2, 4⟩, [4, 6⟩, [6, 7⟩}.
The advantage of the notion of multiref-words is that it allows us to easily describe both a document and some multispan-tuple over this document. Therefore, one can use any set of multiref-words to define a multispanner as follows. A multiref-language (over terminal alphabet Σ and variables X) is a set L of multiref-words over Σ and X. Any multiref-language L describes the multispanner ⟦L⟧ over Σ and X defined as follows. For every D ∈ Σ*: ⟦L⟧(D) := {tup(w) | w ∈ L and doc(w) = D}.
Analogously to classical spanners (cf. [39, 23, 21, 14, 41]), we define the class of regular multispanners as those multispanners P with P = ⟦L⟧ for some regular multiref-language L. Moreover, as done in [18] for classical spanners, we will use a class of regular expressions to define a subclass of regular multispanners that shall play a central role in the extraction phase of our extraction and transformation framework.
Regex multispanners
Let Σ be a finite alphabet and let X be a finite set of variables. We now define multispanner-expressions (over Σ and X). Roughly speaking, these expressions are a particular class of regular expressions for defining sets of multiref-words and therefore multispanners. A multispanner-expression (over Σ and X) satisfies the syntax:
α ::= ∅ | ε | σ | (α ∨ α) | (α · α) | α* | x{α}
for every σ ∈ Σ and x ∈ X such that x does not appear in α. Such a multispanner-expression naturally defines a set L(α) of multiref-words as follows: L(∅) := ∅, L(ε) := {ε}, L(σ) := {σ}, L(α1 ∨ α2) := L(α1) ∪ L(α2), L(α1 · α2) := L(α1) · L(α2), L(α*) := ⋃_{k ≥ 0} L(α)^k, and L(x{α}) := {x⊢} · L(α) · {⊣x}, where, for every L1, L2 ⊆ (Σ ∪ Γ_X)*, L1 · L2 := {w1 · w2 | w1 ∈ L1, w2 ∈ L2}, L^0 := {ε}, and L^k := L^{k−1} · L for every k ≥ 1. As usual, we use α+ as a shorthand for α · α*.
One can easily prove that any multispanner-expression α defines a multiref-language, since we do not allow expressions of the form x{α} whenever α mentions x. Thus, we can define the multispanner specified by α as ⟦α⟧ := ⟦L(α)⟧. Furthermore, we say that a multispanner P is a regex multispanner if P = ⟦α⟧ for some multispanner-expression α. Note that regex multispanners form a strict subclass of regular multispanners, similar to the class of regex spanners and regular spanners [18].
Example 3.
α_P := (ε ∨ Σ* · b) · x{z{a+} · (z{b+} · z{a+})*} · (ε ∨ b · Σ*), where Σ* abbreviates (a ∨ b)*, is a multispanner-expression with ⟦α_P⟧ being the multispanner P from Example 1; thus, P is a regex multispanner.
Comparison with classical spanners
Multispanners are designed to naturally extend the classical model of spanners from [18] to the setting where variables are mapped to sets of spans instead of single spans. Let us discuss a few particularities of our definitions.
We first note that since classical span-tuples, span-relations and spanners (in the sense of [18]) can be interpreted as special multispan-tuples, multispan-relations and multispanners, respectively, our framework properly extends the classical spanner framework.
A multispan-tuple t allows spans [i, j⟩ ∈ t(x) and [i', j'⟩ ∈ t(y) with i ≤ i' < j ≤ j' (and this kind of overlap is also possible for classical span-tuples). However, {[i, j⟩, [i', j'⟩} ⊆ t(x) with i < i' < j < j' is not possible for multispan-tuples, since then representing [i, j⟩ and [i', j'⟩ by parentheses in the document cannot be distinguished from representing [i, j'⟩ and [i', j⟩. Furthermore, for distinct spans of the same multispan t(x) we require that they are disjoint, which is motivated by the fact that without this restriction, the subsequence of all x⊢ and ⊣x occurrences could be an arbitrary well-formed parenthesised expression (instead of a sequence (x⊢ · ⊣x)^k); thus, recognising whether a given string over Σ ∪ Γ_X is a proper multiref-word could not be done by an NFA, as is the case for classical spanners.
Our main motivation for multispanners is that they can express information extraction tasks that are of interest in the context of our extract-transform framework (and that cannot be represented by classical spanners in a convenient way). However, there are other interesting properties of multispanners not related to their application in our extract-transform framework, which deserve further investigation. For example, if L1 and L2 are multiref-languages (i. e., ⟦L1⟧ and ⟦L2⟧ are multispanners), then L1 ∪ L2, L1 · L2 and L1* are also multiref-languages (and therefore ⟦L1 ∪ L2⟧, ⟦L1 · L2⟧ and ⟦L1*⟧ are also multispanners). For classical spanners, this is only true for the union. Consequently, multispanners show some robustness not provided by classical spanners.
3 The extract-transform framework
The setting
An extract-transform program (for short: ET program) is a pair (P, T) such that, for some finite alphabet Σ,
-
P is a multispanner (over Σ and some finite set X of variables), and
-
T is a function that, for every document D ∈ Σ*, maps the pair (D, P(D)) to the desired output.
In other words, given a document D, the multispanner P specifies how to extract the relevant data from D as a multispan-relation P(D), and T specifies how to transform the extracted data into new data. The output of T is application-dependent, and various contexts are conceivable, such as transforming the data into a relational database, a graph database, a single document, or a collection of documents.
One may wonder why we define the setting in two phases and why we do not simplify it into one phase. After all, the purpose of users is to transform data, and they may like to specify the transformation with a single query language. From a user perspective, we argue that it is useful to have a two-phase specification for several reasons. First, it is already the case that, in practice, people specify the transformation of documents with a two-phase approach. For example, search-and-replace is given by a pattern (i.e., P) and a replacement string (i.e., T). More generally, the so-called regex substitutions [33, 25] are specified by a regex (i.e., P) and a replacement pattern (i.e., T). Second, in a more general sense, data transformation today is performed between different data models by ETL programs [44] (for Extract-Transform-Load), where the extract and transform steps are specified by different languages with different purposes. We made the same decision here, where multispanners are well-suited for the extraction phase, and we leave open the language to be used for the transformation phase (which depends on the application). Last, one can argue that separating the process into extraction and transformation aids in decoupling the source from the target data, simplifying the overall specification for users. Indeed, an expert user in the source data can specify the multispanner P, whereas another expert user in the target data can provide the transformation T. Moreover, this separation allows that, whenever the middle schema (i.e., the variables X) is not changed, one can update P or T without modifying the other part.
From documents to bags of documents
In this work, we focus on the case where the transformation is given by a function T that maps each pair (D, t) with t ∈ P(D) to a document. This means that the transformation is fully determined by the function T on individual tuples. For a given document D, the output of an ET program E = (P, T) is a bag of documents defined as follows:
⟦E⟧(D) := {{ T(D, t) | t ∈ P(D) }},
where {{·}} denotes a bag, i. e., duplicates are kept.
In the introduction, we presented two examples of this setting, where the ET programs map documents into a bag of documents. Next, we provide another example to show that this setting also includes the search-and-replace scenario as a special case.
Example 4.
Let D be a document over the alphabet Σ = {a, b}. Suppose that a user wants to search for all contiguous sequences of b and replace each such sequence with a single b. For example, the user wants to convert abbaabbba into abaaba. In this scenario, the user wants to map the document to a single document. For this goal, we can extract all contiguous sequences of b in a single multispan-tuple with the following expression: α := a* · (z{b+} · a+)* · (ε ∨ z{b+}). The reader can note that the above expression will always capture a single multispan-tuple t, where t(z) contains the set of all spans of maximal contiguous sequences of b. Then, with a function T that replaces in D the value of each span of t(z) with a single b, the ET program (⟦α⟧, T) will output a single document with the expected result. Note that we are extracting all spans with a single variable (i.e., no intersection between spans), which allows us to define the meaning of “replace” easily. Arguably, this provides some evidence for the need of multispanners in such applications.
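A minimal Python sketch of this search-and-replace ET program (with re's 0-based half-open spans as our encoding, and all function names our own):

```python
import re

# The single multispan-tuple binds z to all maximal runs of b; the
# transformation replaces each captured run by one b.
def extract_b_runs(doc):
    # one tuple per document: z -> spans of all maximal b-runs (0-based)
    return {"z": [m.span() for m in re.finditer(r"b+", doc)]}

def transform(doc, t):
    out, last = [], 0
    for (i, j) in sorted(t["z"]):
        out.append(doc[last:i])   # keep the part before the run
        out.append("b")           # replace the whole run by a single b
        last = j
    out.append(doc[last:])
    return "".join(out)

doc = "abbaabbba"
print(transform(doc, extract_b_runs(doc)))   # abaaba
```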
Our decision to focus on bags (instead of sets) is motivated by the fact that each multispan-tuple represents a data item and its context in the document. Although T could map two data pieces to the same output document, a user may want to keep both copies, given that they come from different positions in the document. Let us motivate this briefly by a more practical example. Assume that the extraction phase marks the addresses in a list of customer orders, while the transformation phase then transforms these addresses into a format suitable for printing on the parcels to be shipped to the customers. In the likely situation that there are different orders by the same person, the extraction phase produces different markings that will all be transformed into the same address label. Such duplicates, however, are important, since we actually want to obtain an address label for each distinct order (i. e., distinct marking), even if these addresses are the same (although the orders are not). This means that the output sets of our ET programs should actually be bags for this scenario. Indeed, similar situations occur for relational data management systems (and database systems in general), where bag semantics is usually adopted [26]. Of course, one can also consider the scenario when the output is a single document or a set of documents. Both are interesting scenarios, and we leave them for future work (see Section 8 for further discussion).
The evaluation problem of ET-programs
The main technical goal of this paper is to identify a large class of ET programs for which, upon input of any document D, the output can be computed efficiently. Arguably, identifying such a large class of ET-programs is crucial for the foundations of extract-transform tasks. On the one hand, it determines which ET-programs can be used in practice and, on the other hand, it guides the design of query languages for the specifications of P and T, which depend on the application.
In this work, we study the case when T is given by a regular function, an expressive class of string-to-string functions that has several equivalent characterizations and good algorithmic properties. In the sequel, we show how to combine the result of a regex multispanner with a regular function, and we then present our approach to evaluate such ET programs efficiently.
4 Using regular functions for the transformation phase
Let us first present some more notation. As usual, f: A → B denotes a function from A to B. When f is partial, we write f: A ⇀ B and use f(a) = ⊥ to indicate that f is undefined for element a ∈ A. Moreover, dom(f) denotes the domain of f. Every partial function f: Σ* ⇀ Ω* is called a string-to-string function with alphabets Σ and Ω.
In this paper, we study the case when we use regular string-to-string functions for the transformation phase of ET programs. Let us explain why the class of regular functions is a good choice for our setting. As mentioned before, the class of regular functions forms a well-understood formalism for transforming strings which allows for several equivalent representations like two-way transducers, MSO transductions [16], regular transducer expressions [12], or deterministic streaming string transducers (DSSTs) [3] (this last formalism will be crucial for the rest of this paper). Furthermore, regular functions can express most of the linear transformations that one can find in practice. For example, all the use cases presented so far can be defined by using regular functions as our basis for the transformation phase of ET programs. Finally, regular functions have good algorithmic properties, like being closed under composition, a property that we will exploit later (see Section 7).
For using regular functions as our mechanism for the transformation phase, we first need to align the output of the extraction phase (i.e., a multispan-relation) with the input of a regular function (i.e., a string); that is, we must convert the multispan-tuples produced by a multispanner into strings. In the following, we present a unique way to encode multispan-tuples into multiref-words, which will serve as our input objects for regular functions.
A unique multiref-word representation
The representation of documents and multispan-tuples by multiref-words allows us to define multispanners by multiref-languages. But this representation is not unique, which is inconvenient if we want to use it for encoding multispan-tuples as strings. As a remedy, we adopt the following approach: We represent a document D and a multispan-tuple t as a multiref-word such that every factor that belongs to Γ_X* (i.e., every factor between two letters of Σ) has the form:
σ_i(x_1) · σ_i(x_2) ⋯ σ_i(x_n).
Specifically, let D be a document and t a multispan-tuple over D and variables X. For a variable x and i ∈ [|D| + 1], define σ_i(x) := c · e · o, where c = ⊣x iff [j, i⟩ ∈ t(x) for some j < i (and c = ε otherwise); e = x⊢ · ⊣x iff [i, i⟩ ∈ t(x); and o = x⊢ iff [i, j⟩ ∈ t(x) for some j > i. E. g., for the tuple t3 in Example 1, we have that σ_2(x) = x⊢ and σ_4(z) = ⊣z · z⊢.
Let X = {x_1, …, x_n} with x_1 < x_2 < … < x_n, where < is some fixed linear order on X. For every i ∈ [|D| + 1], we define w_i := σ_i(x_1) · σ_i(x_2) ⋯ σ_i(x_n). We can then define the encoding of D and t as the multiref-word:
enc(D, t) := w_1 · D[1] · w_2 · D[2] ⋯ w_{|D|} · D[|D|] · w_{|D|+1}.
Coming back to Example 1, we have enc(D, t3) = b · x⊢ · z⊢ · a · a · ⊣z · z⊢ · b · b · ⊣z · z⊢ · a · ⊣x · ⊣z · b by using the order x < z. Note that enc(D, t3) is a multiref-word, doc(enc(D, t3)) = D and tup(enc(D, t3)) = t3. Thus, enc(D, t) is a correct and unique encoding for D and t as a multiref-word.
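The following Python sketch computes enc(D, t), rendering the brackets x⊢ and ⊣x as the strings '<x' and 'x>' (this rendering is our own):

```python
# Canonical encoding enc(D, t); spans are 1-based pairs (i, j), and
# `order` fixes the linear order on the variables.
def enc(doc, tup, order):
    def sigma(x, i):
        spans = tup.get(x, set())
        s = ""
        if any(j == i and k < i for (k, j) in spans):
            s += ">" + x                    # a non-empty span ends at i
        if (i, i) in spans:
            s += "<" + x + ">" + x          # the empty span [i, i>
        if any(k == i and j > i for (k, j) in spans):
            s += "<" + x                    # a non-empty span starts at i
        return s
    out = []
    for i in range(1, len(doc) + 2):        # positions 1 .. |D|+1
        out.append("".join(sigma(x, i) for x in order))
        if i <= len(doc):
            out.append(doc[i - 1])
    return "".join(out)

doc = "baabbab"
t3 = {"x": {(2, 7)}, "z": {(2, 4), (4, 6), (6, 7)}}
print(enc(doc, t3, ["x", "z"]))   # b<x<zaa>z<zbb>z<za>x>zb
```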
Regular ET-programs
We are now ready to formally define the following class of ET programs. An ET program (P, T) is called a regular ET program (from Σ to Ω) if P is a regex multispanner specified by some multispanner-expression α over Σ and some finite set X of variables, and T is specified by a regular string-to-string function f with input alphabet Σ ∪ Γ_X and output alphabet Ω. The result of E = (P, T) on a document D ∈ Σ* is defined as
⟦E⟧(D) := {{ f(enc(D, t)) | t ∈ P(D) }}.
Note that this definition of ⟦E⟧(D) means that we first apply the spanner ⟦α⟧ specified by α on D, which produces a multispan-relation ⟦α⟧(D). Then, for every multispan-tuple t ∈ ⟦α⟧(D), we apply f on the multiref-word enc(D, t), producing a new document in the output bag ⟦E⟧(D). Note that while enc is injective, the function f might not be. Hence, for t, t' ∈ ⟦α⟧(D) with t ≠ t' we might have f(enc(D, t)) = f(enc(D, t')). This leads to duplicates, but, as explained before, we keep them, since we want each distinct multispan-tuple of the extraction phase to correspond to a transformed document in the output.
So far, we have introduced our object of study, regular ET-programs, but we still need to define how to specify them; in particular, how to specify the regular string-to-string function f. Regular string-to-string functions are a robust class that admits several possible representations, and logic-based languages have been proposed to specify them, like MSO transductions and regular transducer expressions [10]. In particular, one can use any of them as the basis for a declarative query language for specifying T. In this paper, we concentrate on the evaluation problem of regular ET-programs. Toward this goal, we use deterministic streaming string transducers (DSSTs, [3]) as our model for defining regular functions.
Deterministic streaming string transducers
Let R be a finite set of registers and Ω a finite alphabet. An assignment is a partial function σ: R ⇀ (R ∪ Ω)* that assigns to each register a string of registers and letters from Ω. We define the extension σ̂: (R ∪ Ω)* ⇀ (R ∪ Ω)* of an assignment σ such that σ̂(w) := u_1 · u_2 ⋯ u_n for every string w = b_1 · b_2 ⋯ b_n, where u_i := σ(b_i) if b_i ∈ R and u_i := b_i if b_i ∈ Ω, for every i ∈ [n]. Further, we assume that σ̂(w) is undefined iff σ(b_i) is undefined for some b_i. Given two assignments σ_1 and σ_2, we define their composition σ_1 ∘ σ_2 as a new assignment such that (σ_1 ∘ σ_2)(r) := σ̂_1(σ_2(r)). We say that an assignment σ is copyless if, for every r ∈ R, there is at most one occurrence of r in all the strings σ(r') with r' ∈ dom(σ).
Example 5.
Consider R = {r_1, r_2} and Ω = {a, b}, and the following assignments σ_1, σ_2, σ_3, where we write r ↦ w to mean σ(r) = w: σ_1 is defined via r_1 ↦ a · r_1; r_2 ↦ r_2 · b. σ_2 is defined via r_1 ↦ r_1 · r_2; r_2 ↦ b. σ_3 is defined via r_1 ↦ r_1 · a · r_1; r_2 ↦ ε. One can check that (σ_1 ∘ σ_2)(r_1) = σ̂_1(r_1 · r_2) = a · r_1 · r_2 · b and (σ_1 ∘ σ_2)(r_2) = b. Also, σ_1 and σ_2 are copyless assignments, but σ_3 is not (r_1 occurs twice in σ_3(r_1)).
Note that copyless assignments are closed under composition, namely, if σ_1 and σ_2 are copyless assignments, then σ_1 ∘ σ_2 is copyless as well. We denote by Asg(R, Ω) the set of all copyless assignments over registers R and alphabet Ω.
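A small Python sketch of assignments, their composition, and the copyless check (representing an assignment as a dictionary from registers to token lists; this encoding is our own):

```python
def extend(sigma, word):
    """sigma-hat: substitute registers in `word`, keep output letters."""
    out = []
    for tok in word:
        out.extend(sigma[tok] if tok in sigma else [tok])
    return out

def compose(sigma1, sigma2):
    """(sigma1 . sigma2)(r) = sigma1-hat(sigma2(r)): sigma1 is applied first."""
    return {r: extend(sigma1, w) for r, w in sigma2.items()}

def copyless(sigma):
    uses = [tok for w in sigma.values() for tok in w if tok in sigma]
    return len(uses) == len(set(uses))   # each register used at most once

s1 = {"r1": ["a", "r1"], "r2": ["r2", "b"]}     # sigma_1 of Example 5
s2 = {"r1": ["r1", "r2"], "r2": ["b"]}          # sigma_2 of Example 5
print(compose(s1, s2))        # {'r1': ['a','r1','r2','b'], 'r2': ['b']}
print(copyless(compose(s1, s2)))                # True: closure under composition
print(copyless({"r1": ["r1", "a", "r1"]}))      # False: r1 is copied (sigma_3)
```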
A deterministic streaming string transducer (DSST) [3] is a tuple T = (Q, Σ, Ω, R, δ, q_0, F), where Q is a finite set of states, Σ is the input alphabet, Ω is the output alphabet, R is a finite set of registers, δ: Q × Σ ⇀ Q × Asg(R, Ω) is the transition function, q_0 ∈ Q is the initial state, and F: Q ⇀ (R ∪ Ω)* is a final partial function such that, for every q ∈ dom(F) and r ∈ R, r appears at most once in F(q). Intuitively, if F(q) is defined, then q is a final (i.e., accepting) state.
A configuration of a DSST is a pair (q, ν), where q ∈ Q and ν ∈ Val(R, Ω), and Val(R, Ω) is the set of all assignments ν: R → Ω*, which are called valuations. A run ρ of T over a string w = a_1 · a_2 ⋯ a_n is a sequence of configurations of the form:
(q_1, ν_1) →^{a_1} (q_2, ν_2) →^{a_2} ⋯ →^{a_n} (q_{n+1}, ν_{n+1})    (4)
such that q_1 = q_0, ν_1 is the empty assignment, i. e. ν_1(r) = ε for every r ∈ R, and δ(q_i, a_i) = (q_{i+1}, σ_i) with ν_{i+1} = ν_i ∘ σ_i for every i ∈ [n]. A run ρ is called an accepting run if F(q_{n+1}) is defined. The output of an accepting run ρ is defined as the string ν̂_{n+1}(F(q_{n+1})).
Since δ is a partial function, we can see that DSSTs are deterministic, i.e., for every input word w, there exists at most one run over w. Thus, every DSST T defines a string-to-string function ⟦T⟧: Σ* ⇀ Ω* such that ⟦T⟧(w) = ν̂_{n+1}(F(q_{n+1})) iff ρ is the run of T over w and ρ is accepting.
Example 6.
Recall the first example from the introduction, where the spanner extracts span-tuples t where t(x) and t(y) refer to the spans containing a person’s last and first name, respectively. One can construct a DSST that receives as input a multiref-word for (D, t) of the form u_1 · x⊢ · u_2 · ⊣x · u_3 · y⊢ · u_4 · ⊣y · u_5 with u_2, u_4 ∈ Γ* (recall that Γ contains neither “#” nor “;”), and which outputs the word u_4 followed by a blank and u_2: while scanning the input, it copies u_2 into a register r_x and u_4 into a register r_y, and the final output is r_y, a blank, and r_x, where transitions without annotations (i.e., assignments) use the identity assignment.
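As an illustration, the following Python sketch simulates a run of such a DSST on a tokenized multiref-word; the token rendering '<x', 'x>' and all names are our own, and the finite state together with append-only registers mimics the copyless register updates:

```python
# Registers rx and ry store the values of x and y; the final output
# concatenates them in swapped order, as F(q) = ry · ' ' · rx.
def run_name_swap(w):
    state, nu = "scan", {"rx": [], "ry": []}
    for tok in w:
        if tok in ("<x", "<y"):    state = "in_" + tok[1]   # start copying
        elif tok in ("x>", "y>"):  state = "scan"           # stop copying
        elif state == "in_x":      nu["rx"].append(tok)     # rx := rx · tok
        elif state == "in_y":      nu["ry"].append(tok)     # ry := ry · tok
        # in state 'scan', letters are read without updating any register
    return "".join(nu["ry"]) + " " + "".join(nu["rx"])

w = ["#", " ", "<x"] + list("Holiday") + ["x>", ";", " ", "<y"] \
    + list("Billie") + ["y>"] + list("; USA #")
print(run_name_swap(w))   # Billie Holiday
```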
DSSTs define a well-behaved class of string-to-string functions, equivalent to the class of regular string-to-string functions (see [10, 3]). In other words, we can use DSSTs as our computational model for specifying the function T, since all other formalisms for declaring regular functions can be effectively compiled into DSSTs (e.g., MSO transductions or regular transducer expressions). Moreover, DSSTs perform a single pass over the input string, while other equivalent models (e.g., two-way transducers) may do several passes. This streaming behavior of DSSTs will be very useful for designing algorithms for regular ET programs. For this reason, we use DSSTs as our model of string-to-string functions for regular ET programs.
5 Expressiveness of regular ET programs
Nondeterministic SST
We extend DSSTs to the nondeterministic case and bag semantics. Let us explain the model by pointing out its differences to DSSTs. (Note that nondeterministic streaming string transducers with set semantics instead of bag semantics have already been introduced in [4].)
A nondeterministic streaming string transducer (NSST) is a tuple T = (Q, Σ, Ω, R, Δ, I, F), where Q, Σ, Ω, R, and F have the same meaning as for DSSTs. The partial function I: Q ⇀ Val(R, Ω) plays the role of the initial state, i. e., a state q is a possible initial state if I(q) is defined, and in this case I(q) is an initial valuation of the registers. Moreover, Δ is a bag of elements from Q × Σ × Q × Asg(R, Ω). Note that we define Δ as a bag and not as a set for technical reasons (e.g., the proof of Proposition 10 or Theorem 12).
The semantics of the model is as expected. A run ρ over a string w = a_1 ⋯ a_n is a sequence of configurations of the form (4) such that I(q_1) is defined with ν_1 = I(q_1), and (q_i, a_i, q_{i+1}, σ_i) ∈ Δ with ν_{i+1} = ν_i ∘ σ_i for every i ∈ [n]. As for DSSTs, ρ is an accepting run if F(q_{n+1}) is defined, and the output out(ρ) of an accepting run ρ is the string ν̂_{n+1}(F(q_{n+1})). We define Runs(T, w) as the bag of all accepting runs of T over w.
Finally, we define the semantics of an NSST T over a string w as the bag:
⟦T⟧(w) := {{ out(ρ) | ρ ∈ Runs(T, w) }}.
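To make the bag semantics concrete, here is a small Python sketch of an NSST evaluator (all encodings and names, such as nsst_outputs, are our own): transitions are listed with their multiplicities, and the result is the bag of outputs over all accepting runs.

```python
from collections import Counter

def extend_val(nu, word):
    # substitute registers of the valuation `nu` in `word`, keep letters
    return [c for tok in word for c in (nu[tok] if tok in nu else [tok])]

def nsst_outputs(w, delta, init, final):
    # delta: (state, letter) -> list of (next_state, assignment), listed
    # with multiplicities; init: state -> initial valuation;
    # final: state -> output word over registers and letters
    runs = [(q, dict(nu)) for q, nu in init.items()]
    for a in w:
        runs = [(p, {r: extend_val(nu, rhs) for r, rhs in sig.items()})
                for q, nu in runs
                for (p, sig) in delta.get((q, a), [])]
    return Counter("".join(extend_val(nu, final[q]))
                   for q, nu in runs if q in final)

# two nondeterministic choices on 'a': the same transition listed twice
delta = {("q", "a"): [("q", {"r": ["r", "a"]}), ("q", {"r": ["r", "a"]})]}
print(nsst_outputs("aa", delta, {"q": {"r": []}}, {"q": ["r"]}))
# Counter({'aa': 4}) -- four accepting runs, all with the same output
```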
Equivalence of regular ET programs and NSSTs
We show that regular ET programs and NSSTs are equivalent. This is a fundamental insight with respect to the expressive power of ET programs. Moreover, the fact that the two-stage model of regular ET programs can be described by a single NSST will be important for our enumeration algorithm (Section 6) and their composition (Section 7).
Before going into the technical details, one may wonder why regular ET programs have two phases when this characterization of the expressive power shows that we can do the whole process with a single NSST. As we already argued in Section 3, the two-phase process is designed from a user perspective: it aids the declaration of the whole process and helps with the future reuse of the specifications. In contrast, the one-phase process of NSSTs is helpful from an algorithmic perspective, where we want to evaluate the ET program as a single process and in a single pass.
We first discuss how regular ET programs can be transformed into NSSTs. Every multiref-word w over Σ and X describes a document doc(w) and a multispan-tuple tup(w). Hence, we can extend the encoding enc to multiref-words by setting enc(w) := enc(doc(w), tup(w)). Intuitively speaking, applying the function enc to a multiref-word w simply means that every maximal factor of w over Γ_X is re-ordered according to the order on X, and superfluous matching brackets are removed (since several matching pairs x⊢ · ⊣x in the same maximal factor over Γ_X would describe the same empty span several times). We define enc(L) := {enc(w) | w ∈ L} for multiref-languages L.
Let us consider a regular ET program E = (P, T), i. e., P is a regex multispanner represented by some multispanner-expression α, and T is a regular function represented by some DSST M. The high-level idea of the proof is to construct an NSST that simulates a DFA A for enc(L(α)) and the DSST M in parallel. More precisely, we read an input D ∈ Σ*, but between reading a symbol and the next symbol, we pretend to read a sequence of symbols from Γ_X with A and M at the same time. Thus, we virtually read some multiref-word w with the property doc(w) = D. We need the DSST M for producing an output on that multiref-word, and we need the DFA A to make sure that the virtual multiref-word has the form enc(D, t) for some t ∈ P(D).
One may wonder why we want A to be a DFA rather than an NFA. The reason is: Having to deal with bag semantics (rather than just set semantics) makes things more complicated. In particular, this means that we need A to be deterministic in order to have a one-to-one correspondence between the accepting runs of the constructed NSST and the accepting runs of M on the corresponding multiref-word. Otherwise, if A was an NFA, the different possible accepting paths of A on the same multiref-word would translate into different accepting paths of the NSST, which would cause erroneous duplicates in the output bag.
We can transform α into a DFA A for enc(L(α)) by standard automata constructions, but the fact that A needs to be deterministic means that the construction is (one-fold) exponential in |α|, and the fact that it has to accept enc(L(α)) and not just L(α) means that the construction is also (one-fold) exponential in |X|. In summary, we obtain the following.
Theorem 7.
Given a regex multispanner P over Σ and X (represented by a multispanner-expression α), and a regular string-to-string function f with input alphabet Σ ∪ Γ_X (represented by a DSST M), we can construct an NSST T with ⟦T⟧(D) = ⟦(P, f)⟧(D) for every document D ∈ Σ*, in time polynomial in the size of M and exponential in |α| and |X|.
Let us now move on to representing general NSSTs by ET programs. When an NSST T is in a state q and reads a symbol a, then the number of possible nondeterministic choices it can make is the sum of the multiplicities of all elements of the form (q, a, q', σ) in Δ (recall that Δ is the bag of transitions). We shall call this number the nondeterministic branching factor of q and a. The nondeterministic branching factor of T is the maximum over all the nondeterministic branching factors of q and a for all q ∈ Q and a ∈ Σ.
The only obstacle in the simulation of an NSST by a DSST is that the latter cannot deal with the nondeterministic choices. However, in regular ET programs, a DSST gets an annotated version of the actual document as input, i. e., a multiref-word w with doc(w) = D. Consequently, the DSST could interpret the additional information given in the form of the symbols from Γ_X as information that determines which nondeterministic choices it should make. More formally, we can construct a regex multispanner over variables x_1, …, x_ℓ that, for every position of the input document, nondeterministically chooses some i ∈ [ℓ], where ℓ is the NSST's nondeterministic branching factor, and puts the corresponding empty span into t(x_i). On the level of multiref-words, this simply means that every symbol from Σ is preceded by x_i⊢ · ⊣x_i for some i ∈ [ℓ]. Such an occurrence of x_i⊢ · ⊣x_i can then be interpreted by the DSST as an instruction to select the i-th nondeterministic choice when processing the next symbol from Σ. In summary, we obtain the following result.
Theorem 8.
Given an NSST T with input alphabet Σ and nondeterministic branching factor ℓ, we can construct a regex multispanner P over Σ and X = {x_1, …, x_ℓ} and a regular function f such that ⟦(P, f)⟧(w) = ⟦T⟧(w) for every w ∈ Σ*. Moreover, P is represented by a multispanner-expression, f is represented by a DSST, and both P and f can be constructed in time polynomial in the size of T.
6 Evaluation of regular ET programs
In this section, we present the main technical result of the paper, regarding the evaluation of regular ET programs. Specifically, we consider the following enumeration problem for an ET program (P, T), where P is a regex multispanner and T is specified by a regular function:
Problem: Enum(P, T)
Input: A document D
Output: Enumerate ⟦(P, T)⟧(D)
Notice that ⟦(P, T)⟧(D) is a bag; thus, the task is to produce an enumeration that contains each element of the bag exactly once, counting multiplicities; e. g., D_1, D_2, D_1 is a possible enumeration of the bag {{D_1, D_1, D_2}}.
As usual, we measure the running time in data complexity, namely, we assume that P and T are fixed. Given this assumption, we can assume that P is given as a multispanner-expression, and T as a DSST. Otherwise, we can convert P and T to satisfy this requirement.
For this problem, we strive for an enumeration algorithm with linear preprocessing and output-linear delay, i. e., in a preprocessing phase it receives the input D and produces some data structure DS which encodes the expected output, and in the following enumeration phase it produces a sequential enumeration D_1, D_2, …, D_m of the results from DS. Moreover, the time for the preprocessing phase is in O(|D|), the time for producing D_1 is at most c · (|D_1| + 1), and the time between producing D_i and D_{i+1} is at most c · (|D_{i+1}| + 1), for some fixed constant c that does not depend on the input. As is common [43], we assume the computational model of Random Access Machines (RAM) with uniform cost measure and addition and subtraction as basic operations [1]. We obtain the following result.
Theorem 9.
Enum(P, T) admits an enumeration algorithm with linear preprocessing time and output-linear delay.
Due to space restrictions, all the details and the analysis of the enumeration algorithm are deferred to the online version [38]. In the following, we highlight the main technical challenges. For running the algorithm, we use Theorem 7 and convert the pair (P, T) into an NSST T. This takes time exponential in the size of (P, T); nevertheless, this cost does not depend on D, and so we can consider it as constant time. Our enumeration algorithm then aims for computing ⟦T⟧(D) with linear time preprocessing and output-linear delay.
For evaluating T over D, the first challenge that we need to overcome is that its runs could maintain registers with content that is not used at the end of the run. For an illustration, consider an NSST over input and output alphabet {a, b} with two registers r_a and r_b, which appends a to r_a for each a-symbol read and b to r_b for each b-symbol read. For each input word w with i a-symbols and j b-symbols, the NSST outputs a^i if w ends with a, and b^j if w ends with b. Consequently, every run on a word that ends with a produces “garbage” in register r_b, since the content of this register is not used for the output (and analogously with register r_a for inputs that end with b). This behavior of storing “garbage” will be problematic for our enumeration approach, given that the delay depends on the (potentially useless) contents of the registers.
Given the above discussion, we formalize the notion of “garbage” as follows. Consider an NSST T = (Q, Σ, Ω, R, Δ, I, F). For w ∈ (R ∪ Ω)*, let reg(w) be the set of all registers that appear in w. For an assignment σ, let reg(σ) := ⋃_{r ∈ dom(σ)} reg(σ(r)), namely, reg(σ) is the set of all registers used by σ. We say that T is garbage-free if, and only if, for every string w ∈ Σ* and every accepting run of the form (4) it holds that reg(σ_i) = dom(ν_i) for every i ∈ [n], and reg(F(q_{n+1})) = dom(ν_{n+1}). In other words, the registers that we have filled with content so far coincide with the registers that we use on the right-hand sides of the next assignment. The first challenge is to show how to make NSSTs garbage-free.
Proposition 10.
For every NSST T, there exists a garbage-free NSST T' such that ⟦T'⟧(w) = ⟦T⟧(w) for every string w.
The construction of Proposition 10 causes a blow-up that is exponential in the number of registers, since we turn an NSST with state set Q and registers R into an NSST with at most |Q| · 2^{|R|} states. Of course, if we start with a garbage-free NSST, this blow-up can be avoided. Interestingly, we can show that one can check the garbage-free property in polynomial time.
Proposition 11.
Given an NSST T, we can decide in polynomial time whether T is garbage-free.
The second challenge is maintaining the set of outputs compactly for enumerating them. The main problem here is that the output of a run is not produced linearly as a single string (as is the case for classical spanners [8, 6]) but instead in parallel on different registers that are combined in some order at the end of the input. To solve this, we follow the approach in [30, 31] and present a non-trivial extension of Enumerable Compact Sets (ECS), a data structure that stores sets of strings compactly and retrieves them with output-linear delay. We modify ECS for storing sets of valuations, and we call it ECS with assignments. For storing valuations, we use assignments in the internal nodes of the data structure, which allows us to encode the runs' outputs and keep the content of different registers synchronized.
Using assignments in the ECS requires revisiting the whole approach of the data structure. For example, although we can eliminate the garbage from an NSST (cf. Proposition 10), the machine can still swap and move registers during a run, producing assignments that permute the registers' content but do not contribute to a final output. We call these assignments relabelings, and one of the main challenges is to take care of them. To solve this, we have to treat them as special objects and compact them in the data structure whenever possible. These extensions require revisiting ECS and providing new ways for extending or unifying sets of valuations while also taking care of relabelings.
7 Composition of regular ET programs
A valuable property of regular ET programs is that they receive documents as input and produce documents as output. In this line, it is natural to think of reusing the outputs of one ET program E_1 to feed a second ET program E_2, namely, to compose them. Formally, we define the composition of E_1 and E_2 as a function that maps documents to a bag of documents such that for every document D:
(⟦E_2⟧ ∘ ⟦E_1⟧)(D) := ⊎_{D' ∈ ⟦E_1⟧(D)} ⟦E_2⟧(D').
Note that we use here the union ⊎ of bags, which is defined in the standard way: multiplicities add up, and the union ranges over all occurrences of D' in the bag ⟦E_1⟧(D).
For extraction and transformation of information, it is useful to evaluate the composition efficiently. One naive approach is to evaluate ⟦E_1⟧(D) (e.g., by using the evaluation algorithm of Section 6 in case that E_1 is a regular ET program), and for every output D' in ⟦E_1⟧(D) compute ⟦E_2⟧(D'), gathering all the outputs together. Of course, this could be time-consuming, since the size of ⟦E_1⟧(D) could be exponential in |D| in the worst case.
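For illustration, the naive approach just described can be sketched in a few lines of Python (ev1 and ev2 stand for evaluators of the two ET programs, returning bags as Counters; all names are our own):

```python
from collections import Counter

# Naive composition semantics (E2 after E1): run E1, then run E2 on each
# intermediate document, taking the bag union of all outputs.
def compose_eval(ev1, ev2, doc):
    result = Counter()
    for mid_doc, mult in ev1(doc).items():    # ev returns a bag as a Counter
        for out_doc, m in ev2(mid_doc).items():
            result[out_doc] += mult * m       # bag union adds multiplicities
    return result
```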
Towards solving the previous algorithmic problem, in the next result we show that every composition of NSSTs can be defined by an NSST. Formally, let us denote the input and output alphabets of an NSST T by Σ_T and Ω_T, respectively. Given two NSSTs T_1 and T_2 such that Ω_{T_1} = Σ_{T_2}, we define the composition T_2 ∘ T_1 as the function from documents to bags of documents: (T_2 ∘ T_1)(w) := ⊎_{w' ∈ ⟦T_1⟧(w)} ⟦T_2⟧(w').
Theorem 12.
For every pair of NSSTs T_1 and T_2 such that Ω_{T_1} = Σ_{T_2}, there exists an NSST T such that Σ_T = Σ_{T_1}, Ω_T = Ω_{T_2}, and ⟦T⟧(w) = (T_2 ∘ T_1)(w) for every w ∈ Σ_T*.
The statement of Theorem 12 for set semantics rather than bag semantics was obtained by [4, 5]. The novelty of Theorem 12 is that we extend the result to bag semantics; namely, we need to maintain the multiplicities of the final outputs correctly. The proof revisits (and simplifies) the construction in [4, 5]: we simulate T_1 over the input while T_2 runs over the registers' content of T_1, compactly representing subruns of T_2 by using pairs of states and assignment summaries [3]. The extension requires two changes for working under bag semantics. First, similar to the evaluation of NSSTs, garbage on T_1-registers could generate additional runs of T_2, producing outputs with wrong multiplicities. Therefore, our first step is to remove this garbage by constructing equivalent garbage-free NSSTs by using Proposition 10. Second, the construction in [4] guesses subruns of T_2 for each register content and each starting state. In the end, we will use one of these guesses, and then unused subruns could modify the output multiplicity. Therefore, we simplify the construction in [4] by guessing a single subrun for each register content, using the non-determinism to discard wrong guesses. Interestingly, this simplification overcomes the error pointed out by Joost Engelfriet in the construction of [4], which was solved recently in [5]. Our construction does not need the machinery of [5]; thus, an adaptation of our construction for NSSTs with set semantics can be considered as a simplified version of the proof in [4, 5]. We conclude by mentioning that, as an alternative proof of Theorem 12, we could move to MSO transductions [16], extend the logic with bag semantics, and then do the composition at the logical level; however, this strategy would lead to a non-elementary blow-up.
By combining Theorems 7, 12, and 8, we get the following corollary regarding the expressiveness of regular ET programs and their composition.
Corollary 13.
For every pair of regular ET programs E_1 and E_2, there exists a regular ET program E such that ⟦E⟧(D) = (⟦E_2⟧ ∘ ⟦E_1⟧)(D) for every document D.
We conclude this section with the following corollary regarding the evaluation of the composition of regular ET programs, obtained by combining Theorems 12 and 9.
Corollary 14.
Given regular ET programs E_1, …, E_k, we can evaluate the composition E_k ∘ ⋯ ∘ E_1 with linear time preprocessing and output-linear delay in data complexity.
Finally, it is interesting to note that one could encode a multispanner P as an NSST that, on input D, directly outputs enc(D, t) for each multispan-tuple t ∈ P(D). Then Theorem 7 can be seen as a corollary of Theorem 12. However, the complexity of constructing an NSST from a regular ET program by using Theorem 12 would be worse than in Theorem 7.
8 Future Work
There are further research questions directly motivated by our results. First, the framework of multispanners deserves further investigation to understand which results and properties of classical spanners directly carry over to multispanners. Along the same lines, it will be interesting to understand the connections between classical spanners and subclasses of NSSTs. Second, one could consider other formalisms for the transformation phase, like the class of polyregular functions [10], which extends regular functions from linear to polynomial growth. Third, one can consider other algorithmic problems for the output, such as enumerating a set instead of a bag of results. Here, our algorithmic technique does not extend directly, since it would require removing duplicates in the NSST or during the evaluation, and it is unclear how to do this efficiently. Last, the general setting allows for other application scenarios for the transformation phase, like producing graphs or relations. Furthermore, one can also extend the setting to consider input data with more structure, like nested documents (e.g., JSON, XML). We leave this and other open problems for future work.
References
- [1] Alfred V Aho and John E Hopcroft. The design and analysis of computer algorithms. Pearson Education India, 1974.
- [2] Rajeev Alur, Mikołaj Bojańczyk, Emmanuel Filiot, Anca Muscholl, and Sarah Winter. Regular Transformations (Dagstuhl Seminar 23202). Dagstuhl Reports, 13(5):96–113, 2023. doi:10.4230/DAGREP.13.5.96.
- [3] Rajeev Alur and Pavol Cerný. Expressiveness of streaming string transducers. In FSTTCS, pages 1–12, 2010. doi:10.4230/LIPICS.FSTTCS.2010.1.
- [4] Rajeev Alur and Jyotirmoy V. Deshmukh. Nondeterministic streaming string transducers. In ICALP, volume 6756, pages 1–20, 2011. doi:10.1007/978-3-642-22012-8_1.
- [5] Rajeev Alur, Taylor Dohmen, and Ashutosh Trivedi. Composing copyless streaming string transducers. CoRR, abs/2209.05448, 2022. doi:10.48550/arXiv.2209.05448.
- [6] Antoine Amarilli, Pierre Bourhis, Stefan Mengel, and Matthias Niewerth. Enumeration on trees with tractable combined complexity and efficient updates. In PODS, pages 89–103. ACM, 2019. doi:10.1145/3294052.3319702.
- [7] Antoine Amarilli, Pierre Bourhis, Stefan Mengel, and Matthias Niewerth. Constant-delay enumeration for nondeterministic document spanners. SIGMOD Rec., 49(1):25–32, 2020. doi:10.1145/3422648.3422655.
- [8] Antoine Amarilli, Pierre Bourhis, Stefan Mengel, and Matthias Niewerth. Constant-delay enumeration for nondeterministic document spanners. ACM Transactions on Database Systems (TODS), 46(1):1–30, 2021. doi:10.1145/3436487.
- [9] Jean Berstel. Transductions and context-free languages. Springer-Verlag, 2013.
- [10] Mikolaj Bojanczyk. Transducers of polynomial growth. In LICS, pages 1:1–1:27. ACM, 2022. doi:10.1145/3531130.3533326.
- [11] Pierre Bourhis, Alejandro Grez, Louis Jachiet, and Cristian Riveros. Ranked enumeration of MSO logic on words. In 24th International Conference on Database Theory, ICDT 2021, March 23-26, 2021, Nicosia, Cyprus, pages 20:1–20:19, 2021. doi:10.4230/LIPICS.ICDT.2021.20.
- [12] Vrunda Dave, Paul Gastin, and Shankara Narayanan Krishna. Regular transducer expressions for regular transformations. In Proceedings of the 33rd Annual ACM/IEEE Symposium on Logic in Computer Science, pages 315–324, 2018. doi:10.1145/3209108.3209182.
- [13] Johannes Doleschal, Benny Kimelfeld, and Wim Martens. The complexity of aggregates over extractions by regular expressions. Logical Methods in Computer Science, 19(3), 2023. doi:10.46298/LMCS-19(3:12)2023.
- [14] Johannes Doleschal, Benny Kimelfeld, Wim Martens, Yoav Nahshon, and Frank Neven. Split-correctness in information extraction. In Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 149–163, 2019. doi:10.1145/3294052.3319684.
- [15] Johannes Doleschal, Benny Kimelfeld, Wim Martens, and Liat Peterfreund. Weight annotation in information extraction. Logical Methods in Computer Science, 18, 2022. doi:10.46298/LMCS-18(1:21)2022.
- [16] Joost Engelfriet and Hendrik Jan Hoogeboom. MSO definable string transductions and two-way finite-state transducers. ACM Trans. Comput. Log., 2(2):216–254, 2001. doi:10.1145/371316.371512.
- [17] Ronald Fagin, Benny Kimelfeld, Frederick Reiss, and Stijn Vansummeren. Spanners: a formal framework for information extraction. In PODS, pages 37–48. ACM, 2013. doi:10.1145/2463664.2463665.
- [18] Ronald Fagin, Benny Kimelfeld, Frederick Reiss, and Stijn Vansummeren. Document spanners: A formal approach to information extraction. Journal of the ACM (JACM), 62(2):1–51, 2015. doi:10.1145/2699442.
- [19] Fernando Florenzano, Cristian Riveros, Martín Ugarte, Stijn Vansummeren, and Domagoj Vrgoč. Efficient enumeration algorithms for regular document spanners. ACM Transactions on Database Systems (TODS), 45(1):1–42, 2020. doi:10.1145/3351451.
- [20] Dominik Freydenberger and Mario Holldack. Document spanners: From expressive power to decision problems. Theory of Computing Systems, 62:854–898, 2018. doi:10.1007/S00224-017-9770-0.
- [21] Dominik D. Freydenberger. A logic for document spanners. Theory Comput. Syst., 63(7):1679–1754, 2019. doi:10.1007/S00224-018-9874-1.
- [22] Dominik D. Freydenberger, Benny Kimelfeld, and Liat Peterfreund. Joining extractions of regular expressions. In Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Houston, TX, USA, June 10-15, 2018, pages 137–149, 2018. doi:10.1145/3196959.3196967.
- [23] Dominik D. Freydenberger and Sam M. Thompson. Dynamic complexity of document spanners. In 23rd International Conference on Database Theory, ICDT 2020, March 30-April 2, 2020, Copenhagen, Denmark, pages 11:1–11:21, 2020. doi:10.4230/LIPICS.ICDT.2020.11.
- [24] Dominik D. Freydenberger and Sam M. Thompson. Splitting spanner atoms: A tool for acyclic core spanners. In 25th International Conference on Database Theory, ICDT 2022, March 29 to April 1, 2022, Edinburgh, UK (Virtual Conference), pages 10:1–10:18, 2022. doi:10.4230/LIPIcs.ICDT.2022.10.
- [25] Jeffrey Friedl. Mastering regular expressions. O’Reilly Media, Inc., 2006.
- [26] Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom. Database systems - the complete book (2. ed.). Pearson Education, 2009.
- [27] Jerry R. Hobbs, Douglas E. Appelt, John Bear, David J. Israel, Megumi Kameyama, Mark E. Stickel, and Mabry Tyson. FASTUS: A cascaded finite-state transducer for extracting information from natural-language text. CoRR, cmp-lg/9705013, 1997. URL: http://arxiv.org/abs/cmp-lg/9705013.
- [28] Lauri Karttunen. The replace operator. Finite-State Language Processing, pages 117–147, 1997.
- [29] Francisco Maturana, Cristian Riveros, and Domagoj Vrgoc. Document spanners for extracting incomplete information: Expressiveness and complexity. In PODS, pages 125–136, 2018. doi:10.1145/3196959.3196968.
- [30] Martin Muñoz and Cristian Riveros. Streaming enumeration on nested documents. In ICDT, volume 220 of LIPIcs, pages 19:1–19:18, 2022. doi:10.4230/LIPICS.ICDT.2022.19.
- [31] Martin Muñoz and Cristian Riveros. Constant-delay enumeration for slp-compressed documents. In ICDT, volume 255, pages 7:1–7:17, 2023. doi:10.4230/LIPICS.ICDT.2023.7.
- [32] Anca Muscholl and Gabriele Puppis. The many facets of string transducers (invited talk). In Rolf Niedermeier and Christophe Paul, editors, STACS, volume 126 of LIPIcs, pages 2:1–2:21, 2019. doi:10.4230/LIPICS.STACS.2019.2.
- [33] perlre – Perl regular expressions. https://perldoc.perl.org/perlre, 2024. Accessed on 2024-09-16.
- [34] Liat Peterfreund. Grammars for document spanners. In Ke Yi and Zhewei Wei, editors, ICDT, volume 186 of LIPIcs, pages 7:1–7:18, 2021. doi:10.4230/LIPICS.ICDT.2021.7.
- [35] Liat Peterfreund, Dominik D. Freydenberger, Benny Kimelfeld, and Markus Kröll. Complexity bounds for relational algebra over document spanners. In Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019., pages 320–334, 2019. doi:10.1145/3294052.3319699.
- [36] Liat Peterfreund, Balder ten Cate, Ronald Fagin, and Benny Kimelfeld. Recursive programs for document spanners. In 22nd International Conference on Database Theory, ICDT 2019, March 26-28, 2019, Lisbon, Portugal, pages 13:1–13:18, 2019. doi:10.4230/LIPICS.ICDT.2019.13.
- [37] Cristian Riveros, Nicolás Van Sint Jan, and Domagoj Vrgoc. Rematch: a novel regex engine for finding all matches. VLDB, 16(11):2792–2804, 2023. doi:10.14778/3611479.3611488.
- [38] Cristian Riveros, Markus L. Schmid, and Nicole Schweikardt. A framework for extraction and transformation of documents. CoRR, abs/2405.12350, 2024. doi:10.48550/arXiv.2405.12350.
- [39] Markus L. Schmid and Nicole Schweikardt. A purely regular approach to non-regular core spanners. In Ke Yi and Zhewei Wei, editors, 24th International Conference on Database Theory, ICDT 2021, March 23-26, 2021, Nicosia, Cyprus, volume 186 of LIPIcs, pages 4:1–4:19. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPICS.ICDT.2021.4.
- [40] Markus L. Schmid and Nicole Schweikardt. Spanner evaluation over slp-compressed documents. In PODS’21: Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Virtual Event, China, June 20-25, 2021, pages 153–165, 2021. doi:10.1145/3452021.3458325.
- [41] Markus L. Schmid and Nicole Schweikardt. Document spanners - A brief overview of concepts, results, and recent developments. In Leonid Libkin and Pablo Barceló, editors, PODS ’22: International Conference on Management of Data, Philadelphia, PA, USA, June 12 - 17, 2022, pages 139–150. ACM, 2022. doi:10.1145/3517804.3526069.
- [42] Markus L. Schmid and Nicole Schweikardt. Query evaluation over slp-represented document databases with complex document editing. In PODS ’22: International Conference on Management of Data, Philadelphia, PA, USA, June 12 - 17, 2022, pages 79–89, 2022. doi:10.1145/3517804.3524158.
- [43] Luc Segoufin. Enumerating with constant delay the answers to a query. In ICDT, pages 10–20, 2013. doi:10.1145/2448496.2448498.
- [44] Panos Vassiliadis. A survey of extract–transform–load technology. International Journal of Data Warehousing and Mining (IJDWM), 5(3):1–27, 2009. doi:10.4018/JDWM.2009070101.