The Pyttern Program Query Language
Abstract
Despite the availability of numerous tools and languages for detecting structural patterns in programs, their complexity often presents a steep learning curve. This highlights the need for a program query language that is easier to learn, use, and read while remaining sufficiently expressive for defining and detecting relevant structural coding patterns in program code. To address this challenge, we present Pyttern, a query language that extends Python syntax with regular-expression-inspired wildcards, enabling intuitive pattern-based querying of Python code. Its implementation relies upon a custom pushdown automaton describing how to match patterns over program parse trees, thus providing a robust foundation for structural code analysis. We evaluate Pyttern’s usability and effectiveness through a study involving 35 master’s students, who were asked to write seven different patterns to identify known programming misconceptions. The results demonstrate that Pyttern is both easy to learn and practical to use, at least for analysing small-scale programs.
Keywords and phrases:
Pyttern, Program Query Languages, Python, Pattern Matching, Parse Tree, Pushdown Automaton, Static Code Analysis, Wildcards, Tree Pattern Matching
2012 ACM Subject Classification:
Software and its engineering → Software verification; Software and its engineering → Automated static analysis; Software and its engineering → Specialized application languages; Software and its engineering → Software maintenance tools; Information systems → Query languages; Theory of computation → Grammars and context-free languages; Theory of computation → Automata extensions
Supplementary Material:
Software (Source Code): https://github.com/JulienLie/Pyttern [10], archived at swh:1:dir:1e349725fe7d18d7785112e0599682d5e7508ab7
Editors:
Jonathan Edwards, Roly Perera, and Tomas Petricek
Series and Publisher:
Open Access Series in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
Understanding and analysing the structure of source code is a fundamental task in software engineering, particularly for code comprehension, quality assessment, and code improvement. Existing tools for querying source code often require their users to master complex query languages or formal abstractions, which is a barrier to entry for practitioners without prior experience in such systems.
This paper focuses on Pyttern, a program query language specifically designed to facilitate pattern-based querying of Python code. Unlike many existing program query languages and tools, where queries often correspond to parse tree visitors, Pyttern is a language for expressing structural patterns in Python programs using a syntax that closely resembles that of Python. The language extends Python syntax with expressive, regular-expression-inspired wildcards, enabling intuitive queries that are easy to write and read. By maintaining syntactic similarity with Python, Pyttern minimises the learning curve, making it more accessible to junior developers and non-expert software engineers.
Pyttern is an expressive pattern language, which raises the important question of how to design an efficient algorithm for matching its expressive patterns to program code. Key design criteria for this matching algorithm include the following: (1) it should be sufficiently generic to facilitate adaptation to programming languages other than Python; (2) it should make it easy to add new constructs to the pattern language, such as named wildcards; and (3) it should be sufficiently efficient in practice.
Our initial language prototype [12] did not satisfy all these requirements yet. In this paper, we introduce an extended version that does. It uses a generic parser to construct a parse tree from the program code, thus making the approach applicable to any language with an existing parser. We then define a new matching algorithm that operates on these parse trees. This algorithm is based on a novel pushdown automaton for trees, which describes how the matcher traverses the parse tree of the program being matched. This automaton allows for a generic, efficient and maintainable implementation of the matching algorithm.
This paper has two main objectives: (1) to provide a detailed account of the implementation principles and theoretical foundations underlying a matching algorithm suitable for a language like Pyttern, and (2) to assess the usability of Pyttern through a new validation study involving junior developers. To achieve these objectives, we describe the language syntax, the design decisions that guided its implementation and then discuss its underlying implementation architecture and formalism. We then conduct an empirical evaluation in which 35 master’s students used Pyttern to express queries aimed at detecting seven known programming misconceptions.
The remainder of this paper is structured as follows: Section 2 provides background on related program query languages. Section 3 presents and illustrates the Pyttern language. Section 4 discusses Pyttern’s implementation architecture and underlying formalism. Section 5 presents our empirical validation and results and discusses the implications of our findings. Section 6 summarizes our contributions and outlines directions for future work.
2 Related Work
Before presenting the Pyttern language itself, this section reviews related languages that served as sources of inspiration for Pyttern’s design and implementation.
First of all, code search engines and languages like FaCoY [8], Lancer [22], and SLAMPA [23] use code snippets as their query format and attempt to find matching or relevant code based on the given snippet. This philosophy that “code queries code” makes these search tools simpler and more developer-friendly than static analysis tools that require more structured queries. Such snippet-based methods allow natural code input instead of requiring developers to learn a complex query language, thus focussing on developer convenience, rather than deep program analysis with more exhaustive correctness guarantees.
Code with holes takes a more structured approach by supporting “holes” (placeholders) in queries. Instead of relying on a search engine to determine which code fragments match a given code snippet, this approach represents partial programs using variables or wildcards to specify unknown parts [16, 13, 14]. Some approaches extend this idea further by introducing custom query languages that significantly modify or extend an existing programming language [6, 9].
This contrasts with more traditional program query languages (like CodeQL [4, 20, 1] or Datalog-based tools [18]) and static analysis techniques that rely on explicitly structured queries (e.g., “Find all methods with an integer parameter and a return type of float”), graph-based representations of code (like ASTs, control-flow graphs, or program dependence graphs), or even abstract interpretation or other formal methods for deeper semantic reasoning.
De Roover et al.’s [17] program query language tries to achieve the best of both worlds by mixing a powerful logic-based query language with the possibility of including concrete source code templates as part of the queries, and matching these queries against a combination of structural and behavioural program representations, including call-graphs, points-to analysis results and abstract syntax trees. The idea is that the source code excerpts embedded in the logic queries act as prototypical samples of the structure and behaviour they intend to match. To use this approach, however, a developer is still required to be aware of the semantics of the underlying logic-based language and the various analysis techniques that it uses when matching queries, making it harder to use than pure snippet-based query languages.
The technique of regular expressions (RegEx) shares similarities with the concept of “code with holes” in that both allow for pattern-based matching with flexible or unknown components. Like code with holes, which uses placeholders, regex wildcards (e.g., “.*”, “\w+”) can be used to represent unspecified code fragments, enabling more flexible searches. Regular expressions, however, operate on flat text, whereas code with holes is typically applied at the abstract syntax tree (AST) level, allowing for structure-aware and semantically meaningful queries. This makes code with holes more expressive and precise than regex-based approaches, as it respects the syntactic and semantic rules of programming languages rather than relying solely on textual patterns.
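To make this contrast concrete, the following minimal Python example (ours, for illustration only) shows how a flat-text regex can locate a loop header but remains oblivious to the code nested inside it:

```python
# Illustration of the limitation discussed above: a regex over flat text finds
# the loop header, but cannot reason about what happens inside the loop body.
import re

code = "for x in range(10):\n    total = total + x\n"
print(re.findall(r"for\s+(\w+)\s+in\s+(.+):", code))  # [('x', 'range(10)')]
# A textual pattern cannot express structural properties such as "the loop
# variable is (not) used anywhere in the loop body", whereas a parse-tree
# pattern can.
```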
Codegex [21] and many other program query languages and analysis tools [5, 2] leverage regular expressions to query and analyse program code. Many lint tools [7], as well as other program analysis tools, operate by first parsing source code into a tree representation and subsequently traversing this tree to examine the program’s structure.
Pyttern draws significant inspiration from SCRUPLE [16], a tool designed for querying C source code through high-level patterns. SCRUPLE introduces a symbolic representation that allows syntactic entities in C code to be substituted with placeholders (“holes”). The expressiveness of its pattern language stems from four key types of pattern symbols: wildcards representing individual syntactic entities, wildcards for groups of entities, named wildcards, and additional specialized features.
To perform pattern matching, SCRUPLE employs an extended nondeterministic finite state automaton referred to as the Code Pattern Automaton (CPA), which also operates on an input tree. SCRUPLE’s algorithm provided inspiration for the matching algorithm implemented in Pyttern. However, the full theoretical basis underlying SCRUPLE has not been published by its authors [16] nor in follow-up work. We found in our earlier work [12] that knowledge of the details of this theoretical basis is crucial for a practical implementation. For instance, when matching labeled wildcards, it is important to define precisely how the automaton treats such wildcards. With this paper, we aim to give pattern matching on generic parse trees a firmer theoretical basis that allows for efficient, generic and maintainable implementations of languages such as Pyttern and SCRUPLE, while also enabling deeper theoretical analysis in the future.
3 Pyttern
Pyttern’s goal is to provide a program query language that balances ease of use and expressiveness. It shares similarities with code snippet-based search engines, “code with holes” approaches, and other query languages that support concrete source code templates. Like these approaches, Pyttern aims to minimise the learning curve by staying close to the syntax of the programming language being analyzed. To make the language easy to learn, Pyttern’s syntax is a mere extension of Python syntax, augmented with dedicated wildcards for pattern matching. Despite its simplicity, Pyttern remains expressive, offering powerful wildcards inspired by regular expressions. It performs matching at the level of parse trees, ensuring robust and flexible pattern recognition beyond mere syntactic similarity.
Table 1: Wildcards supported by Pyttern.

| Wildcard | Meaning |
|---|---|
| ? | Matches a single node in the parse tree. |
| ?* | Matches zero or more nodes in a parse tree list node. |
| ?{n, m} | Matches between n and m nodes. |
| ?name | Matches a single node and names it for later reference. |
| ?[Type, …] | Matches a parse tree node of any of the given types. |
| ?: | Matches the body of this wildcard at one nesting level. |
| ?:* | Matches the body of this wildcard at any nesting level. |
Table 1 lists the wildcards supported by Pyttern, our dedicated Python pattern matching query language. These wildcards are blended into ordinary Python syntax, making the language easy to learn for any developer with basic Python knowledge.
A typical coding idiom in an imperative language like Python is the accumulator, an example of which is given in Listing 1. In essence, an accumulator pattern refers to a technique where one first initializes an accumulator variable (line 2), and then iterates over a sequence (line 4), updating the accumulator variable with the result of some operation during the iteration (line 6). In the end, the accumulator variable that retains the cumulative result of all these operations is returned as the result (line 7).
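For illustration, a minimal instance of this idiom could look as follows (our own example; the exact code in Listing 1 may differ):

```python
# A typical accumulator: initialise, iterate, update, and return the result.
def sum_list(numbers):
    total = 0                 # initialise the accumulator variable
    for n in numbers:         # iterate over a sequence
        total = total + n     # update the accumulator on every iteration
    return total              # return the cumulative result
```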
The structure of this coding idiom can be expressed straightforwardly as a Pyttern pattern, as illustrated in Listing 2. The Pyttern pattern follows the syntax of a Python program, but with wildcards acting as placeholders for matches. For example, the function name, the loop variable and sequence, and the value added to the accumulator are abstracted by a single-node wildcard ? on lines 1, 4 and 6, respectively. The accumulator variable is represented by a named wildcard ?accumulator on lines 2, 6 and 7, which ensures that the same variable name is matched in all these places. The function’s parameter list on line 1 is abstracted by the ?* wildcard, which matches a list of nodes. Finally, the ?:* wildcards on lines 3 and 5 allow the statements nested in the body of each wildcard to appear at any nesting level.
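To give a feel for how such a pattern reads, the sketch below shows one plausible rendering of the accumulator pattern described above; the exact concrete syntax of Listing 2, in particular the rendering of the ?:* block wildcard and the initial value on line 2, may differ.

```
def ?(?*):                                        # line 1: any function name and parameter list
    ?accumulator = 0                              # line 2: initialise the accumulator (initial value assumed)
    ?:*                                           # line 3: at any nesting depth ...
        for ? in ?:                               # line 4: ... a loop over some sequence
            ?:*                                   # line 5: at any nesting depth inside the loop ...
                ?accumulator = ?accumulator + ?   # line 6: ... an update of the same accumulator
    return ?accumulator                           # line 7: return the same accumulator variable
```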
4 Implementation
We now provide a more detailed explanation of the current implementation of Pyttern.
Overall Architecture
As previously mentioned, Pyttern’s language design and pattern matching strategy are inspired by those of SCRUPLE. Pyttern’s overall architecture, depicted in Figure 1, consists of four main components: two parsers (one for the Pyttern pattern and one for the Python program being matched), a custom push-down automaton (PDA), and a matcher which uses the PDA to visit the Python parse tree looking for matches, and returns a match set. Intuitively, the PDA describes the set of transitions the matching algorithm needs to follow to walk through a Python program in order for it to match the Pyttern pattern described by that PDA. Each match in the final match set is a valid set of transitions that leads to a match between the Pyttern pattern and the Python code being matched.
Parsers
To facilitate reasoning about nested programming constructs, Pyttern employs tree matching. This requires Python programs to be parsed into a parse tree. To do so, we chose ANTLR due to its maturity and the availability of an already defined Python 3 grammar (https://github.com/antlr/grammars-v4/tree/master/python/python3). ANTLR uses an adaptive LL(*) parser [15] to generate a parse tree for a given Python program. Figure 2 shows the parse tree produced for the Python program in Listing 1.
For parsing Pyttern expressions, which are essentially Python programs with specific wildcards, we extended ANTLR’s Python3 grammar by adding extra production rules for these wildcards. Pyttern expressions can then be transformed into a parse tree. It is these two parse trees, the Python and the Pyttern one, which will then need to be matched.
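As an illustration, parsing a Python file with an ANTLR-generated parser could look roughly as follows; the class and rule names (Python3Lexer, Python3Parser, file_input) come from the generated grammar artifacts and are assumptions here, not Pyttern's actual API.

```python
# Sketch of parsing Python source with an ANTLR-generated parser; the generated
# class and rule names are assumptions and may differ in the Pyttern code base.
from antlr4 import FileStream, CommonTokenStream
from Python3Lexer import Python3Lexer
from Python3Parser import Python3Parser

def parse_python_file(path: str):
    tokens = CommonTokenStream(Python3Lexer(FileStream(path)))
    parser = Python3Parser(tokens)
    tree = parser.file_input()                  # root rule of the grammar
    print(tree.toStringTree(recog=parser))      # Lisp-style view of the parse tree
    return tree
```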
We chose ANTLR for its widespread adoption, its ease of use, and the many grammars already available for it. This allows us to define a technique that can be reused for other languages: we should be able to adapt Pyttern to other languages for which ANTLR grammars already exist, or even use ANTLR to create grammars for languages it does not yet support.
Custom Push-Down Automaton
The next step in our approach concerns the transformation of the Pyttern pattern, referred to as a pyttern, into an automaton that is used by our matching algorithm. An example of such an automaton is given in Figure 3. It is produced by a visitor that walks through the pyttern parse tree and generates the PDA depending on the wildcards it encounters. Due to space limitations, we cannot describe here in full detail how each node in the pyttern parse tree is transformed into PDA transitions. Nevertheless, we will explain these transformations intuitively through the example used in the following paragraphs.
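To give an idea of the general shape of this transformation without reproducing its details, the following skeleton shows a visitor that walks the pyttern parse tree and appends transitions to the automaton under construction. It is a structural sketch only: the node kinds, the transition representation, and the encoding of ?* as a self-loop are our assumptions, not Pyttern's actual implementation.

```python
# Structural sketch of a pyttern-to-PDA visitor; it omits stack symbols and
# navigation directions, and all names here are placeholders.
class PDABuilder:
    def __init__(self):
        self.transitions = []   # the "program" of the automaton
        self.state_count = 0

    def new_state(self) -> int:
        self.state_count += 1
        return self.state_count

    def visit(self, node, state: int) -> int:
        """Emit transitions for `node`, entering at `state`; return the exit state."""
        if node.kind == "single_wildcard":          # the ? wildcard: accept any one node
            return self.emit(state, match_any=True)
        if node.kind == "list_wildcard":            # the ?* wildcard: zero or more nodes,
            self.emit(state, match_any=True, target=state)   # encoded here as a self-loop
            return state
        # ordinary syntax node: require its label, then visit its children in order
        state = self.emit(state, label=node.label)
        for child in node.children:
            state = self.visit(child, state)
        return state

    def emit(self, source, label=None, match_any=False, target=None):
        target = target if target is not None else self.new_state()
        self.transitions.append((source, label, match_any, target))
        return target
```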
Formalism
Traditional push-down automata (PDAs) are capable of recognising context-free languages over strings. In our approach, however, the input of the matching algorithm is a parse tree, as this allows the use of standard parsers to create such trees. Hence, we already have information about the nested nature of the code, which can allow pattern matches to be found more efficiently. To build our matcher, we therefore propose a PDA that operates directly on the parse tree of Python source code.
Our custom pushdown automaton can be described as a 6-tuple $(Q, \Sigma, W, \delta, q_0, q_f)$ where:
- $Q$ is a finite set of states generated by the pyttern-to-PDA transformer;
- $\Sigma$ is the alphabet of labels of the nodes in the trees we will parse;
- $W$ is an additional alphabet representing pyttern labeled wildcards ($W \cap \Sigma = \emptyset$);
- $\delta$ is the transition relation, described below, indicating the program of the automaton;
- $q_0 \in Q$ is the start state;
- $q_f \in Q$ is the end state.
Each transition (instruction in the program) is a 6-tuple $(q, a, s, d, q', s')$ where:
- $q \in Q$ is the current state;
- $a \in \Sigma \cup W \cup \{\varepsilon\}$ indicates a label or labeled wildcard condition on the current node in the input tree in order to execute the transition;
- $s \in \{B, I, \varepsilon\}$ indicates a condition on the symbol on the top of the stack of the automaton (we chose $B$ to denote Body and $I$ to denote Indentation; these labels are arbitrary and primarily reflect the parse tree structure, rather than the complete syntactic structure of the code);
- $d$ is the navigation direction in the input tree;
- $q' \in Q$ is the next state the automaton should move to;
- $s'$ is the stack symbol to push onto the stack after the transition.
The PDA uses these transitions to move from one configuration to another. A configuration of our PDA is a 4-tuple $(q, n, \gamma, m)$, where:
- $q \in Q$ is a state of the PDA;
- $n \in N$ is the node that we are currently considering in the Python parse tree, where $N$ is the set of nodes of that parse tree;
- $\gamma$ is a string of symbols currently on the stack (in the stack, $B$ will be used to denote a node with multiple children, so that when coming back up to a parent we will stop at this node, and $I$ will be used for a node with a single child, allowing us to go back up until a $B$);
- $m$ is a function mapping named wildcards to nodes of the Python parse tree.
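For concreteness, these tuples could be represented as plain data structures as sketched below; the field names are ours and this is not the actual Pyttern implementation.

```python
# Sketch of the 6-tuple transitions and 4-tuple configurations defined above.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Transition:
    q: str                 # current state
    a: Optional[str]       # label in Sigma, labeled wildcard in W, or None for epsilon
    s: Optional[str]       # required top-of-stack symbol ("B" or "I"), or None
    d: str                 # navigation direction in the input tree
    q_next: str            # next state
    s_push: Optional[str]  # stack symbol to push after the transition, or None

@dataclass(frozen=True)
class Configuration:
    q: str                          # current PDA state
    n: object                       # current node of the Python parse tree
    gamma: tuple = ()               # stack contents (last element is the top)
    m: frozenset = frozenset()      # bindings {(named wildcard, node)} made so far
```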
Please note that the amount of information stored in one configuration is limited, making searching through a space of configurations efficient.
The initial configuration of the automaton is $(q_0, r, \varepsilon, \emptyset)$, where $r$ is the root of the Python parse tree. In other words, the automaton starts in its starting state on the root of the input tree, with an empty stack and an empty mapping function. We can transition from one configuration to another, denoted by $(q, n, \gamma, m) \vdash (q', n', \gamma', m')$, if there is a transition $(q, a, s, d, q', s')$ in the program of our PDA that allows this. The requirements specified in the transition need to hold as follows:
- depending on the label condition specified in the transition, one of the following holds:
  - $a = \varepsilon$: the transition does not impose a condition on the input label in order to be executed;
  - $a \in \Sigma$: the transition imposes a condition on the input label. If $a$ is equal to the label of node $n$ in the input tree, the label condition is satisfied; intuitively, if the transition specifies an input label in $\Sigma$, we must find that input label at the current node in the input in order to move to the state $q'$;
  - $a \in W$: the transition imposes a labeled wildcard condition. If $a$ is already in the domain of $m$ (i.e., we have already matched the labeled wildcard previously), and the subtree rooted at $n$ is isomorphic to the subtree rooted at $m(a)$, the labeled wildcard condition is satisfied; intuitively, if the transition specifies a wildcard in $W$, the structure and the labels of the subtree below the current node must be isomorphic to the subtree and labels below the previous node that was matched for the same named wildcard;
- $s = \varepsilon$, or $s$ is equal to the last symbol of the stack $\gamma$; intuitively, if the transition specifies a stack symbol, we must find that stack symbol at the top of the stack;
- $n'$ is the next node in the input tree to consider, determined by the navigation direction $d$;
- $\gamma' = \gamma \cdot s'$ if $s = \varepsilon$, or $\gamma'$ is obtained from $\gamma$ by removing its last symbol and appending $s'$ if $s \neq \varepsilon$; in other words, we push the symbol $s'$ from the transition onto the stack, after removing the top of the stack if the transition required a symbol there;
- $m' = m \cup \{a \mapsto n\}$ if $a \in W$ and $a$ is not yet in the domain of $m$; otherwise $m' = m$. In other words, if the transition specifies a labeled wildcard in $W$ that is not mapped to a node in $m$ yet, this mapping is stored.
If one of the conditions of a transition is not met, we cannot use that transition to move to another configuration. We define that the PDA matches the input tree if there is a sequence of transitions from the starting configuration to an end configuration $(q_f, n_{end}, \gamma, m)$, where $q_f$ is the end state and $n_{end}$ is the last node in the input tree.
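The subtree check used for named wildcards can be made concrete with a small sketch: two nodes satisfy a previously bound wildcard only if their subtrees have the same shape and the same labels. The Node class below is a simplified stand-in for an ANTLR parse-tree node.

```python
# Sketch of the isomorphism check for named wildcards described above.
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list["Node"] = field(default_factory=list)

def isomorphic(a: Node, b: Node) -> bool:
    """True if the subtrees rooted at a and b have identical structure and labels."""
    return (a.label == b.label
            and len(a.children) == len(b.children)
            and all(isomorphic(x, y) for x, y in zip(a.children, b.children)))

# Both occurrences of ?accumulator must bind to identical subtrees,
# e.g. the identifier `total` appearing in two different statements.
assert isomorphic(Node("total"), Node("total"))
assert not isomorphic(Node("total"), Node("result"))
```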
Matcher
The matcher component of Figure 1 thus boils down to finding such a sequence of transitions for this automaton. Let us illustrate this on a concrete example. For the parse tree of Figure 2, we can reach an end configuration from the starting configuration using the following transitions of the PDA from Figure 3:
(Derivation: the sequence of configurations, Equations (2)–(14) in the original, obtained by successively applying Transitions 1, 2, 15, 16, 17, $n-1$, $n$, $n+1$, $f-1$ and $f$ of the PDA in Figure 3.)
These transitions specifically illustrate how our mapping for named wildcards operates. When a named wildcard is encountered that is not yet present in the mapping $m$, it is added to $m$. The later transitions demonstrate the alternative scenario: when the named wildcard is already in $m$, we verify that the structure and labels of the subtree below the current node are isomorphic to those of the subtree below the previously matched node for that named wildcard.
Our PDA is non-deterministic; to find such a sequence of transitions we implemented a depth-first backtracking search that branches whenever multiple transitions are possible. While parsing context-free languages with pushdown automata can generally be done in polynomial time [3], adding the named wildcards and the mapping $m$ can make our search space explode. This makes our custom PDA non-polynomial when named wildcards are used, and every additional named wildcard increases this complexity further. We already apply some optimisations, such as dynamic programming, in our matching algorithm, but we will not go into all the details here. For more implementation details, see the project on GitHub: https://github.com/JulienLie/Pyttern.
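The search itself can be sketched generically as follows; transitions_from, apply and is_final are stand-ins for the real Pyttern machinery (the available transitions of the PDA, the configuration-successor function, and the end-configuration test), so this is an illustration of the strategy rather than the actual implementation.

```python
# Generic sketch of a depth-first backtracking search over PDA configurations.
from typing import Callable, Hashable, Iterable, Optional, TypeVar

C = TypeVar("C", bound=Hashable)   # configuration type
T = TypeVar("T")                   # transition type

def search(start: C,
           transitions_from: Callable[[C], Iterable[T]],
           apply: Callable[[C, T], Optional[C]],
           is_final: Callable[[C], bool]) -> Optional[list[T]]:
    """Return one accepting sequence of transitions, or None if no match exists."""
    stack: list[tuple[C, list[T]]] = [(start, [])]
    seen: set[C] = set()                      # avoid re-exploring configurations
    while stack:
        config, path = stack.pop()
        if is_final(config):
            return path
        if config in seen:
            continue
        seen.add(config)
        for t in transitions_from(config):    # non-determinism: several transitions may fire
            nxt = apply(config, t)            # None when a transition condition is not met
            if nxt is not None:
                stack.append((nxt, path + [t]))
    return None
```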
5 Validation
To validate our Pyttern language, we asked 35 CS master’s students to write Pyttern patterns, or pytterns for short, to detect symptoms of 7 distinct misconceptions in a dataset of potential answers from a first-year bachelor course. Table 2 lists the different misconceptions for which we asked them to write a pyttern. These misconceptions were selected based on a previous study in which we detected multiple patterns in students’ code [11]. We selected patterns of varying complexity with respect to the pytterns needed to detect them.
Table 2: The misconceptions for which participants were asked to write a pyttern.

| Misconception | Description |
|---|---|
| NoEmptyInit | A class containing an __init__ method with pass as its only statement. |
| InitReturnsObject | A class C where the __init__ method returns an object of the class itself, i.e. return C(). |
| EraseArg | A function parameter gets reassigned to a different value, often a constant, within the function’s body. |
| LoopVarUnused | A program containing a loop, with a loop variable that is not used inside this loop. |
| IncrForLoop | A program containing a for-loop, where the loop variable is incremented manually. |
| EarlyReturn | A program containing a return statement in the body of a loop, causing the loop to terminate prematurely. |
| PrintReturn | A function that has no return statement at the end but just a print statement. |
To obtain a controlled test set, we generated 507 code samples containing symptoms of specific misconceptions for the first-year bachelor course, using the technique described by Steveny et al. [19], which defines a set of transformations to be applied to a base code set. By iteratively applying these transformations, we produced a series of code samples containing the various misconceptions to be detected. We also included dummy transformations that did not introduce any misconception, ensuring that our test set contains a mix of both relevant and irrelevant variations of the base code set. With this technique, we eventually obtained a validation set containing 132 (26%) programs with no misconceptions, 249 (49%) with one misconception and 123 (25%) with two different misconceptions.
Before they started implementing their pytterns, we also asked our study participants about their background knowledge of concepts related to programming and program analysis. The aggregated results are shown in Figure 4. From this graph, we can observe that our participants considered themselves quite knowledgeable in Python, though not experts. They also had some knowledge of design patterns (from a software engineering course they had followed). However, they had little or no experience with using AST visitors or other program query languages to analyse programs for relevant structural patterns. We therefore consider our study participants to be non-expert programmers with very limited background in program analysis.
To understand how non-expert programmers adopt and utilise Pyttern, we analysed the queries they implemented. An example of a pyttern written by a participant is shown in Listing 3; it matches the InitReturnsObject misconception found in Listing 4. We compared their pytterns against a manually validated set of reference pytterns (written by the language designer). This evaluation was intended to determine how well our study participants could write Pyttern patterns, comprehend the language’s syntax, and use the different wildcards provided by the language. We also analysed their implementations to understand some of the challenges they faced and to get an idea of the learning curve of the language.
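For illustration, a pyttern for the InitReturnsObject misconception could plausibly be written along the following lines (our own sketch; the participant’s actual Listing 3 and its concrete syntax may differ):

```
class ?cls:                       # the named wildcard ?cls binds the class name
    def __init__(?*):             # any parameter list
        ?:*                       # at any nesting depth inside __init__ ...
            return ?cls(?*)       # ... a return of a new instance of the same class
```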
From the 35 students who attempted to use Pyttern, (only) 29 were able to write at least one valid pyttern, and 28 successfully matched at least one file (containing a misconception to be detected). We ran each student’s pytterns on our validation set and compared their results with our ground truth. Figure 5 presents a heatmap of F1 scores for the detection rate of each misconception across all files, allowing us to quickly identify which misconceptions students detected most effectively and which posed greater challenges.
From Figure 5 we can observe that certain misconceptions were consistently well identified across most files, whereas others show lower F1 scores and appear more challenging to capture. In particular, some misconceptions display strong clustering of high scores, indicating that many students managed to write these patterns satisfactorily. These corresponded to patterns that could be defined with a simple Pyttern expression. Not surprisingly, the ones with lower scores corresponded to patterns that required more complex pytterns or pytterns with more complex structures.
Meanwhile, Figure 6 displays a stacked bar plot of F1 scores per participant, allowing us to assess how each participant performed across the seven misconceptions to be detected. There is considerable variation in the participants’ ability to use Pyttern to write patterns for detecting misconceptions. Whereas some participants (on the left) were successful at writing patterns for most misconceptions, others (on the right) only managed a few. This figure also confirms that for some misconceptions most participants managed to write valid pytterns, whereas for other misconceptions this was much less the case.
Nevertheless, our results do show that the majority of study participants successfully developed valid pytterns, in spite of their limited prior knowledge of program query languages. Their previous knowledge of Python helped them to understand and effectively utilise Pyttern, at least for simpler patterns, suggesting that the language design offers a gentle learning curve. This suggests that Pyttern can be adopted by users with basic Python skills without too much difficulty. The limited scope and duration of the experiment unfortunately did not allow us to assess whether, with more time and experience, the participants would have been able to write more advanced pytterns successfully.
Although the misconceptions for which we asked our participants to write Pyttern checkers were presented in a certain order, participants had the freedom to implement them in any order within a given time period (4 hours). Nevertheless, we checked whether the order in which the misconceptions were presented affected the mean F1 score for each misconception. This analysis is summarised in Figure 7. We performed a one-tailed Pearson correlation test between the order of the question and the mean F1 score, our hypothesis being that later questions could yield a lower F1 score (in other words, that students would focus first on the misconceptions that were presented first). The test suggested a moderate-to-strong negative relationship between the order of a question and its mean F1 score, although the correlation did not reach statistical significance at our chosen significance level.
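For reference, such a one-tailed test can be computed directly with SciPy (version 1.9 or later, where pearsonr accepts the alternative keyword); the function below is a sketch with data supplied by the caller, not the exact analysis script used in the study.

```python
# Sketch of the one-tailed Pearson test: H1 is that later questions have a
# *lower* mean F1 score, i.e. a negative correlation with presentation order.
from scipy.stats import pearsonr

def order_effect(order, mean_f1):
    """order: presentation positions 1..k; mean_f1: mean F1 score per misconception."""
    return pearsonr(order, mean_f1, alternative="less")   # returns (r, p-value)
```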
Obviously, there are still many threats to the validity of this evaluation of our new implementation of Pyttern. First, the participant sample was limited and may not accurately represent the intended population of potential users of Pyttern, nor their level of expertise. Additionally, the version of Pyttern used by our participants was not entirely bug-free during the evaluation. This could have led to participants writing buggy pytterns and missing matches. (It could also explain why 6 participants were not able to write any valid pyttern.) A few participants also had previous experience with Pyttern, as they had participated in an earlier experiment with the first prototype of the Pyttern language a year earlier. This may have biased their performance positively, as the learning curve would have been lower for them. Finally, the set of misconceptions that we selected for this validation might not fully capture the range of patterns that can be defined with Pyttern, nor be representative of the full set of misconceptions that can be found in small programs.
6 Conclusion and Future Work
In this paper, we introduced a new formalism for Pyttern, a novel approach to pattern matching in Python source code using a modified nondeterministic pushdown automaton. We describe the automaton in detail, along with the foundational principles of this new version of Pyttern, which we believe establishes a robust basis for further improvement. Our formalism presents a general technique for parse tree pattern matching that could be adapted for other programming languages. Pyttern is an implementation of this formalism in Python to investigate whether defining a query language closely aligned with the syntax of the programming language being analyzed (e.g., Python) can assist non-expert developers in writing concise patterns.
Our validation study with master’s students suggests that Pyttern is generally easy to learn, though there is room for improvement. Participants with little or no experience in program query languages were able to implement simple pytterns using their previous Python knowledge. However, the creation of more complex patterns proved challenging, highlighting areas where further improvements in the language’s design could enhance usability. Overall, the successful implementation of different patterns and the constructive feedback received indicate a promising path forward for Pyttern.
While the examples and evaluations presented in this paper are based on single-source files, the underlying approach is not inherently limited to this setting. The current implementation focuses on per-file analysis to simplify parsing and transformation workflows. Extending the system to support multi-file projects primarily requires addressing cross-file symbol resolution and coordinated pattern matching, which are orthogonal to the main contributions of this work. We leave this direction for future work.
Future work will focus on improving the expressiveness of Pyttern, in particular its ability to compose patterns from simpler patterns and to combine patterns with logical operators. In addition, Pyttern is currently entirely based on Python. We plan to extend the idea to other programming languages and to turn the core features of Pyttern (i.e., the PDA and the matcher) into a pattern-matching framework for arbitrary languages. In fact, we have already started this endeavour, as we are currently implementing a Java version of Pyttern.
References
- [1] Codeql website. URL: https://codeql.github.com/.
- [2] Spotbugs website, 2006. URL: https://spotbugs.github.io/.
- [3] Alfred V. Aho, John E. Hopcroft, and Jeffrey D. Ullman. Time and tape complexity of pushdown automaton languages. Inf. Control., 13(3):186–206, 1968. doi:10.1016/S0019-9958(68)91087-5.
- [4] Pavel Avgustinov, Oege de Moor, Michael Peyton Jones, and Max Schäfer. QL: object-oriented queries on relational data. In Shriram Krishnamurthi and Benjamin S. Lerner, editors, 30th European Conference on Object-Oriented Programming, ECOOP 2016, July 18-22, 2016, Rome, Italy, volume 56 of LIPIcs, pages 2:1–2:25. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2016. doi:10.4230/LIPIcs.ECOOP.2016.2.
- [5] Nathaniel Ayewah, David Hovemeyer, J. David Morgenthaler, John Penix, and William W. Pugh. Using static analysis to find bugs. IEEE Softw., 25(5):22–29, September 2008. doi:10.1109/MS.2008.130.
- [6] Katsuro Inoue, Yuya Miyamoto, Daniel M. Germán, and Takashi Ishio. Code clone matching: A practical and effective approach to find code snippets. CoRR, abs/2003.05615, 2020. doi:10.48550/arXiv.2003.05615.
- [7] Stephen C Johnson. Lint, a C program checker. Bell Telephone Laboratories Murray Hill, 1977.
- [8] Kisub Kim, Dongsun Kim, Tegawendé F. Bissyandé, Eunjong Choi, Li Li, Jacques Klein, and Yves Le Traon. Facoy: a code-to-code search engine. In Michel Chaudron, Ivica Crnkovic, Marsha Chechik, and Mark Harman, editors, Proceedings of the 40th International Conference on Software Engineering, ICSE 2018, Gothenburg, Sweden, May 27 - June 03, 2018, pages 946–957. ACM, 2018. doi:10.1145/3180155.3180187.
- [9] Julia Lawall and Gilles Muller. Coccinelle: 10 years of automated evolution in the linux kernel. In Haryadi S. Gunawi and Benjamin C. Reed, editors, Proceedings of the 2018 USENIX Annual Technical Conference, USENIX ATC 2018, Boston, MA, USA, July 11-13, 2018, pages 601–614. USENIX Association, 2018. URL: https://www.usenix.org/conference/atc18/presentation/lawall.
- [10] Julien Liénard. Pyttern. Software, swhId: swh:1:dir:1e349725fe7d18d7785112e0599682d5e7508ab7 (visited on 2025-08-28). URL: https://github.com/JulienLie/Pyttern, doi:10.4230/artifacts.24594.
- [11] Julien Liénard, Kim Mens, and Siegfried Nijssen. Extracting unit tests from patterns mined in student code to provide improved feedback in autograders. In Andrea De Lucia, Dario Di Nucci, Valeria Pontillo, and Gilberto Recupito, editors, Proceedings of the 15th Seminar on Advanced Techniques & Tools for Software Evolution, University of Salerno, Computer Science Department - Fisciano (Salerno, Italy), June 12 to 14, 2023, volume 3483 of CEUR Workshop Proceedings, pages 48–56. CEUR-WS.org, 2023. URL: https://ceur-ws.org/Vol-3483/paper5.pdf.
- [12] Julien Liénard, Kim Mens, and Siegfried Nijssen. Pyttern: a python-based program query language. In Gilles Perrouin, Benoît Vanderose, and Xavier Devroey, editors, Proceedings of the 23rd Belgium-Netherlands Software Evolution Workshop, Namur, Belgium, November 21-22, 2024, volume 3941 of CEUR Workshop Proceedings, pages 88–96. CEUR-WS.org, 2024. URL: https://ceur-ws.org/Vol-3941/BENEVOL2024_TECH_paper10.pdf.
- [13] Alon Mishne, Sharon Shoham, and Eran Yahav. Typestate-based semantic code search over partial programs. In Gary T. Leavens and Matthew B. Dwyer, editors, Proceedings of the 27th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA 2012, part of SPLASH 2012, Tucson, AZ, USA, October 21-25, 2012, pages 997–1016. ACM, 2012. doi:10.1145/2384616.2384689.
- [14] Rohan Mukherjee, Chris Jermaine, and Swarat Chaudhuri. Searching a database of source codes using contextualized code search. Proc. VLDB Endow., 13(10):1765–1778, 2020. doi:10.14778/3401960.3401972.
- [15] Terence Parr, Sam Harwell, and Kathleen Fisher. Adaptive ll(*) parsing: the power of dynamic analysis. In Andrew P. Black and Todd D. Millstein, editors, Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages & Applications, OOPSLA 2014, part of SPLASH 2014, Portland, OR, USA, October 20-24, 2014, volume 49, pages 579–598. ACM, 2014. doi:10.1145/2660193.2660202.
- [16] Santanu Paul and Atul Prakash. A framework for source code search using program patterns. IEEE Trans. Software Eng., 20(6):463–475, 1994. doi:10.1109/32.295894.
- [17] Coen De Roover, Theo D’Hondt, Johan Brichau, Carlos Noguera, and Laurence Duchien. Behavioral similarity matching using concrete source code templates in logic queries. In G. Ramalingam and Eelco Visser, editors, Proceedings of the 2007 ACM SIGPLAN Workshop on Partial Evaluation and Semantics-based Program Manipulation, 2007, Nice, France, January 15-16, 2007, pages 92–101. ACM, 2007. doi:10.1145/1244381.1244398.
- [18] Bernhard Scholz, Herbert Jordan, Pavle Subotic, and Till Westmann. On fast large-scale program analysis in datalog. In Ayal Zaks and Manuel V. Hermenegildo, editors, Proceedings of the 25th International Conference on Compiler Construction, CC 2016, Barcelona, Spain, March 12-18, 2016, CC ’16, pages 196–206, New York, NY, USA, 2016. ACM. doi:10.1145/2892208.2892226.
- [19] Guillaume Steveny, Julien Liénard, Kim Mens, and Siegfried Nijssen. Feedback with bert: When detecting students’ misconceptions becomes automatic. In Responsible Knowledge Discovery in Education (RKDE), 2024.
- [20] Tamás Szabó. Incrementalizing production codeql analyses. In Satish Chandra, Kelly Blincoe, and Paolo Tonella, editors, Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2023, San Francisco, CA, USA, December 3-9, 2023, pages 1716–1726. ACM, 2023. doi:10.1145/3611643.3613860.
- [21] Xiaowen Zhang, Ying Zhou, and Shin Hwei Tan. Efficient pattern-based static analysis approach via regular-expression rules. In Tao Zhang, Xin Xia, and Nicole Novielli, editors, IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2023, Taipa, Macao, March 21-24, 2023, pages 132–143. IEEE, March 2023. doi:10.1109/SANER56733.2023.00022.
- [22] Shufan Zhou, Beijun Shen, and Hao Zhong. Lancer: Your code tell me what you need. In 34th IEEE/ACM International Conference on Automated Software Engineering, ASE 2019, San Diego, CA, USA, November 11-15, 2019, pages 1202–1205. IEEE, 2019. doi:10.1109/ASE.2019.00137.
- [23] Shufan Zhou, Hao Zhong, and Beijun Shen. SLAMPA: recommending code snippets with statistical language model. In 25th Asia-Pacific Software Engineering Conference, APSEC 2018, Nara, Japan, December 4-7, 2018, pages 79–88. IEEE, 2018. doi:10.1109/APSEC.2018.00022.
