
Exploring Mutation Testing for Teaching Introductory Programming

Pedro Vasconcelos, LIACC, Departamento de Ciência de Computadores, Faculdade de Ciências, Universidade do Porto, Portugal
Abstract

This paper proposes the use of introductory programming assignments based on mutation testing in which students are asked to write tests rather than code. We believe such exercises can be used to teach code reading skills before students can write the corresponding programs on their own. Furthermore, feedback for such exercises can be automatically generated using testing tools. We have extended an existing web-based system for programming exercises with such mutation testing assignments and show some example use cases. This is ongoing work that has yet to be validated in the classroom.

Keywords and phrases:
mutation testing, programming education
Copyright and License:
© Pedro Vasconcelos; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Social and professional topics → Computing education; Software and its engineering → Software testing and debugging
Editors:
Ricardo Queirós, Mário Pinto, Filipe Portela, and Alberto Simões

1 Introduction

Many undergraduate introductory programming courses employ some form of automatic feedback tool to help students with their learning process. The most common approach is to require students to submit code and then check it against a number of test cases prepared by the instructor [7]. While such systems can be beneficial to the learning process, they often also present some pitfalls:

  1. Students typically start with only a problem description and a blank code editor, leaving them on their own to figure out how to proceed.

  2. After an erroneous submission, students often resort to trial and error, trying to fix reported errors in an "ad-hoc" way rather than taking a step back and re-assessing the problem.

  3. The reliance on instructor-provided tests may lead students to believe that the sole purpose of the assignment is to pass the tests rather than to build up an understanding that can be applied in future situations.

  4. The availability of generative AI tools such as ChatGPT and Copilot that can solve introductory programming problems automatically means that unsupervised students may ask the AI tool for help and obtain a solution that passes the tests even if they do not understand it. This can give a false sense of progression that does not translate into effective learning [6].

This paper proposes an alternative type of programming exercise where students are given the code and asked to write the tests. The general form of such exercises is the following: given a correct version of a program and one or more incorrect ones, produce a suite of tests such that the correct program is accepted and the incorrect ones are rejected. The idea is inspired by mutation testing, a well-known Software Engineering methodology for assessing the quality of test suites [4].

When teaching introductory programming, however, our primary focus is on understanding rather than engineering. Nonetheless, we believe that this style of exercise can still be beneficial for beginners:

  1. By starting from a problem description and code fragments to analyse, we avoid the initial difficulty of translating from an informal idea to a formal notation.

  2. By focusing on the differences between code fragments, we emphasize code reading skills, which are often neglected in programming courses [1].

  3. Because test suites can be assessed automatically, we can still provide fast and scalable feedback for such exercises.

  4. The ability to come up with relevant test cases is a useful skill in itself for writing and debugging programs.

As a proof of concept, we have extended the Codex system to support such exercises. The remainder of this paper describes this extension. However, at the time of writing this approach has not yet been tried in the classroom.

2 Background

Codex is a web-based system for setting up programming assignments with automatic assessment [10]. The system supports different programming languages (Python, Haskell, Java, C and SQL) and types of assignments (input-output tests, unit tests, property tests). It has been developed by the author of this paper and is used in several programming courses at the Faculty of Science of the University of Porto. The source code and user guide are available at https://github.com/pbv/codex.

Codex is organized around structured documents: both index pages and exercises are written as Markdown files (using many extensions, e.g. mathematics, allowed by the Pandoc library: https://pandoc.org/). This means that exercises can be composed using a standard text editor, archived in version control systems, copied and compared for differences, etc. The system is otherwise minimal: the server renders pages, accepts submissions, stores them in a SQLite database, runs tests and renders reports. There is a minimal admin interface, but most of the setup and configuration is done via command-line tools.

Doctest (https://docs.python.org/3/library/doctest.html) is a Python library for writing tests of functions, classes or methods embedded in the documentation. The library makes use of Python's reflection capabilities to discover tests automatically. Listing 1 shows an example of three unit tests included in the documentation string of a number squaring function. The tests mimic the interactive use of the Python interpreter (and indeed such examples can be copied from an interactive session), making it more suitable for beginners than conventional testing frameworks because there is little extra bureaucracy for writing tests.

Listing 1: Example usage of doctest for a squaring function.
import doctest

def square(n):
    '''Compute the square of a number.

    >>> square(2)
    4
    >>> square(3)
    9
    >>> square(5)
    25
    '''
    return n*n

if __name__ == '__main__':
    doctest.testmod()  # discover and run all tests

Since doctest is part of the Python standard library, no extra dependencies have to be installed, simplifying its adoption by beginners. The library also exposes an advanced API that allows reading tests from a string or file (as opposed to documentation strings); this is used by the Codex doctest tester and also by the new mutation tester presented in Section 3.
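To illustrate this string-based API, the following is a minimal sketch (ours, not taken from the Codex sources) that parses tests from a string and runs them against a chosen namespace:

import doctest

tests = """
>>> square(3)
9
"""
globs = {"square": lambda n: n * n}   # namespace the tests will run against
test = doctest.DocTestParser().get_doctest(tests, globs, "example", "<string>", 0)
results = doctest.DocTestRunner(verbose=False).run(test)
print(results)   # e.g. TestResults(failed=0, attempted=1)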

Program mutation [4] (also known as mutation testing) is a technique for assessing the quality of a test suite by generating many small variations of the code under test (called mutants) and checking that these are rejected by the test suite (in which case they are said to have been killed). A mutant that survives the test suite corresponds to either: (1) a benign change that preserves the program behavior; or (2) some input combination that is not properly tested. Thus, the number of mutants killed by a test suite can be used as a comparative indicator of test quality.

Mutated programs can be automatically derived from the original using some heuristics, e.g. changing a comparison such as less-than-or-equal (<=) to less-than (<) or vice-versa, or increasing constants by one. Several libraries exist for automatically generating and testing mutants for specific programming languages (e.g. MutMut for Python, https://github.com/boxed/mutmut, and PIT for Java, https://pitest.org/). Mutation testing is often taught in university courses on software testing or in project assignments [3, 8]. It is not, however, typically covered in introductory courses.
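As a toy illustration of this terminology (the function and tests below are ours, not taken from any of the cited tools), consider a mutant obtained by weakening a comparison:

def is_adult(age):           # original program under test
    return age >= 18

def is_adult_mutant(age):    # mutant: '>=' changed to '>'
    return age > 18

# A test far from the boundary does not kill the mutant:
assert is_adult(30) == is_adult_mutant(30) == True
# A boundary test kills it: the original returns True, the mutant returns False.
assert is_adult(18) == True and is_adult_mutant(18) == False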

3 Extending Codex with Mutation Testing

We have extended Codex with a new mudoctest tester for mutations of Python code. The tester requires the instructor to define both the correct program (the target) and one or more incorrect mutants, while students submit the doctests. Mutants are not automatically generated; this choice is deliberate so that the instructor can choose appropriate variations for the intended skill level and learning objectives. (Nonetheless, it would still be possible to generate mutants automatically as a separate offline step.)
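To make the acceptance criterion concrete, the following is a minimal sketch of how such a tester can decide a submission (this is our own illustration, not the actual Codex implementation; the helper names are ours): the student's doctests must all pass against the target and at least one test must fail against every mutant.

import doctest
import importlib.util

def load_globals(path):
    # Load a Python source file and return its global namespace.
    spec = importlib.util.spec_from_file_location("candidate", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return vars(module)

def failures(tests, globs):
    # Run doctests given as a string against a namespace; return the number of failing examples.
    test = doctest.DocTestParser().get_doctest(tests, globs, "submission", "<submission>", 0)
    return doctest.DocTestRunner(verbose=False).run(test).failed

def assess(tests, target, mutants):
    if failures(tests, load_globals(target)) > 0:
        return "rejected: some test fails on the correct program"
    if any(failures(tests, load_globals(m)) == 0 for m in mutants):
        return "rejected: some mutant survives the test suite"
    return "accepted"

# Example: assess(student_tests, "cond0.py", ["cond1.py"])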

3.1 Sample exercise: conditionals

Figure 1: A sample exercise on conditionals.

Consider the sample assignment of Figure 1 based on an example from [5]. Such an exercise could be presented after introducing conditional execution, and is meant to alert students to the incorrect handling of negation (essentially failing to apply De Morgan’s laws).
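A plausible reconstruction of the two definitions in Figure 1 is shown below (the parameter names are our guesses and the message strings follow the example outputs in this section; the actual exercise files may differ):

# cond0.py -- definition A (correct)
def attack(charge, shield):
    if charge >= 0.9 and shield >= 100:
        print("The dragon crumble in a heap")
    else:
        print("Your attack has no effect")

# cond1.py -- definition B (incorrect "simplification" of the condition)
def attack(charge, shield):
    if charge < 0.9 and shield < 100:   # wrong: should be 'or' by De Morgan's laws
        print("Your attack has no effect")
    else:
        print("The dragon crumble in a heap")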

Figure 2: Sample report on a successful submission.

To successfully complete the assignment, students should submit a test case for which only one of the two conditions fails, e.g.

>>> attack(0.75, 100)
Your attack has no effect

Figure 2 shows the feedback for such a submission. The report shows the test runs, exposing the incorrect behaviour of the mutant (the attack should have no effect if either condition fails).

Note that the submission would be rejected if the student had chosen an input leading to a successful attack (because in such a case both programs exhibit the same behaviour). This is a subtle point that beginners often misunderstand: programs may behave correctly for some inputs but still be incorrect.

Another possibility for failure is choosing an incorrect test case, e.g.:

>>> attack(0.75, 100)
The dragon crumble in a heap

The correct definition A is rejected by this test case, hence it cannot be used to show that definition B is wrong.

Listing 2: Sample exercise description in Markdown.
 1  ---
 2  tester: mudoctest
 3  language: python
 4  target: "cond0.py"
 5  mutants: [ "cond1.py" ]
 6  ...
 7
 8  # Negation of conditions
 9
10  Suppose we can slay the dragon only if our magic lightsabre sword is
11  charged to 90% or higher, and we have 100 or more energy units in our
12  protective shield. We find the definition A in the Python code
13  of the game and someone attempts to simplify it to definition B.
14
15  ## Definition A (correct)
16
17  ~~~ {.include src="cond0.py" }
18  ~~~
19
20  ## Definition B (wrong)
21
22  ~~~ {.diffs from="cond0.py" to="cond1.py" }
23  ~~~
24
25  Show that definition B is *not* equivalent to A by writing a single
26  test case that accepts definition A but rejects definition B.
To set up such an exercise the instructor need only prepare the Python files with the target and mutant code plus the Markdown file shown in Listing 2. Lines 1–6 define a YAML block of metadata for the Codex system. The remaining lines constitute the exercise description. Lines 17 and 22 are directives for Codex to include the Python source code of both the target and mutant functions, displaying the latter as a textual difference (to emphasize the changes made).

3.2 Run-length encoding

Consider now a more advanced exercise involving nested loops, string indexing and building lists: compute the run-length encoding of a string. More concretely, we want a function encode that counts runs of identical characters and builds a list of pairs of characters and respective counts. For example, the call encode('aaabcc') should yield [('a',3),('b',1),('c',2)].

def encode(text):
    pairs = []
    i = 0
    while i < len(text):
        j = i+1
        while j < len(text) and text[j] == text[i]:
            j = j+1
        pairs.append((text[i], j-i))
        i = j
    return pairs

(a) Correct implementation.

def encode(text):
    pairs = []
    i = 0
    while i < len(text):
        j = i + text.count(text[i])
        pairs.append((text[i], j-i))
        i = j
    return pairs

(b) Incorrect optimization.

Figure 3: Run-length encoding.

We give students the two definitions in Figure 3: a correct implementation of run-length encoding and an incorrect attempt at optimization, replacing the inner loop with a use of the count method on strings. The objective is to submit a test case that shows that the two versions are not equivalent.

Students need to study the programs and realize that the count method does not distinguish separate runs; this means a string such as 'aaabaa' will be incorrectly encoded as having a run of 5 'a's. Hence, the test case should include at least two runs of the same character to exhibit the fault.
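For instance, a submission containing the single test case below would be accepted: the expected output matches the correct implementation in Figure 3(a), whereas the incorrect version in Figure 3(b) yields [('a', 5), ('a', 5)] and is therefore rejected.

>>> encode('aaabaa')
[('a', 3), ('b', 1), ('a', 2)]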

3.3 Testing

It is also possible to use mudoctest for exercises that explicitly focus on teaching testing: given a function defined with various branches, ask the students to come up with a good suite of tests, i.e. at least one test for each branch.

def reward(amount):
    if amount < 0:
        print("invalid amount")
    elif amount <= 10:
        print("bronze")
    elif amount <= 20:
        print("silver")
    else:
        print("gold")

def reward(amount):
    if amount < 0:
        pass
    elif amount <= 10:
        print("bronze")
    elif amount <= 20:
        print("silver")
    else:
        print("gold")
Figure 4: A function for computing rewards and one mutant.

Consider the functions in Figure 4; we could set up a mutation exercise using this code as the target and four mutants that simply pass instead of printing in each of the four branches. A student's test suite would then only be accepted if it includes at least one value for each of the branches.
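For instance, a test suite such as the following (the input values are ours) exercises every branch and would therefore kill all four mutants:

>>> reward(-5)
invalid amount
>>> reward(5)
bronze
>>> reward(15)
silver
>>> reward(25)
gold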

It is also possible to specify in the exercise metadata a maximum number of test cases that the student can provide. This has two advantages: it bounds the effort required of the student and prevents attempts that just keep adding random test cases until the submission is accepted.

4 Related Work

Oliveira et al. [9] present results from an experiment on using a mutation testing tool for Pascal programs with 20 beginner students. They report better scores in a post-experiment code understanding survey for the group using mutation testing compared to the control group.

Clegg et al. [3] used a "Code Defenders" game for teaching software testing using mutations. Players alternate between the roles of attacker and defender; attackers submit mutants, while defenders submit test suites. The game assigns scores to players as they interact in these roles.

In his PhD thesis [2], Clegg investigated the use of mutation testing for improving the quality of test suites used by instructors when automatically grading students' code.

Mansur et al. [8] studied the effect of teaching mutation testing to students in a Data Structures and Algorithms programming project. They found that mutation testing had a positive effect on the quality of students' test suites.

5 Reflections and Further Work

We believe that mutation testing exercises for introductory programming, in the style proposed in this paper, could be beneficial to students. A combination of factors enables this, namely the lightweight syntax of Python doctest suites and the automatic feedback provided by Codex. However, there are a number of caveats that we anticipate:

  1. Doctest scripts are interpreted textually, meaning that extra white space or the choice of delimiters (e.g. single or double quotes) may cause a test case to fail.

  2. Runtime errors during the running of test suites might be confusing to students.

  3. The terminology employed in the Codex-generated reports is rather technical (e.g. "target" and "mutant") and should perhaps be revised to better suit a beginner audience.

As further work, we would like to experiment in a real classroom with such exercises alongside more conventional ones where students are required to write code, and gather feedback from students and instructors.

Another interesting direction is to investigate the use of generative AI tools by students when solving mutation testing exercises. Some initial research exists on the use of AI tools for exercises about writing code [6], but it might not apply to exercises about writing tests. Our anecdotal experience suggests that e.g. ChatGPT can explain code differences and even suggest good test cases for simple programs. It would be worthwhile to investigate whether students could benefit from such explanations for self-study.

References

  • [1] Teresa Busjahn and Carsten Schulte. The use of code reading in teaching programming. In Proceedings of the 13th Koli Calling International Conference on Computing Education Research, Koli Calling ’13, pages 3–11, New York, NY, USA, 2013. Association for Computing Machinery. doi:10.1145/2526968.2526969.
  • [2] Benjamin S. Clegg. The application of mutation testing to enhance the automated assessment of introductory programming assignments. PhD thesis, University of Sheffield, September 2021. Unpublished. URL: https://etheses.whiterose.ac.uk/id/eprint/30513/.
  • [3] Benjamin S. Clegg, Jose Miguel Rojas, and Gordon Fraser. Teaching software testing concepts using a mutation testing game. In 2017 IEEE/ACM 39th International Conference on Software Engineering: Software Engineering Education and Training Track (ICSE-SEET), pages 33–36, 2017. doi:10.1109/ICSE-SEET.2017.1.
  • [4] Richard A. DeMillo, Richard J. Lipton, and Frederick G. Sayward. Hints on test data selection: Help for the practicing programmer. Computer, 11(4):34–41, 1978. doi:10.1109/C-M.1978.218136.
  • [5] Allen B. Downey. Think Python: How to think like a computer scientist. Green Tea Press, 3rd edition, 2023. URL: https://greenteapress.com/wp/think-python-3rd-edition/.
  • [6] Majeed Kazemitabaar, Xinying Hou, Austin Henley, Barbara Jane Ericson, David Weintrop, and Tovi Grossman. How novices use LLM-based code generators to solve CS1 coding tasks in a self-paced learning environment. In Proceedings of the 23rd Koli Calling International Conference on Computing Education Research, Koli Calling ’23, New York, NY, USA, 2024. Association for Computing Machinery. doi:10.1145/3631802.3631806.
  • [7] Hieke Keuning, Johan Jeuring, and Bastiaan Heeren. Towards a systematic review of automated feedback generation for programming exercises. In Proc. of the 2016 ACM Conf. on Innovation and Technology in Computer Science Education, ITiCSE ’16, pages 41–46. ACM, 2016. doi:10.1145/2899415.2899422.
  • [8] Rifat Sabbir Mansur, Clifford A. Shaffer, and Stephen H. Edwards. Mutating matters: Analyzing the influence of mutation testing in programming courses. In Proceedings of the 2024 on ACM Virtual Global Computing Education Conference V. 1, SIGCSE Virtual 2024, pages 151–157, New York, NY, USA, 2024. Association for Computing Machinery. doi:10.1145/3649165.3690110.
  • [9] Rafael A. P. Oliveira, Lucas B. R. Oliveira, Bruno B. P. Cafeo, and Vinicius H. S. Durelli. On using mutation testing for teaching programming to novice programmers. International Conference on Computers in Education, November 2014. doi:10.58459/icce.2014.413.
  • [10] Pedro Vasconcelos and Rita P. Ribeiro. Using Property-Based Testing to Generate Feedback for C Programming Exercises. In Ricardo Queirós, Filipe Portela, Mário Pinto, and Alberto Simões, editors, First International Computer Programming Education Conference (ICPEC 2020), volume 81 of Open Access Series in Informatics (OASIcs), pages 28:1–28:10, Dagstuhl, Germany, 2020. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. doi:10.4230/OASIcs.ICPEC.2020.28.