Property-based Testing #5: Shrinking Choices, Shrinking Values

Property-based Testing #5: Shrinking Choices, Shrinking Values

This is the fifth post in a series about property-based testing. This post describes "internal shrinking", a different implementation of shrinking that has some interesting advantages. Previous posts are:

  1. Property-based Testing: What is it anyway?

  2. The Essentials of Vintage QuickCheck

  3. Shrinking Take One

  4. Unifying Random Generation and Shrinking

The complete code for this post can be found on GitHub - in particular example.py and internal_shrink.py.

In the last two posts we tried shrinking values by directly changing values: we assumed an oracle that comes up with smaller candidate values, and search those in the hope to find some that still make the test fail.

I was going to keep this photo for when I become a captain of industry and write thought leadership posts. Oh well. Photo by Jon Tyson on Unsplash

We touched on one annoying problem with this approach: it doesn't deal well with mutable objects, because CandidateTree captures the randomly generated object as its root, and then lazily comes up with smaller candidates. If the object is mutated in between capture and finding a smaller candidate, things go sideways. We can try to work around this problem by asking users to clone mutable values, but we can do better - by changing our approach.

To introduce the main idea behind the new approach, we'll need some context - another perspective on how random generation works.

It's all about making choices

Let's check where in our mini-PBT library we make random choices. All we need to do is look up where we use Python's random module. There's only one place where it's called:

def int_between(low: int, high: int) -> Random[int]:
    return Random(lambda: random.randint(low, high))

(You may remember in the last post this was called random_int_between to disambiguate with tree_int_between. However, thanks to the implementation in this post, we won't need the tree_ functions anymore, so I've omitted the random_ prefix.)

When we generate a complex value like a list of Person objects, ultimately int_between chooses the length of the list, the letters of the persons' names and their age. The code of the generators makes that easy to see:

ages = int_between(0,100)
letters = map(chr, int_between(ord('a'), ord('z')))
simple_names = map("".join, list_of_length(6, letters))
persons = mapN(lambda a: Person(*a), (simple_names, ages))
lists_of_person = list_of(persons)

Reading bottom-up, i.e. in execution order, random choices are:

  1. list_of chooses a random length of the list.

  2. persons generates the first Person. It first generates a simple_name by randomly choosing 6 letters. Each letter is chosen by choosing a random integer, and then mapping that to a letter.

  3. A random age is chosen.

  4. Repeat 2 and 3 for each element of the list.

All told we make 7 choices per Person, and one choice to figure out how many Person objects we should generate.

Exactly how many choices there are is not important - the point is that we can think of creating a list of Person objects as a sequence of random choices, which are represented by a sequence of integers generated by int_between. The process of generating any random value is then a function that takes a sequence of choices as input, and creates a corresponding value as output.

We can use this insight to reproduce any randomly generated value. While generating a random value we record the sequence of choices the generator makes. To re-create the value we simply run the generator again, but we replay the same choices we made randomly before. You can think of the sequence of choices as the DNA of the generated value: choice sequence and generator uniquely determine the value.

Hopefully you're with me so far. The last step is to ask: is it possible to manipulate the choice sequence to make the generated value smaller? If we could do that, we'd effectively have a way to shrink.

Looking through the example, this seems possible. For example, if we make the choice of list length smaller, we'd generate shorter lists. If we make the choice of letter smaller, we'd generated smaller letters and names. We can also try more interesting changes to the choice sequence - for example we could generate a list of the same elements but in a different order by moving the sub-sequences that determine each person around. In principle, any change directly to the value can also be made by by changing the choice sequence instead.

It turns out this approach works well in practice. This method of shrinking was invented by David R. MacIver for the Python property-based testing library Hypothesis. He calls it "internal shrinking" as opposed to "external shrinking" which we've been discussing in the last two posts. Internal/external is relative to the random generation process: with external shrinking, we make some random choices and generate a value, which is then made smaller by directly manipulating the value. Shrinking is external to random generation. With internal shrinking on the other hand, we manipulate the sequence of random choices to indirectly get smaller values. Shrinking is internal to the random generation process, and runs the exact same generator code, just with a different choice sequence.

A simplified implementation

Let's look at this idea more concretely. We're not changing the way values are initially randomly generated, but we do need to somehow remember the sequence of choices that lead to that value, as that sequence is what we'll be using for shrinking.

We'll start with wrapping the call to random.randint so we can record the choice sequence. I've chosen to write a small class for this wrapper. There is more to it than I'm showing here, but we'll go piecemeal.

class ChoiceSeq:
    def __init__ (self) -> None:
        self.history: list[int] = []

    def randint(self, low: int, high: int) -> int:
        result = random.randint(low, high)
        self.history.append(result)
        return result

ChoiceSeq makes a random choice, and keeps the results. Now we need to change the implementation of int_between to use this class instead of random.randint directly. To do that we need to change Random to pass an instance of ChoiceSeq around so we can use it in int_between:

class Random(Generic[T]):
    def __init__ (self, generator: Callable[[ChoiceSeq], T]):
        self._generator = generator

    def generate(self, choose: ChoiceSeq) -> T:
        return self._generator(choose)

def int_between(low: int, high: int) -> Random[int]:
    return Random(lambda choose: choose.randint(low, high))

Other functions are essentially unchanged, e.g. here is map:

def map(func: Callable[[T], U], gen: Random[T]) -> Random[U]:
    return Random(lambda choose: func(gen.generate(choose)))

Since map doesn't have to make a new choice, it can just pass the ChoiceSeq through. The same is true for mapN and bind.

The types are also unchanged from before, as well as the implementation of for_all:

Gen = Random[T]

Property = Gen[TestResult]

def for_all(
        gen: Gen[T], 
        property: Callable[[T], Union[Property,bool]]
    ) -> Property:
    ... # unchanged from last post

However, test is where it gets interesting:

def test(property: Property):
    def do_shrink(choices: ChoiceSeq) -> None:
        ...

    for test_number in range(100):
        choices = ChoiceSeq()
        result = property.generate(choices)
        if not result.is_success:
            print(
                f"Fail: at test {test_number} with " 
                f"arguments {result.arguments}."
            )
            do_shrink(choices)
            return
    print("Success: 100 tests passed.")

We start out much the same as always: by generating up to 100 random values. However, thanks to the ChoiceSeq class, we've now got a record of all the random choices that were made during generation of each value. When all tests pass, we don't do anything with the choices, although you could imagine storing them in a database, as a reproducible log of the executed tests - at least provided the generators don't change too much!

For this post, we only care about do_shrink. Given a choice sequence, we now need to figure out how we're going to change it to make the resulting value smaller.

In a real implementation, this is a decent engineering challenge to do well. I'll just demonstrate the principle, by re-using our list shrinking function from a couple posts ago:

def shrink_candidates(choices: ChoiceSeq) -> Iterable[ChoiceSeq]:
    # this is part of the list shrinker from vintage.py!
    for i,elem in enumerate(choices.history):
        for smaller_elem in shrink_int(elem):
            smaller_history = list(choices.history)
            smaller_history[i] = smaller_elem
            yield ChoiceSeq(smaller_history)

The idea is that we take the history out of ChoiceSeq, which is a list of int, and create a new list with one of the elements made smaller using shrink_int. We'll need to change ChoiceSeq as well, which we'll do later. As far as do_shrink goes, an initial attempt looks as follows:

def do_shrink(choices: ChoiceSeq) -> None:
    for smaller_choice in shrink_candidates(choices):
        result = property.generate(smaller_choice)
        if not result.is_success:
            print(
                "Shrinking: found smaller arguments " 
                f"{result.arguments}"
            )
            do_shrink(smaller_choice)
            break
    else:
        print(f"Shrinking: gave up")

do_shrink takes the initial failing ChoiceSeq, generates smaller values via shrink_candidates and recursively shrinks if it finds a successful shrink.

What is going on in property.generate(smaller_choice)? In that call, we want the ChoiceSeq to replay the choices that were made previously, instead of randomly making new choices and appending them. We need to add replay behavior to ChoiceSeq:

class ChoiceSeq:
    def __init__ (self, history: Optional[list[int]] = None):
        if history is None:
            self._replaying: Optional[int] = None
            self.history: list[int] = []
        else:
            self._replaying = 0
            self.history = history

    def randint(self, low: int, high: int) -> int:
        if self._replaying is None:
            # recording
            result = random.randint(low, high)
            self.history.append(result)
            return result
        else:
            # replaying
            if self._replaying >= len(self.history):
                raise InvalidReplay()
            value = self.history[self._replaying]
            self._replaying += 1
            if value < low or value > high:
                raise InvalidReplay()
            return value

    def replay(self) -> None:
        self._replaying = 0

    def replayed_prefix(self) -> ChoiceSeq:
        if self._replaying is None:
            raise InvalidOperation()
        return ChoiceSeq(self.history[:self._replaying])

There's a lot going on here so bear with me.

ChoiceSeq now has two modes: record or replay. When created without a history argument, it is in recording mode - any call to randint generates a new random int and appends it to the list, as above. However, if replay is called explicitly, or a history parameter is given to __init__, then ChoiceSeq replays the history of choices without any random choices. In other words, creating a new ChoiceSeq(choices.history) and passing that to generate exactly re-creates the original value. The premise of internal shrinking is that we can generate smaller values by making the list of choices smaller.

There is however a snag - by changing choices earlier in the sequence, we'll influence the behavior of the generator. For example, the generator may need more history than the initial randomly generated value that was recorded, or an integer in the history is out of the desired bounds. In those cases, we fail with InvalidReplay - later on, we'll react to InvalidReplay by declaring the resulting value as an unsuccessful shrink. Not that if this happens, we bail out quickly, in most cases before we even run the test.

Vice versa, thanks to our devious manipulations, we'll hopefully use fewer choices - that indicates we've actually managed to make the value smaller. For example, if we make the length of a list smaller, then it's likely we won't use the the last part of the history. The method replayed_prefix extracts the part of the choice history that is used so far, so we can keep shrinking just that part.

With all that, here is the final version of do_shrink:

def do_shrink(choices: ChoiceSeq) -> None:
    for smaller_choice in shrink_candidates(choices):
        try:
            result = property.generate(smaller_choice)
        except InvalidReplay:
            continue
        if not result.is_success:
            print(
                "Shrinking: found smaller arguments" 
                f"{result.arguments}"
            )
            do_shrink(smaller_choice.replayed_prefix())
            break
    else:
        choices.replay()
        print(
            "Shrinking: gave up at arguments"
            f"{property.generate(choices).arguments}"
        )

There are three things to note here:

  • As explained, we react to InvalidReplay by considering the shrink as failed, and carry on with the next candidate.

  • If a shrink is successful, we carry on with the part of the choice sequence that's actually used by calling replayed_prefix.

  • To show the value of our last and most successful shrink, we use the replay method to reproduce the last value for display purposes.

And that is the essence of the internal shrinking approach! Let's see it in action next. But first: coffee and biscuit time.

white ceramic cup with cappuccino on white ceramic saucer Photo by Kelvin Han on Unsplash

Putting it to the test

We don't need to change usage of our API at all - generators and properties are implemented the same as before. This is an amazing achievement - the original QuickCheck API from over 20 years ago needs no changes despite one big refactor and one re-implementation. There's not many APIs I've designed that I expect to pass the test of time that well.

Let's try our new implementation on our usual example sort_by_age. If you need a refresher, see the first post in this series.

> test(prop_sort_by_age)

Success: 100 tests passed.

Hardly interesting. Shrinking is what we want to see!

> test(prop_wrong_sort_by_age)

Fail: at test 1 with arguments ([Person(name='pjahih', age=0), Person(name='iapxij', age=55), Person(name='mvrnyq', age=57), Person(name='qyzwko', age=94), Person(name='iuzotw', age=4)],).
Shrinking: found smaller arguments ([Person(name='pjahih', age=0), Person(name='iapxij', age=55), Person(name='mvrnyq', age=57), Person(name='qyzwko', age=94)],)

Shrinking: found smaller arguments ([Person(name='pjahih', age=0), Person(name='iapxij', age=55), Person(name='mvrnyq', age=57)],)
Shrinking: found smaller arguments ([Person(name='pjahih', age=0), Person(name='iapxij', age=55)],)

Shrinking: found smaller arguments ([Person(name='ojahih', age=0), Person(name='iapxij', age=55)],)

Shrinking: found smaller arguments ([Person(name='njahih', age=0), Person(name='iapxij', age=55)],)
...
Shrinking: found smaller arguments ([Person(name='abaaaa', age=0), Person(name='aaaaaa', age=2)],)

Shrinking: found smaller arguments ([Person(name='abaaaa', age=0), Person(name='aaaaaa', age=1)],)

Shrinking: gave up at arguments ([Person(name='abaaaa', age=0), Person(name='aaaaaa', age=1)],)

Remember, wrong_prop_sort_by_age doesn't sort a list of Person objects by age, but by name and age. This particular shrink had about 160 successful shrinks out of 20,000 tries. It's definitely putting in the effort, and the result is not too shabby: it has found a minimal counterexample. In my cursory testing it comes up with a two person list with ages 0 and 1 most of the time (and names being some variant of aaaaaa and baaaaa). Sometimes it only finds a three person list.

This is a good result! The occasional misses don't indicate a shortcoming, just that my implementation is simple minded. By manipulating the choice sequence in smarter ways, it's possible to make this shrink to a single canonical example every time. If that doesn't work for some case, then we need to add smarter strategies - Hypothesis has an internal mini-language to describe how to manipulate the choice sequence for various cases - apparently, shrinking floating point values in particular needs special care.

Conclusion

It definitely seems like we have a winner on our hands. Compared to the previous approach we have the same outcome: we don't need to burden the user with writing shrink candidate functions in addition to generator functions. Since shrinking is now internal to random generation, the question of how to integrate random generation and shrinking doesn't even come up - the two are integrated at the core. Because of this tighter integration, internal shrinking has additional advantages:

  • Works great for mutable objects. Each shrink re-creates the object from scratch from the replayed choice sequence, running the exact same generator code as during random generation. As a result, this approach is much better suited for imperative-minded languages like Python, C, C# and Java.

  • Can reproduce shrunk values. So far it's been easy to reproduce randomly generated values, by storing the pseudo-random seed, but that doesn't work for any of the shrunk values. But now that we have the choice sequence, we can easily reproduce the shrunk values too.

  • Bind without remorse. In the last post, we noticed that if a generator is written using bind, the integrated approach loses a lot of information because it needs to randomly re-generate values for the inner generator. For example, if you write a generator for a list by randomly choosing a length, and then use bind to generate the values, every time it shrinks the length, the values of the list are randomly re-generated. With internal shrinking, this is not the case - the choice sequence can be changed so the exact same elements are re-generated. You can see this in the example above: the first few shrinks chop elements from the end of the list, while keeping the first elements the same. The end result is that internal shrinking is better at preserving information from the initial value that made the test fail.

All seems well. So, is this the final word in shrinking for property-based testing? Not entirely! I know of one more approach, which again has all the advantages of internal shrinking, with a simpler implementation. But for that, you'll have to wait for the next post...

As usual, comment below or discuss on Twitter @kurt2001.

Until next time!