The Maintainability Benefits of Property-Based Testing

Traditional tests are example-based. Many different examples are needed to cover all edge cases of a system under test, which often leads to bloated test suites that eventually become a maintenance burden. This article presents the idea of property-based testing, using an HTTP API as an example, and shows that property-based tests result in more concise and maintainable test code.

The Shopping List Application

The following snippet presents a small, runnable web application based on FastAPI. Package imports were left out for improved readability.

app = FastAPI()

class ShoppingItem(BaseModel):
    name: str

shopping_items = []

@app.get("/item")
async def list_items() -> Sequence[ShoppingItem]:
    return shopping_items

@app.post("/item")
async def add_item(item: ShoppingItem) -> ShoppingItem:
    shopping_items.append(item)
    return item

The application manages an in-memory list of shopping items and provides an /item route where existing shopping items can be retrieved and new ones added. [1] Let's send some requests and see what happens.

$ curl -H 'Content-Type: application/json' -d '{"name": "Butter"}' http://localhost:8000/item
$ curl -H 'Content-Type: application/json' -d '{"name": "Milk"}' http://localhost:8000/item
$ curl http://localhost:8000/item

As we expected, the POST response contains the added shopping item. The GET response contains a list of all items that were added to the shopping list.

Testing the Application

We could write an example-based test for the above curl commands. But that would test no more than the happy path: Does GET /item correctly return an empty list when no items have been added? What if the name of a shopping item contains special characters? Each of these cases usually requires a separate test, and we could end up with something like this:

def test_get_items_returns_empty_list_when_no_items_are_present():
    response = http.get("/item")

    returned_items = response.json()

    assert returned_items == []

def test_get_items_supports_special_characters_in_item_name():
    name_with_special_characters = "~MyÍtêm~"
    item_with_special_characters = dict(name=name_with_special_characters)
    http.post("/item", json=item_with_special_characters).raise_for_status()

    response = http.get("/item")

    returned_items = response.json()
    assert returned_items == [item_with_special_characters]

def test_get_items_returns_all_added_items():
    items = [dict(name="Butter"), dict(name="Milk")]
    for item in items:
        http.post("/item", json=item).raise_for_status()

    response = http.get("/item")

    returned_items = response.json()
    assert returned_items == items

In a property-based test, we can cover all of these scenarios at once. Rather than enumerating edge cases, we take a step back and think about the desired behaviour of the system under test. One property of our example system is that a shopping item appears in the item list after we add it.

http = TestClient(app)  # Provided by FastAPI

# Hypothesis strategy that creates items with pseudo-random names
shopping_item = builds(ShoppingItem, name=text())

@given(items=lists(shopping_item))
def test_get_items_returns_added_items(items: Sequence[ShoppingItem]):
    for item in items:
        http.post("/item", json=item.dict()).raise_for_status()

    response = http.get("/item")

    returned_items = parse_obj_as(Sequence[ShoppingItem], response.json())
    assert len(returned_items) >= len(items)
    for expected_item in items:
        assert expected_item in returned_items

The test case receives a list of pseudo-random shopping items generated by Hypothesis. The items are added via API calls, and we raise an exception if one of the calls fails. When we retrieve the list of shopping items, we expect the result to contain all items we added (or possibly more [2]). The test case is run many times with different lists of shopping items as input.
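How many times "many" is can be tuned. As a small, self-contained sketch (the counter-based test here is illustrative, not part of the shopping list suite), Hypothesis's max_examples setting controls how many inputs are generated; the default is 100:

```python
from hypothesis import given, settings, strategies as st

run_count = 0

@settings(max_examples=25)  # generate at most 25 example lists
@given(st.lists(st.text()))
def test_counts_runs(items):
    # Count how often Hypothesis actually invokes the test body.
    global run_count
    run_count += 1

test_counts_runs()
# For a passing test, the number of executions stays within the budget.
assert 1 <= run_count <= 25
```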

I deliberately emphasized pseudo-randomness, because Hypothesis does not generate examples purely at random: it actively tries to find edge cases. In the test case above, we can be sure that the test is executed with an empty list of input items. Hypothesis will also vary the contents of shopping items and is likely to create items with empty names, ASCII names, and names containing Unicode characters. We effectively substituted at least three example-based tests with this single property-based test.
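This bias towards small, tricky inputs can be made visible with Hypothesis's find helper, which returns the minimal example a strategy can produce that satisfies a given condition. A small sketch, independent of the shopping list app:

```python
from hypothesis import find, strategies as st

# find() shrinks to the *minimal* example satisfying the condition.
# For an unconditional predicate, that is the simplest value the
# strategy can produce: the empty list, and the empty string.
minimal_items = find(st.lists(st.text()), lambda _: True)
minimal_name = find(st.text(), lambda _: True)

assert minimal_items == []
assert minimal_name == ""
```

These are exactly the edge cases that our property-based test is guaranteed to exercise.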


To be fair, there are also drawbacks to our property-based test. By default, the test is non-deterministic: it will test a different set of examples each time it is executed [3]. It may be undesirable that your Continuous Integration pipeline fails just because Hypothesis found an edge case that is never relevant for productive use. [4]
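One of the settings mentioned in [4] is derandomize. A minimal sketch (the test function and its property are illustrative):

```python
from hypothesis import given, settings, strategies as st

# With derandomize=True, Hypothesis derives its examples
# deterministically from the structure of the test itself, so every
# run (and every CI build) sees the same inputs. This trades some
# exploratory power for reproducibility.
@settings(derandomize=True)
@given(st.lists(st.text()))
def test_with_stable_examples(items):
    # A trivially true property; the point is the stable generation.
    assert isinstance(items, list)

test_with_stable_examples()
```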

As you move to property-based tests, you might observe increasing runtimes of your test suite. On the other hand, the tests are exercising the code in more ways and cover more edge cases, so I suppose these two points even each other out.

The biggest win, in my opinion, is that property-based tests tend to be more concise than example-based tests. That means we can reach the same test coverage and make the same assertions about the system under test with less code. Since test code tends to get refactored less than application code, this is a huge win in terms of maintainability.

Personally, I cannot recommend Hypothesis enough. I have yet to see a library of similar quality for other languages [5]. In fact, when choosing a language or technology for a service or project, Hypothesis has tipped the scales in favour of Python more than once. I highly recommend looking into Hypothesis, or a property-based testing library for a language of your choice, when you want to reduce the maintenance effort of your software tests.

  1. FastAPI makes use of pydantic for serializing ShoppingItem objects from and to JSON automatically. That's why we can simply return a list of objects in the code and get a list of JSON objects in the response. ↩︎

  2. This is not a unit test of the list_items function, but an integration test of the whole web application. FastAPI's TestClient uses the actual ASGI app to execute the test. Since Hypothesis generates different shopping lists and performs different test runs, there may be items in the shopping list from the last test run. Therefore, we cannot check that the result is the exact same list of items we added, but have to use a more relaxed assertion. ↩︎

  3. Hypothesis does cache generated examples. When an example causes a test to fail, developers can rely on Hypothesis to replay the same set of examples, so that the test fails deterministically. ↩︎

  4. Note that the non-determinism can be disabled with appropriate Hypothesis settings. ↩︎

  5. I heard ScalaCheck is pretty good, but I haven't had the chance to dabble in Scala, yet. ↩︎