Events should always be self-contained

Michael Seifert, 2020-03-31, Updated 2020-10-01

The term Event-Driven can mean different things to different people. Fowler (2017) identifies three categories of events. However, his typology does not discuss the contents of each kind of event. I present a classification of events based on their payload and argue that, in the context of distributed systems, events should always contain the whole state of the modified entity.

Imagine a system with two different services, a customer service and an order processing service. The customer service manages customer entities and the order processing service needs to be aware of customer data, such as the shipping address. Both communicate using events. When a customer is added, deleted or modified, the customer service publishes events. There are three different types of payload that can be attached to the event: ^[1] The customer service may choose to include a reference to the modified customer, it may include only the customer properties that changed (e.g. an email address), or it may send the new state of the changed customer.

Classifying events by their payload

The simplest of events just notify the order processing service that something happened. They usually include some form of information, such as the ID (=reference) of the customer that was modified, for example:

{
  "id": 42
}

Java Swing events are a good example of what I call referential events. They are very lightweight, because they have almost no payload. When the consumer is interested in the referenced entity,^[2] it has to make a call to another system (e.g. the database) to get that information. This construct has several flaws. For one, every event consumer has at least two dependencies: the event source and the system that holds the referenced entity. For another, calls to other systems can fail in a great many ways, so we should try to avoid them.^[3]
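To make the extra dependency concrete, here is a minimal Python sketch of a consumer handling a referential event. The names `CUSTOMER_DB`, `fetch_customer` and `handle_referential_event` are hypothetical and only serve the illustration; the in-memory dictionary stands in for whatever system actually holds the customer.

```python
from typing import Callable, Dict, Optional

Customer = Dict[str, object]

# Hypothetical stand-in for the system that owns the customer data.
CUSTOMER_DB: Dict[int, Customer] = {
    42: {
        "id": 42,
        "firstName": "John",
        "lastName": "Doe",
        "emailAddress": "john.doe@example.com",
    },
}


def fetch_customer(customer_id: int) -> Optional[Customer]:
    """Simulates the call to another system. A real call could time out or fail."""
    return CUSTOMER_DB.get(customer_id)


def handle_referential_event(
    event: Dict[str, int],
    fetch: Callable[[int], Optional[Customer]] = fetch_customer,
) -> Optional[Customer]:
    """The event carries only a reference, so the consumer has to look up the entity."""
    return fetch(event["id"])


customer = handle_referential_event({"id": 42})
```

The consumer cannot do anything useful with the event until `fetch` returns, so its latency and availability are tied to the system behind that call.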

These calls to external systems can be reduced if the customer service adds actual information about the change to the event. When the customer updates their email address, the event would contain the new email address in addition to the customer reference:

{
  "id": 42,
  "emailAddress": "john.doe@example.com"
}

Such differential events spare us a network call to another system, but now the order processing service has to apply each change to its internal copy of the customer. This can make the implementation of the event consumer quite complex. If the implementation is not robust enough, the customer data in the order processing service may drift out of sync with the true customer data.
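A rough Python sketch of such a consumer might look as follows. It assumes an in-memory dictionary `local_customers` as the order processing service's copy of the data; both that name and `apply_differential_event` are made up for this example.

```python
from typing import Dict

# Hypothetical local copy of customer data held by the order processing service.
local_customers: Dict[int, dict] = {
    42: {
        "id": 42,
        "firstName": "John",
        "lastName": "Doe",
        "emailAddress": "john@old.example.com",
    },
}


def apply_differential_event(store: Dict[int, dict], event: dict) -> None:
    """Merge the changed properties into the existing local copy."""
    customer_id = event["id"]
    if customer_id not in store:
        # Without a base state there is nothing to merge the change into.
        raise KeyError(f"no local copy of customer {customer_id}")
    changes = {key: value for key, value in event.items() if key != "id"}
    store[customer_id].update(changes)


apply_differential_event(
    local_customers, {"id": 42, "emailAddress": "john.doe@example.com"}
)
```

The merge is only correct if the consumer already holds the right base state. A consumer that missed an earlier event, or never saw the customer being created, has no way to recover from the event alone.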

We wouldn't have to put up with differential updates if the customer service simply sent us all customer data. The order processing service would receive self-contained events that effectively represent a copy of the modified entity. Consumers of self-contained customer events can write the customer state to their own databases^[4] and never need to fetch the entity from other systems. This implementation is much less error-prone, because the consumer simply overwrites its copy with the state from the event whenever an update arrives. Even if the customer only changed the email address, consumers would still receive all other customer data:

{
  "id": 42,
  "firstName": "John",
  "lastName": "Doe",
  "emailAddress": "john.doe@example.com"
}
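For comparison, a consumer of self-contained events can be sketched in a few lines of Python. The `store` dictionary and the function name `apply_self_contained_event` are assumptions for illustration, not part of any particular framework.

```python
from typing import Dict


def apply_self_contained_event(store: Dict[int, dict], event: dict) -> None:
    """Overwrite the local copy with the full state carried by the event."""
    store[event["id"]] = dict(event)


store: Dict[int, dict] = {}
event = {
    "id": 42,
    "firstName": "John",
    "lastName": "Doe",
    "emailAddress": "john.doe@example.com",
}

apply_self_contained_event(store, event)
# Applying the same event again leaves the store unchanged.
apply_self_contained_event(store, event)
```

The consumer needs no prior knowledge of the customer and makes no call to another system: whatever the event says becomes the local state.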

I strongly recommend using self-contained events wherever possible. When designing a data structure for an event, ask yourself: Does this contain all the information the consumer needs? The added complexity of differential events creates a huge maintenance burden, and referential events often point to architectural flaws. I have personally seen referential events severely limit the performance of event consumers. A process that could have finished within milliseconds using a self-contained event took seconds or even tens of seconds instead.

Dealing with sensitive data and bloated events

As the system grows, more and more fields are added to the customer event. This includes sensitive credit card information, which is needed by our newly introduced payment service. The order processing service is still only interested in the shipping address, though. This poses two problems. First, the customer events become less efficient over time, because they contain information that is not relevant to the respective consumer. Even worse, customer events contain sensitive data that leaks to parts of the system where it doesn't belong (i.e. the order processing service). We have to find a way to reduce event bloat and secure sensitive information. Assume the following data in a customer event:

{
  "id": 42,
  "firstName": "John",
  "lastName": "Doe",
  "emailAddress": "john.doe@example.com",
  "address": "13 Memory Lane",
  "zipCode": "12345",
  "cardNumber": "1234123412341234",
  "cardExpiration": "01/25",
  "cvc": "123"
}

Remember that the two consumers of the event are interested in different parts of the data: The order processing service needs to know the shipping address and the payment service requires the credit card information. That means we can split up the event into two new events, a ShippingAddress and a CreditCardInformation event. They could look as follows:

# ShippingAddress
{
  "id": 123,
  "firstName": "John",
  "lastName": "Doe",
  "address": "13 Memory Lane",
  "zipCode": "12345"
}

# CreditCardInformation
{
  "id": 456,
  "firstName": "John",
  "lastName": "Doe",
  "cardNumber": "1234123412341234",
  "cardExpiration": "01/25",
  "cvc": "123"
}

Note that splitting up the events required no changes to the way we store customers in our database. We only changed the data structures that are made available to our event consumers. They no longer receive data that is irrelevant to them. As a result, we have increased efficiency at the transport level and, more importantly, we have prevented sensitive information from being sent to parts of the software system where it doesn't belong.
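One way to picture the split is as a set of projections on the producer side. The Python sketch below is only an illustration under some assumptions: the helper names are invented, and because the text does not say what the `id` of each split event refers to, the sketch takes it as a parameter.

```python
from typing import Dict, Iterable

# The stored customer entity stays unchanged. Only the events are tailored.
customer = {
    "id": 42,
    "firstName": "John",
    "lastName": "Doe",
    "emailAddress": "john.doe@example.com",
    "address": "13 Memory Lane",
    "zipCode": "12345",
    "cardNumber": "1234123412341234",
    "cardExpiration": "01/25",
    "cvc": "123",
}

# Hypothetical field lists, one per event type.
SHIPPING_FIELDS = ("firstName", "lastName", "address", "zipCode")
CARD_FIELDS = ("firstName", "lastName", "cardNumber", "cardExpiration", "cvc")


def project(entity: Dict[str, object], event_id: int, fields: Iterable[str]) -> dict:
    """Build an event payload that contains only the listed fields."""
    payload = {"id": event_id}
    payload.update({field: entity[field] for field in fields})
    return payload


# The order processing service would only ever see this payload.
shipping_address = project(customer, 123, SHIPPING_FIELDS)
# The payment service would only ever see this payload.
credit_card_information = project(customer, 456, CARD_FIELDS)
```

Using explicit lists of allowed fields, rather than removing unwanted ones, means that a field added to the customer later does not end up in an event by accident.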

In summary, self-contained events minimize the number of dependencies for event consumers and the number of network calls they make to other systems. Compared to the alternatives, they result in low-maintenance code and more robust architectures. When self-contained events have many different consumers, they tend to become bloated, because they contain information that is not relevant to every type of consumer. The issue can be addressed by splitting them up into smaller events which are individually tailored to each type of consumer.

[1] It is also possible to send no payload at all. While there may be legitimate use cases for such events, I think this type falls into the same category as referential events.
[2] The consumer will almost always be interested in the referenced entity. That is why we added the reference to the event in the first place.
[3] The exception to this rule is when the event references a large entity, like a binary file. In that case, we cannot include the whole information in a single event.
[4] These databases become more of a cache in the case of event sourcing.