Optimizing a PostgreSQL query – A case study

Michael Seifert, 2020-09-09, Updated 2020-10-07

Let me give you a high-level overview of the software: The application in question ingested events from other systems and stored them in a PostgreSQL database. The amount of data was not particularly large; the table storing the events had approximately two million entries. Each event has a creation date and a due date. Think of them like orders in an online shop that have order dates and delivery dates. It is worth noting that most of the events represent historic data, meaning that both the creation and due dates lie in the past.
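
To make this concrete, the table might look roughly like the following sketch. The table and column names match the query shown later in this article; the column types and any further payload columns are assumptions.

CREATE TABLE events (
    id            bigserial PRIMARY KEY,  -- numeric event ID
    creation_date timestamp NOT NULL,     -- when the event was created
    due_date      timestamp NOT NULL      -- when the event is due
    -- further payload columns omitted
);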

The events were exposed via an HTTP API for use by other systems. The HTTP responses contained only a limited set of the events, in order to constrain the size of the payload. A typical response looked like this:

{
  "count": 4242,
  "next": "…",
  "previous": "…",
  "results": [
    {
      "id": 1234,
      "creation_date": "2020-09-01T13:37:00",
      "due_date": "2020-09-03T00:42:42"
    }
  ]
}

The response is a JSON document that represents a subset of the results, in order to limit the response size. The count field holds the total number of results, the next and previous fields contain links to the next and previous pages, and the results array contains the actual events. Each event has a numeric ID and one ISO-formatted string each for the creation date and the due date.

One day I discovered in our monitoring system that requests to this HTTP endpoint took extremely long compared to other endpoints: 95% of the requests took more than 25s to complete. I started by verifying this myself by issuing an HTTP call with curl:

$ time curl -s -H "accept: application/json" --url "https://api.example.com/events/" > /dev/null

real    0m26,834s

Obviously, that's way too slow for an API call. Curiously, subsequent calls took only about three seconds, which is significantly faster:

$ time curl -s -H "accept: application/json" --url "https://api.example.com/events/" > /dev/null

real    0m3,099s

At this point, the reason could be anything, really. To narrow down the cause, I checked the PostgreSQL log. If configured correctly (see log_min_duration_statement), PostgreSQL will log all queries that take longer to execute than a specified threshold. Luckily, our database was configured accordingly and the corresponding query appeared in the logs:

SELECT *
FROM events
WHERE creation_date
  BETWEEN '2020-09-01'::date
  AND '2020-09-03'::date
ORDER BY due_date ASC
LIMIT 10;

The query asks for the 10 events with the earliest due dates that were created in the last three days. According to the logs, this query took up a large portion of the total request duration, which makes it the best candidate for optimization.
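
As an aside, this kind of slow-query logging can be enabled as follows. The 1000 ms threshold is just an example value, not the configuration used here:

-- Log every statement that runs longer than 1000 ms (example value).
-- Equivalent to setting log_min_duration_statement = 1000 in postgresql.conf.
ALTER SYSTEM SET log_min_duration_statement = '1000ms';
SELECT pg_reload_conf();  -- apply the change without restarting the server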

Analyzing the SQL query using EXPLAIN

More often than not, simple queries like this are slow because a database index is missing. But indexes on both creation_date and due_date were present. Surely, PostgreSQL's query planner can shed some light on the execution:

db=> EXPLAIN (ANALYZE, BUFFERS)
  SELECT *
  FROM events
  WHERE creation_date
    BETWEEN '2020-09-01'::date
    AND '2020-09-03'::date
  ORDER BY due_date ASC
  LIMIT 10;

Asking the query planner is as simple as prepending EXPLAIN to the SQL statement. This shows how the query planner would execute the statement and estimates the cost of each step. In this case, I provided the additional ANALYZE and BUFFERS options: ANALYZE actually executes the statement and records the real execution times, and BUFFERS shows cache usage statistics.

The output of the query planner looks as follows:

                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..208.94 rows=10 width=341) (actual time=3229.530..3229.616 rows=10 loops=1)
   Buffers: shared hit=2312195 read=2484 dirtied=1
   ->  Index Scan using index_due_date on events  (cost=0.43..265274.91 rows=12722 width=341) (actual time=3229.528..3229.611 rows=10 loops=1)
         Filter: ((creation_date >= '2020-09-01'::date) AND (creation_date <= '2020-09-03'::date))
         Rows Removed by Filter: 1956485
         Buffers: shared hit=2312195 read=2484 dirtied=1
 Planning time: 0.374 ms
 Execution time: 3229.652 ms
(8 rows)

The query plan reveals a number of things. For one, the Buffers line shows how the PostgreSQL buffers were used when processing the query. It counts the number of buffer blocks that were found in the cache (hit), read from disk (read), or modified (dirtied) during this query run. The fact that most of the data already resides in the PostgreSQL buffers explains why subsequent queries are much faster. This is a good thing! However, the query processes an enormous amount of data: each buffer block corresponds to 8 kB of memory, so in total that's roughly 17.7 GiB of table and index data touched by the query. That means something is inherently wrong with this query and we need to dig deeper.
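
For reference, the arithmetic behind that figure uses nothing but the hit and read block counts from the plan above and the default block size of 8 kB:

-- (2312195 hit + 2484 read) blocks × 8192 bytes ≈ 17.7 GiB
SELECT pg_size_pretty((2312195 + 2484) * 8192::bigint);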

The Index Scan line in the query plan shows that the database uses index_due_date to retrieve events ordered by due date. These events are then filtered by creation_date, outputting only tuples that satisfy the WHERE clause of the query. After 10 rows have been found, the LIMIT clause is satisfied and the query is complete.

But why is filtering by creation_date not backed by an index? And why does the filter iterate over almost all rows in the table?

What's an index, really?

Imagine you are in an IKEA warehouse and you want to collect all parts required for an Ivar shelf. The shelf consists of two vertical supports on the sides, a number of shelf boards, and a metal cross for diagonal support. You know that all the required parts are in the warehouse, but you need to look up the corresponding aisles in some sort of directory or catalog. Otherwise, you would have to sift through the whole warehouse to find the parts. This is what database indexes are for: they let the database retrieve data based on specific attributes, such as "things belonging to an Ivar shelf". In our case, we want to retrieve events by due_date. Just like directories and catalogs, indexes have an order. When only a small amount of data needs to be fetched, as specified by the LIMIT clause, this index order is leveraged to access the data in an ordered fashion. This makes an additional sorting step unnecessary. In our specific query, the query planner sees that we want to retrieve events sorted by due_date and uses the corresponding index to access the data.
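
For completeness, the two single-column indexes involved presumably looked something like this. The name index_due_date appears in the query plan above; the name of the creation date index is an assumption:

CREATE INDEX index_due_date ON events (due_date);
-- Name assumed; the text only states that an index on creation_date exists.
CREATE INDEX index_creation_date ON events (creation_date);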

Since we specified the sort order as ascending, we get the events with the earliest due_date, which are then filtered by creation_date as part of the index scan. However, the query planner does not know that the events with the earliest due dates are a couple of years old, while the creation dates we filter on are very recent. The resulting index scan starts by fetching the oldest events (i.e. the ones with the earliest due dates) and discards them immediately, because the WHERE clause on the creation date is not fulfilled. It does so for almost two million table entries, until we finally get to the set of events we are looking for.
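
If you want to see this kind of mismatch between the date ranges in your own data, a quick aggregate query is enough. This is merely an illustrative check, not a query from the original investigation:

-- Compare the overall date ranges with the narrow creation_date window in the WHERE clause.
SELECT min(creation_date), max(creation_date),
       min(due_date),      max(due_date)
FROM events;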

The query planner is not at fault here, because it cannot know about the intricacies of the stored data. But we can come to its aid by rephrasing our SQL statement:

SELECT *
FROM (
    SELECT *
    FROM events
    WHERE creation_date
      BETWEEN '2020-09-01'::date
      AND '2020-09-03'::date
  ) AS recent_events
ORDER BY due_date ASC
LIMIT 10;

With the subquery in place, the query planner can leverage the indexes on both creation_date and due_date. This results in a significant speedup of our database query.

Software systems keep changing and growing over their lifetime, and so do the layout and the amount of data they store. Therefore, these kinds of performance bottlenecks need to be addressed regularly. Make sure your team gets the necessary time to do so! When a software system is low-friction to work with, performance comes naturally.