
Postgres Partitioning

Sean Hellebusch, Senior Staff Engineer, explains how to utilize the power of database partitioning to support data collection at scale.

Published Date: Apr 18, 2022


Like most technology companies, here at Hinge Health we collect user data not only to better serve our patients, but also to provide a data-backed feedback loop for developing both our offerings and our platform. And while we're very fortunate to have been quite successful, that success comes with technical costs; namely, scaling up to support our growing user base. To meet demand, we've been advancing our architecture to optimize the way we store user data. A key aspect of the design was to use partitioned tables.

In this post, we'll start with an overview of the three partitioning strategies that Postgres supports. From there, we'll dive into some examples of how to create and query partitioned tables and end with some indexing optimizations.

The code examples will all be from the command line using the psql client. If you'd like to follow along, a quick way to get started is to use a Docker image like so:

$ docker run -d -e POSTGRES_USER=admin -e POSTGRES_PASSWORD=password -e POSTGRES_DB=db -p 5432:5432 postgres:14
$ psql postgres://admin:password@localhost:5432/db

What is database partitioning?

Simply put, database partitioning is the process of taking very large tables and splitting them into smaller pieces. In addition to drastically improving query performance, partitioning provides future flexibility: old data can be detached for cold storage or migrated to namespaces with reduced computing power. Under the hood, Postgres inserts and queries data by partition automatically; however, there are some caveats, which we'll dive into below.
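For example, detaching a partition turns it back into a standalone table that can then be archived or moved. Here's a minimal sketch using hypothetical names (a measurements table with a measurements_2021 partition and an archive schema), not our actual setup:

-- Sketch (hypothetical names): archive last year's data by detaching its
-- partition, then moving the now-standalone table to an "archive" schema.
ALTER TABLE measurements DETACH PARTITION measurements_2021;
ALTER TABLE measurements_2021 SET SCHEMA archive;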

Postgres supports three types of partitioning: range, list, and hash. Range partitioning breaks a large table into a set of ranges across a continuum; a common example is date- or time-based ranges (week, month, etc.). List partitioning creates tables based on the value of a given key; in other words, records are put into buckets based on the value of a given field on the table. A good example is partitioning employees by store id. When ranges or lists don't fit, hash partitioning can be used to evenly distribute records across a predetermined number of partitions. Hash partitioning is commonly used when there is no natural way to distribute data in an organized fashion.
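To get a quick feel for the list and hash strategies (range is covered in depth for the rest of this post), here's a minimal sketch; the employees and sessions tables and their columns are hypothetical, not part of our schema:

-- List partitioning: one partition per store_id value (hypothetical table).
CREATE TABLE employees (
  id int NOT NULL,
  store_id int NOT NULL,
  name text NOT NULL
) PARTITION BY LIST (store_id);
CREATE TABLE employees_store_1 PARTITION OF employees FOR VALUES IN (1);
CREATE TABLE employees_store_2 PARTITION OF employees FOR VALUES IN (2);

-- Hash partitioning: rows spread evenly across a fixed number of buckets.
CREATE TABLE sessions (
  id uuid NOT NULL,
  data jsonb
) PARTITION BY HASH (id);
CREATE TABLE sessions_0 PARTITION OF sessions FOR VALUES WITH (MODULUS 2, REMAINDER 0);
CREATE TABLE sessions_1 PARTITION OF sessions FOR VALUES WITH (MODULUS 2, REMAINDER 1);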

In designing our new system for tracking user data, we've combined a number of tables into a single, abstract model that can accommodate any number of additional user-triggered data types. The events we store in the database are a time series of actions that we can use to rebuild a user's daily activities. Because of this, we've decided to use range partitioning with time as the basis. The remainder of this post focuses on that strategy, as it is a common use case.

Creating Partitioned Tables

Let's take a look at some example code for how this is done. First, in order to create a partitioned table, we need to create the logical table itself, which provides the base definition.

db=# CREATE TABLE events (
  user_id int NOT NULL,
  for_date date NOT NULL,
  "timestamp" timestamp without time zone,
  data jsonb NOT NULL
) PARTITION BY RANGE (for_date);
CREATE TABLE
db=# \d
            List of relations
 Schema |  Name  |       Type        |   Owner
--------+--------+-------------------+-----------
 public | events | partitioned table | admin
(1 row)

For our use case, we need the ability to query a user's data by a given date, so we decided to store a timestamp for when the event took place as well as a for_date that holds the simplified YYYY-MM-DD for the day the event occurred. The date type is, in my opinion, simpler for querying; typing out a full timestamp is a bit painful, and casting requires additional keystrokes. At this point, we have a partitioned table but no partitions to actually hold data. If we try to insert now, it'll fail.

db=# INSERT INTO events (user_id, timestamp, for_date, data) VALUES (123, CURRENT_TIMESTAMP, '2022-04-04', '{"some": "data"}');
ERROR:  no partition of relation "events" found for row
DETAIL:  Partition key of the failing row contains (for_date) = (2022-04-04).

When partitioning, you'll want to be thoughtful about the size of your partitions. Too many small partitions create more overhead for the query planner; too few large ones leave indexes big enough to negate the optimization we're applying. We decided to partition by month.

db=# CREATE TABLE events_042022 PARTITION OF events FOR VALUES FROM ('2022-04-01') TO ('2022-05-01');
CREATE TABLE
db=# \d
                List of relations
 Schema |     Name      |       Type        |   Owner
--------+---------------+-------------------+-----------
 public | events        | partitioned table | admin
 public | events_042022 | table             | admin
(2 rows)
db=# \d events_042022
                   Table "public.events_042022"
  Column   |            Type             | Collation | Nullable | Default
-----------+-----------------------------+-----------+----------+---------
 user_id   | integer                     |           | not null |
 for_date  | date                        |           | not null |
 timestamp | timestamp without time zone |           |          |
 data      | jsonb                       |           | not null |
Partition of: events FOR VALUES FROM ('2022-04-01') TO ('2022-05-01')

Now you can see we have an actual table. When creating partitions, choose a table naming strategy that makes sense for your use case (e.g., events_MMYYYY). With the partition in place, we can insert some data into it. If a row falls outside the bounds of every partition's range, Postgres will throw an error stating that no partition exists, so be careful to ensure the target partition does.

db=# INSERT INTO events (user_id, timestamp, for_date, data) VALUES (123, CURRENT_TIMESTAMP, '2022-04-04', '{"some": "data"}');
INSERT 0 1
db=# INSERT INTO events (user_id, timestamp, for_date, data) VALUES (123, CURRENT_TIMESTAMP, '2022-06-01', '{"some": "data"}');
ERROR:  no partition of relation "events" found for row
DETAIL:  Partition key of the failing row contains (for_date) = (2022-06-01).
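One way to avoid hitting that error in production is to create upcoming partitions ahead of time, for example from a scheduled job. Here's a minimal sketch using the events_MMYYYY naming convention from above; the DO block is our own illustration (the scheduling mechanism is out of scope), not a built-in partitioning feature:

-- Sketch: create next month's partition ahead of time if it doesn't exist.
DO $$
DECLARE
  first_day  date := date_trunc('month', CURRENT_DATE + interval '1 month')::date;
  next_first date := (first_day + interval '1 month')::date;
BEGIN
  EXECUTE format(
    'CREATE TABLE IF NOT EXISTS %I PARTITION OF events FOR VALUES FROM (%L) TO (%L)',
    'events_' || to_char(first_day, 'MMYYYY'),  -- e.g. events_052022
    first_day, next_first
  );
END $$;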

You may have noticed that the range set on the first partition includes the first day of the month and the first day of the next month. This is because Postgres partitions have an inclusive lower bound and an exclusive upper bound. Getting this wrong is a common mistake and can be very difficult to remediate once in production, especially if you've already created a large number of tables. If you try to insert on the upper bound, it'll throw the same error as before.

db=# CREATE TABLE events_012022 PARTITION OF events FOR VALUES FROM ('2022-01-01') TO ('2022-01-31');
CREATE TABLE
db=# INSERT INTO events (user_id, timestamp, for_date, data) VALUES (123, CURRENT_TIMESTAMP, '2022-01-31', '{"some": "data"}');
ERROR:  no partition of relation "events" found for row
DETAIL:  Partition key of the failing row contains (for_date) = (2022-01-31).

If this does happen, you can fix the partition's bounds, but it requires two SQL commands (a detach and a re-attach) and therefore should be done in a transaction.

db=# BEGIN TRANSACTION;
BEGIN
db=# ALTER TABLE events DETACH PARTITION events_012022;
ALTER TABLE
db=# ALTER TABLE events ATTACH PARTITION events_012022 FOR VALUES FROM ('2022-01-01') TO ('2022-02-01');
ALTER TABLE
db=# COMMIT TRANSACTION;
COMMIT
db=# INSERT INTO events (user_id, timestamp, for_date, data) VALUES (123, CURRENT_TIMESTAMP, '2022-01-31', '{"some": "data"}');
INSERT 0 1
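As an aside, Postgres 14 (the version we're running in these examples) also supports ALTER TABLE ... DETACH PARTITION ... CONCURRENTLY, which avoids holding a strong lock while detaching; note, however, that the concurrent form can't run inside a transaction block, so it doesn't combine with the detach-and-re-attach pattern above.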

Querying Data

At this point, we have a couple of partitions with a couple of rows. We can run a SELECT statement against the logical table, and data will be pulled from all partitions.

db=# SELECT * FROM events;
 user_id |  for_date  |         timestamp          |       data
---------+------------+----------------------------+------------------
     123 | 2022-01-31 | 2022-04-04 20:09:29.750249 | {"some": "data"}
     123 | 2022-04-04 | 2022-04-04 18:49:14.154143 | {"some": "data"}
(2 rows)

Over time, querying across an accumulating number of partitions can become slow. Postgres solves this problem automatically so long as the partition key is used in the WHERE clause; the query planner will prune partitions whose ranges cannot match. We can see this in action by analyzing our queries.

db=# EXPLAIN ANALYZE SELECT * FROM events WHERE user_id = 123;
                                                        QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..46.80 rows=10 width=48) (actual time=0.053..0.121 rows=1 loops=1)
   ->  Seq Scan on events_012022 events_1  (cost=0.00..23.38 rows=5 width=48) (actual time=0.011..0.020 rows=0 loops=1)
         Filter: (user_id = 123)
   ->  Seq Scan on events_042022 events_2  (cost=0.00..23.38 rows=5 width=48) (actual time=0.015..0.033 rows=1 loops=1)
         Filter: (user_id = 123)
 Planning Time: 0.069 ms
 Execution Time: 0.182 ms

db=# EXPLAIN ANALYZE SELECT * FROM events WHERE for_date > '2022-04-01' AND user_id = 123;
                                                    QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Seq Scan on events_042022 events  (cost=0.00..26.05 rows=2 width=48) (actual time=0.023..0.044 rows=1 loops=1)
   Filter: ((for_date > '2022-04-01'::date) AND (user_id = 123))
 Planning Time: 0.198 ms
 Execution Time: 0.112 ms

Indexing

Another way to optimize partitioned tables, just like standard tables, is to use indexes. Indexes defined on a partitioned table are virtual: when a partition is created, Postgres automatically creates a matching index on it. This is an important feature; it keeps each index smaller, so the indexes of frequently accessed partitions can remain in memory while those of less-accessed partitions stay on disk. Let's look at an example of how this is done. Using the same table definition from earlier, we can add an index like so. Note that the for_date index on the new partition is generated automatically.

db=# CREATE INDEX ON events (for_date);
CREATE INDEX
db=# \d events
                Partitioned table "public.events"
  Column   |            Type             | Collation | Nullable | Default
-----------+-----------------------------+-----------+----------+---------
 user_id   | integer                     |           | not null |
 for_date  | date                        |           | not null |
 timestamp | timestamp without time zone |           |          |
 data      | jsonb                       |           | not null |
Partition key: RANGE (for_date)
Indexes:
    "events_for_date_idx" btree (for_date)
Number of partitions: 0
db=# DROP TABLE events_042022;
DROP TABLE
db=# CREATE TABLE events_042022 PARTITION OF events FOR VALUES FROM ('2022-04-01') TO ('2022-05-01');
CREATE TABLE
db=# \d events_042022
                   Table "public.events_042022"
  Column   |            Type             | Collation | Nullable | Default
-----------+-----------------------------+-----------+----------+---------
 user_id   | integer                     |           | not null |
 for_date  | date                        |           | not null |
 timestamp | timestamp without time zone |           |          |
 data      | jsonb                       |           | not null |
Partition of: events FOR VALUES FROM ('2022-04-01') TO ('2022-05-01')
Indexes:
    "events_042022_for_date_idx" btree (for_date)

Now let's take a look at the same analyzed queries from before. Note that the index is only used when for_date appears in the WHERE clause.

db=# EXPLAIN ANALYZE SELECT * FROM events WHERE user_id = 123;
                                                        QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..46.80 rows=10 width=48) (actual time=0.055..0.181 rows=0 loops=1)
   ->  Seq Scan on events_012022 events_1  (cost=0.00..23.38 rows=5 width=48) (actual time=0.013..0.034 rows=0 loops=1)
         Filter: (user_id = 123)
   ->  Seq Scan on events_042022 events_2  (cost=0.00..23.38 rows=5 width=48) (actual time=0.011..0.039 rows=0 loops=1)
         Filter: (user_id = 123)
 Planning Time: 0.431 ms
 Execution Time: 0.241 ms

db=# EXPLAIN ANALYZE SELECT * FROM events WHERE for_date > '2022-04-01' AND user_id = 123;
                                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on events_042022 events  (cost=6.83..22.19 rows=2 width=48) (actual time=0.035..0.067 rows=0 loops=1)
   Recheck Cond: (for_date > '2022-04-01'::date)
   Filter: (user_id = 123)
   ->  Bitmap Index Scan on events_042022_for_date_idx  (cost=0.00..6.83 rows=357 width=0) (actual time=0.014..0.024 rows=0 loops=1)
         Index Cond: (for_date > '2022-04-01'::date)
 Planning Time: 0.108 ms
 Execution Time: 0.248 ms
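Since our example queries also filter on user_id, it may be worth indexing that column too; this is a sketch for illustration, not part of our production schema. Creating the index on the partitioned table automatically propagates a matching index to each partition:

-- Sketch: index the other commonly filtered column; each partition gets
-- its own user_id index automatically.
CREATE INDEX ON events (user_id);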

Design Considerations

When partitioning data, regardless of the partitioning strategy you choose, it's imperative to partition by the column(s) that will most often appear in your WHERE clauses. As mentioned above, the query planner prunes partitions whose bounds can't satisfy the WHERE clause, which speeds up queries.

Additionally, the number of partitions can drastically impact query performance. With too few, tables and indexes become large, increasing the chance of cache misses. With too many, query planning becomes expensive, which also means higher memory consumption during execution.
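When tuning the partition count, it helps to see what you actually have. On Postgres 12 and later, the built-in pg_partition_tree function lists a table's partitions; run against the events table from our examples, it would look something like this:

db=# SELECT relid, isleaf, level FROM pg_partition_tree('events');
     relid     | isleaf | level
---------------+--------+-------
 events        | f      |     0
 events_012022 | t      |     1
 events_042022 | t      |     1
(3 rows)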

Summary

With some up-front planning, partitioning database tables is a powerful way to house very large datasets at scale. While there is a bit of maintenance to ensure tables are set up and the partition keys, ranges, and indexes are carefully crafted, you can enjoy optimized data access for years to come.

Credits: Photo by Pawel Czerwinski on Unsplash