PostgreSQL 10 promises easier partitioning to scale for big data. But how does it fare for time-series workloads? And how does that compare to TimescaleDB?
At TimescaleDB we also work on scaling PostgreSQL, but primarily for one specific type of big data workload: time-series data. We do this by building on top of the partitioning functionality already in PostgreSQL.
One common question we hear is: How does PostgreSQL 10 fare for time-series data?
This post answers that question by examining the problems we faced in trying to use PostgreSQL 10 for the time-series workloads we often see, through a detailed comparison of partitioning approaches in PG10 and TimescaleDB.
Specifically, this post starts with an example, and then provides an examination of multi-dimensional partitioning. It follows that with a closer look at insert performance, query performance, repartitioning ease, and table management.
A brief introduction to TimescaleDB and hypertables
For anyone unfamiliar with TimescaleDB, it is a PostgreSQL extension for time-series data that provides automatic table partitioning with hypertables. Hypertables enable scalability by partitioning data in multiple dimensions (typically one time dimension and one or more space dimensions), while otherwise looking and working just like a regular table.
In other words, a hypertable represents the abstraction of a single continuous table across all space and time intervals, such that a user can query and interact with all of their data as if it were in a normal table. This allows hypertables to grow to very large sizes that regular PostgreSQL tables simply cannot support, while retaining the familiar management functionality. Underneath the hood, hypertables rely on the older table inheritance mechanism of PostgreSQL that declarative partitioning is also built on.
So, with this shared DNA, how do the two different approaches of TimescaleDB and PostgreSQL 10 compare?
Example: From zero to query
The best way to answer the question above is to look at an example. Imagine an IoT time-series data scenario in which sensors are deployed in buildings to monitor room temperature and air humidity, ensuring optimal working conditions. A simple table schema for storing such data might look as follows:
CREATE TABLE conditions (
    time        timestamptz,
    temp        float,
    humidity    float,
    device      text
);
To scale up this table for a large amount of time-series data, we would need to partition the table in two dimensions: time and device. This reduces index sizes and allows for faster access by excluding sub-tables at query time. To do this, we will specify four device partitions while the time dimension remains open-ended with one week intervals.
With TimescaleDB you need only one additional function call before you can insert your first piece of data:
-- Turn 'conditions' into a hypertable partitioned by time and device
SELECT create_hypertable('conditions', 'time', 'device', 4,
                         chunk_time_interval => interval '1 week');

INSERT INTO conditions VALUES ('2017-10-03 10:23:54+01', 73.4, 40.7, 'sensor3');
On insert, TimescaleDB automatically creates sub-tables, called chunks, according to the initial partitioning specification (note that we will use “chunks” to mean partitions/sub-tables in TimescaleDB). It will also create any indexes, constraints, or triggers on chunks that the user has declared on the hypertable, including defaulting to creating a time index to speed up queries.
Now, to achieve something similar with PG10 declarative partitioning you’d have to:
-- Note: for declarative partitioning, the root table must itself be created
-- as a partitioned table, i.e., with PARTITION BY RANGE (device) appended to
-- the CREATE TABLE statement shown earlier.

-- Create device partitions
CREATE TABLE conditions_p1 PARTITION OF conditions
    FOR VALUES FROM (MINVALUE) TO ('g')
    PARTITION BY RANGE (time);
CREATE TABLE conditions_p2 PARTITION OF conditions
    FOR VALUES FROM ('g') TO ('n')
    PARTITION BY RANGE (time);
CREATE TABLE conditions_p3 PARTITION OF conditions
    FOR VALUES FROM ('n') TO ('t')
    PARTITION BY RANGE (time);
CREATE TABLE conditions_p4 PARTITION OF conditions
    FOR VALUES FROM ('t') TO (MAXVALUE)
    PARTITION BY RANGE (time);

-- Create time partitions for the first week in each device partition
-- (the upper bound is exclusive, so a full week ends at '2017-10-08')
CREATE TABLE conditions_p1_y2017m10w01 PARTITION OF conditions_p1
    FOR VALUES FROM ('2017-10-01') TO ('2017-10-08');
CREATE TABLE conditions_p2_y2017m10w01 PARTITION OF conditions_p2
    FOR VALUES FROM ('2017-10-01') TO ('2017-10-08');
CREATE TABLE conditions_p3_y2017m10w01 PARTITION OF conditions_p3
    FOR VALUES FROM ('2017-10-01') TO ('2017-10-08');
CREATE TABLE conditions_p4_y2017m10w01 PARTITION OF conditions_p4
    FOR VALUES FROM ('2017-10-01') TO ('2017-10-08');

-- Create a time index on each leaf partition
CREATE INDEX ON conditions_p1_y2017m10w01 (time);
CREATE INDEX ON conditions_p2_y2017m10w01 (time);
CREATE INDEX ON conditions_p3_y2017m10w01 (time);
CREATE INDEX ON conditions_p4_y2017m10w01 (time);

INSERT INTO conditions VALUES ('2017-10-03 10:23:54+01', 73.4, 40.7, 'sensor3');
As this example shows, declarative partitioning is quite powerful but also laborious to set up and configure.
And, for an open-ended dimension like time, it often doesn’t make sense to pre-create all partitions. For instance, if data volumes change, a table partitioned on a constant time interval will suddenly require each partition to hold more data, potentially over-sizing partitions. Our example above only covers the first week; partitions for other time periods would need to be created manually as needed. In addition, out-of-order data that falls outside the time ranges covered by current partitions would simply fail on insert, while TimescaleDB auto-creates chunks no matter where in time they fall.
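To make the manual work concrete, here is a minimal sketch of what extending the PG10 example by one more week might look like for a single device partition, and what happens to a row that no partition covers. The partition name, dates, and sample values are illustrative assumptions, not part of the original setup.

```sql
-- Assumes the PG10 example tables above. The partition name and the date
-- range are illustrative; range upper bounds are exclusive.
CREATE TABLE conditions_p3_y2017m10w02 PARTITION OF conditions_p3
    FOR VALUES FROM ('2017-10-08') TO ('2017-10-15');
CREATE INDEX ON conditions_p3_y2017m10w02 (time);

-- Falls inside the newly created partition, so this insert succeeds.
INSERT INTO conditions VALUES ('2017-10-09 08:00:00+00', 72.1, 41.3, 'sensor3');

-- No leaf partition covers September 2017, so PG10 rejects this row with a
-- "no partition of relation ... found for row" error.
INSERT INTO conditions VALUES ('2017-09-20 08:00:00+00', 71.8, 39.9, 'sensor3');
```

The same two statements would have to be repeated for each of the other three device partitions, and again for every new week.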
Ease of use is not the only area where PG10 declarative partitioning falls short for time-series workloads. Let’s analyze and compare the partitioning approaches in more detail.
Multi-dimensional partitioning
The PG10 and TimescaleDB examples above partition data in multiple dimensions, which is often desirable for increased throughput (e.g., storing each space partition on a different disk) or for data locality purposes (e.g., ensuring all measurements for a particular floor in a building end up within the same space partition).
Let’s examine how each example is partitioning.
PostgreSQL 10 Partitioning
Option 1: Space / time partitioning
In our PG10 declarative partitioning example, we created a tree of tables, with the conditions table as root, followed by four device tables in the next sub-level, and then an arbitrary number of time tables in the third sub-level. The figure below shows a conceptual view of the multi-dimensional table partitioning.
Each sub-level in the tree corresponds to a set of tables with a CHECK constraint that is inherited by child tables, so that the leaf partitions have one CHECK constraint for each partitioning dimension. The CHECK constraints guarantee data integrity for a particular partition and allow the planner to do constraint exclusion to optimize queries. Note that only the leaf tables hold actual data.
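One way to observe constraint exclusion is to ask the planner for a query plan. The sketch below assumes the PG10 example tables from earlier in the post; the literal values are illustrative.

```sql
-- constraint_exclusion defaults to 'partition', which is sufficient here.
SHOW constraint_exclusion;

-- The WHERE clause uses constants for both partitioning columns, so the
-- planner can compare them against each partition's bounds at plan time.
EXPLAIN
SELECT *
FROM conditions
WHERE device = 'sensor3'
  AND time >= '2017-10-03'
  AND time <  '2017-10-04';
```

If exclusion works as described, the plan should list only the leaf partitions whose bounds overlap the predicate, rather than every leaf table in the tree.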
Option 2: Time / space partitioning
Now, some might have already noted that we could have designed this differently, instead putting time at the first sub-level and device in the second sub-level as shown below. While this is another option, it also introduces challenges as we will see.
Evaluating both PG10 options
The choice of how to order the dimensions in the tree has implications for the number of tables one needs to create to expand the time dimension with new tables (and for repartitioning and dropping data, as we show later in this post).
Putting the device dimension at the first sub-level, as in the first option, requires creating four new tables for every time interval (one for each space partition). In the second option, we’d have to create five tables (one for the new time interval and four space sub-partitions). Additional dimensions will further complicate the possible partitioning strategies and require unnecessarily large inheritance trees with additional sub-levels.
All of these extra tables are likely to negatively impact performance and scalability as more tables need to be processed for both inserts and queries (more below with measurement graphs).
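To check how many tables a given tree actually contains, one can walk the pg_inherits catalog, which records parent/child relationships for both inheritance and declarative partitions. This is a minimal sketch that assumes the root table is named conditions, as in the example.

```sql
-- Count every descendant of 'conditions', at any depth in the tree.
WITH RECURSIVE tree AS (
    SELECT inhrelid
    FROM pg_inherits
    WHERE inhparent = 'conditions'::regclass
  UNION ALL
    SELECT i.inhrelid
    FROM pg_inherits i
    JOIN tree t ON i.inhparent = t.inhrelid
)
SELECT count(*) AS descendant_tables
FROM tree;
```

Running it after each new time interval is added gives a rough sense of how quickly the tree grows under each option.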
TimescaleDB: Time + space partitioning
TimescaleDB also relies on table inheritance, but avoids deeply nested multi-level inheritance trees. Instead, it creates a shallow tree of leaf chunks directly at the root irrespective of the number of partitioning dimensions, as shown below.
Each chunk has one CHECK constraint per dimension and is created only when needed.
This is significant because it reduces the number of tables in the tree and avoids nesting. By doing so, TimescaleDB improves insert and query performance, simplifies repartitioning, and makes table management (and retention) easier.
The rest of this post examines each of these areas in further detail.
Insert performance: PG10 suffers as the number of partitions increases
An important requirement of time-series data workloads is fast and consistent insert performance. Multi-dimensional partitioning is key to enabling such performance.
Multi-dimensional partitioning allows incoming data to be processed piecemeal, in more manageable right-sized chunks. In TimescaleDB, it allows the database to achieve fairly consistent write rates, regardless of data size and volume. This is helped by the fact that time-series data is mostly append-only, allowing writes to the latest chunks and indexes that can be kept “warm” in memory. For this purpose, TimescaleDB has a highly optimized insert path that routes tuples (data values) to the correct chunks during inserts.
Declarative partitioning in PG10 takes a big step forward in this regard as well, implementing its own tuple routing. Our review of the code, however, indicates that a single INSERT statement always opens all leaf tables, including any indexes. For big data workloads, with thousands of sub-tables, this proves a scalability problem, because each and every one of those tables (including all indexes) needs to be processed and opened.
And thousands of sub-tables is not uncommon for time-series data. As an example, one of our customers partitions time in five minute intervals to handle their unique data rates, which results in upwards of 50,000 chunks over just a few days.
Insert performance measurement graphs
To confirm our suspicions, we ran some time-series workloads through PostgreSQL 10 with the partitioning setup from our earlier example. We varied the number of partitions to see the effect on insert performance with batch sizes of 1 row (tuple) and 1,000 rows (tuples) per insert. The results can be seen below.
Clearly, insert performance using PG10 declarative partitioning suffers as the number of partitions grows, while TimescaleDB maintains a fairly constant rate.
But insert performance is not the only problem. Before starting each test, we also had to pre-create all of our partitions in PostgreSQL 10, since it doesn’t auto-create partitions for the user. Simply creating the partitions at the beginning of the test took several minutes, and we had to significantly increase max_locks_per_transaction since PG10 has to acquire locks for all the partitions, even during insertion of just a single tuple.
In contrast, TimescaleDB creates a new chunk only when a tuple falls outside the boundaries of existing chunks. Further, it only keeps a limited number of chunks open per INSERT, closing and opening chunks as needed in order to reduce the in-memory state and number of locks maintained during inserts. Therefore, TimescaleDB doesn’t suffer worse insert rates as the number of chunks grows. And, with mostly append workloads, and queries to recent time intervals, only a few chunks typically need to be open in memory at any one time.
In case data volumes do change over time, in TimescaleDB one can easily change the size of a chunk by simply setting a different partitioning interval for the time dimension; new chunks will be created with this updated interval. While this is a manual process right now, it involves changing only a single setting to tune the chunk size. Also, we are currently developing auto-tuning functionality that automatically adapts the interval of a time dimension to hit a given chunk target size, as data rates and volumes change.
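As a minimal sketch of that single setting, TimescaleDB exposes a set_chunk_time_interval() function. The example assumes the conditions hypertable from earlier; the one-day interval is an arbitrary illustrative value, not a recommendation.

```sql
-- Chunks created from now on will each span one day instead of one week.
-- Existing chunks keep the interval they were created with.
SELECT set_chunk_time_interval('conditions', interval '1 day');
```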
Query performance
Although insert performance is paramount with time-series data, query performance is important as well, because you also need to analyze all that inserted data efficiently. In this area, PG10 would benefit from additional work.
For example, TimescaleDB incorporates a number of time-oriented functions and query optimizations not available in vanilla PostgreSQL that make it easier and more efficient to work with time-series data.
These optimizations include, but are not limited to:
- A planner optimization that allows reuse of existing time indexes in group by queries. Without this optimization, these queries would have to use full table scans. Alternately, one could specify an expression index for every time-grouping (minutely, hourly, daily) used in such a query, but this would severely impact write performance.
- A query execution modification that can exclude chunks even if they contain non-constant expressions that normally turn into a scan of all sub-tables in regular PostgreSQL. For instance, SELECT * FROM conditions WHERE time > now() - interval '1 day'; would be a full scan of all data in regular PostgreSQL, while TimescaleDB would only scan the chunks that have data for the given time interval.
- A time_bucket() function for grouping data by time. This is essentially a more flexible version of date_trunc() that adds the ability to group by, e.g., 15 minutes instead of simply truncating the time value at the granularity of full weeks, days, hours, etc.
- A histogram() function that gives insight into the distribution of data when grouping by time, instead of, e.g., giving just simple averages.
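The two functions in the list above might be used as follows. This is a sketch against the conditions hypertable from the earlier example; the bucket width, value range, and bucket count are illustrative assumptions.

```sql
-- Average temperature per device in 15-minute buckets over the last day.
SELECT time_bucket('15 minutes', time) AS bucket,
       device,
       avg(temp) AS avg_temp
FROM conditions
WHERE time > now() - interval '1 day'
GROUP BY bucket, device
ORDER BY bucket, device;

-- Distribution of temperature readings per device: five equal-width
-- buckets between 60 and 85 (illustrative range and bucket count).
SELECT device,
       histogram(temp, 60.0, 85.0, 5)
FROM conditions
GROUP BY device;
```

histogram() returns an array of counts, with one extra bucket at each end for values that fall below or above the stated range.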
As TimescaleDB evolves, the list of time-oriented query functionality will expand, and we are keenly listening to the invaluable feedback that our customers and community are giving us.
Related to performance: Range vs. Hash partitioning
Another key area where declarative partitioning is lacking relates to how the values are distributed across the partitions. PG10 declarative partitioning only supports range and list partitioning. On the other hand, TimescaleDB allows for hash partitioning, which makes partitioning easier by better distributing the entire range of values across the partitions.
Using declarative partitioning, one would distribute a high cardinality set of values using range partitioning. However, range partitioning can be tricky to get right because the partitions (1) must cover all possible values and (2) should ideally be evenly distributed across partitions.
In our example, inserting data using range partitioning with devices named Xsensor (where X is a number) would result in all data ending up in the same partition. Even using an integer field for range partitioning (e.g., using an incremental ID) can be problematic, as it results in partitions being filled one-by-one as the ID increments.
Avoiding such uneven distribution without hash partitioning would require a more carefully designed and finer-grained range partitioning configuration. This mostly relegates range partitioning to fields that are more random in nature, like UUIDs, or applications where data locality is more important than spreading data evenly.
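A quick way to check for this kind of skew in the PG10 setup is to count rows per leaf partition using the tableoid system column. The query below is a sketch that assumes the example tables from earlier in the post.

```sql
-- tableoid identifies the leaf table that physically stores each row.
SELECT tableoid::regclass AS leaf_partition,
       count(*)           AS row_count
FROM conditions
GROUP BY tableoid
ORDER BY row_count DESC;
```

If nearly all rows show up under a single device partition, the range boundaries do not match the actual distribution of device names.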
Repartitioning
With multi-dimensional partitioning, one might not only want to change the interval of a time dimension, but also the space dimension(s).
Imagine that we’d want to repartition the device dimension to span eight partitions instead of four. (For example, if we mapped each device partition to a specific tablespace [disk] and now want to add four new disks.) This would not be possible with the declarative partitioning scheme in the example above, because the time dimension sub-tables are children of the parent device tables. We’d first have to remove all time sub-tables to be able to repartition the device dimension.
Changing the partitioning so that time is at the first level in the table tree would help, but is still not ideal. It would allow us to expand the next time interval to eight device partitions spread across different tablespaces. Such a configuration may be preferable for repartitioning, but it still inflates the number of tables in the tree.
With its shallow inheritance tree, TimescaleDB avoids these issues, processing fewer tables and allowing easy repartitioning regardless of the number of dimensions due to lack of nested table dependencies. Repartitioning can be done by simply updating the partitioning configuration. The system only applies any partitioning changes to new data, helpfully avoiding the need to move existing data. TimescaleDB also makes it easy to associate tablespaces with hypertables so that the chunks can automatically spread across multiple disks.
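As a rough sketch of what "updating the partitioning configuration" can look like, TimescaleDB provides attach_tablespace() and set_number_partitions(). The tablespace names disk1 and disk2 are hypothetical and must already exist, and the availability and exact signatures of these functions depend on the TimescaleDB version in use.

```sql
-- Let chunks of 'conditions' be placed on additional tablespaces
-- (disk1 and disk2 are hypothetical names for existing tablespaces).
SELECT attach_tablespace('disk1', 'conditions');
SELECT attach_tablespace('disk2', 'conditions');

-- Use eight device partitions for chunks created from now on;
-- existing chunks are left as they are.
SELECT set_number_partitions('conditions', 8, 'device');
```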
Table management and retention
Managing tables is another important aspect of dealing with time-series data (and all data, really) in PostgreSQL: optimizing tables for queries; ensuring data integrity with constraints, unique indexes, foreign keys; extending functionality with triggers, and managing data retention.
Unfortunately, declarative partitioning has a number of limitations in this regard:
- Indexes, constraints, and triggers cannot be created on the root table (with few exceptions). These objects have to be manually created on each leaf partition.
- Unique indexes and constraints, including primary keys, are not supported on partitioned tables. This means one cannot use ON CONFLICT clauses (e.g., upserts). Unique indexes are, however, supported on leaf partitions but only guarantee uniqueness across that particular table.
- Many management commands, like REINDEX TABLE, do not typically recurse to sub-tables.
- Dropping old data can be laborious as it requires manually dropping old tables, and a partitioning scheme that ensures data consistency across space dimensions. This makes it harder to implement retention policies for time-series data.
TimescaleDB aims to make hypertables work similarly to regular tables, without the limitations listed above. For instance, indexes, triggers and constraints work as expected, recursing to chunks when created on the hypertable. Hypertables even support unique indexes and ON CONFLICT clauses, as long as the unique index covers all partitioning columns. Retention policies are easy to implement with the drop_chunks() command that drops all chunks beyond a given time horizon in a safe manner.
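The sketch below illustrates an upsert and a simple retention call on the conditions hypertable from the earlier example. The sample values and the three-month horizon are illustrative assumptions, and the argument order of drop_chunks() has differed between TimescaleDB releases, so check the documentation for your version.

```sql
-- The unique index must include every partitioning column (time and device).
CREATE UNIQUE INDEX ON conditions (time, device);

-- Upsert: overwrite the stored reading if one already exists for this
-- device and timestamp.
INSERT INTO conditions VALUES ('2017-10-03 10:23:54+01', 75.0, 41.2, 'sensor3')
ON CONFLICT (time, device) DO UPDATE
    SET temp = EXCLUDED.temp,
        humidity = EXCLUDED.humidity;

-- Retention: drop chunks that only contain data older than three months.
-- This is the (older_than, table_name) form; other releases take the
-- hypertable as the first argument instead.
SELECT drop_chunks(interval '3 months', 'conditions');
```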
Will TimescaleDB ever adopt declarative partitioning?
While declarative partitioning is an important development for scaling traditional workloads, the improvements over regular table inheritance are not significant for time-series workloads. The declarative table creation is a nice new feature, but doesn’t make much of a difference for TimescaleDB since sub-tables are already created automatically under the hood.
The largest single feature of declarative partitioning is the addition of tuple routing, which TimescaleDB already implements in its own way. It would be nice to be able to leverage the tuple routing in PostgreSQL 10, but as pointed out in this post, there are some concerns that the current implementation doesn’t scale well with a large number of sub-tables. Therefore, it is unlikely that we will adopt this particular functionality of declarative partitioning in the near future.
There may be other reasons to adopt declarative partitioning, like leveraging specific query optimizations. At this time, however, we are unaware of any major query optimizations that are specific to declarative partitioning and which do not apply to regular inheritance tables.
To conclude, PostgreSQL 10 with declarative partitioning is a big step forward, but it is a general implementation that seems more targeted towards medium-sized traditional workloads than for typical time-series data.
Declarative partitioning may work well with workloads that have a simpler partitioning scheme (e.g., scaling a large customer table by partitioning on customerID). It may also work well for data with a time dimension if the volume is not too large, and if out-of-order data is not expected (or if failed inserts are acceptable for out-of-order data).
If you are expecting larger time-series data volumes, however, you will probably want something that is more performant, provides time-oriented query optimizations, involves less manual work, and is more convenient.