This is the first in a series of performance benchmarks comparing TimescaleDB to other databases for storing and analyzing time-series data.
TimescaleDB is a new, open-source time-series database architected for fast ingest, complex queries, and ease of use. It looks like PostgreSQL to the outside world (in fact, it’s packaged as an extension), which means it inherits the rock-solid reliability, tooling, and vast ecosystem of PostgreSQL.
But for time-series data, how does it compare to PostgreSQL itself? Put another way, how does one improve on an existing database stalwart with over 20 years of development?
By building something that is more scalable, faster, and easier to use, and opening it up to a variety of new use cases.
More specifically, compared to PostgreSQL, TimescaleDB exhibits:
- 20x higher inserts at scale (constant even at billions of rows)
- Faster queries, ranging from 1.2x to over 14,000x improvements for time-based queries
- 2000x faster deletes, critical for implementing data retention policies
- New time-centric functions, making time-series manipulation in SQL even easier
In short, TimescaleDB will outperform PostgreSQL for your needs if one or more of the following is true:
- You are insert bound (typically seen at the 10s of millions of rows, depending on allocated memory)
- Your queries are largely time-based and involve more than a single key-value lookup
- Deleting data is a big pain for you
- You are frustrated by the limitations of SQL for time-series analysis (examples below like: time bucketing, bi-temporality, etc.)
Below is a detailed set of benchmarks that compare TimescaleDB versus PostgreSQL 9.6 across inserts, queries, deletes, and ease-of-use. (If you would like to run the benchmarks yourself, here is the GitHub repo.)
(Update: at the time of this benchmark, PostgreSQL 10 was still in beta. We’ve since compared Timescale to PostgreSQL 10 for working with time-series data.)
Here’s the setup we used for all of these TimescaleDB vs PostgreSQL tests:
- Azure DS4 Standard (8 cores, 28GB memory) with network attached SSD
- Both databases were given all available memory and used 8 clients concurrently
- 4,000 simulated devices generated 10 CPU metrics every 10 seconds for 3 full days
- Inserts resulted in 1 (hyper)table with 100M rows of data, with a second set of tests at 1B rows
- For TimescaleDB, we set the chunk size to 12 hours, resulting in 6 total chunks for our 100 million row dataset and 60 total chunks for our 1 billion row dataset
Inserts: 20x faster inserts at scale, constant even at billions of rows
Many engineers start using PostgreSQL to store their time-series data because of its query power and ease of use, but are eventually forced to migrate to some NoSQL system when they hit a certain scale. Some skip over PostgreSQL for time-series workloads completely because of these scaling problems.
But with TimescaleDB, users can scale to billions of rows on PostgreSQL, while maintaining high, constant insert rates. This also enables users to store their relational metadata and time-series together in the same database, query them together using time-series-optimized SQL, and continue to use their favorite tools and add-ons.
Consider this graph below, which shows a dataset scaled out to 1 billion rows (on a single machine) emulating a common monitoring scenario, with database clients inserting moderately-sized batches of data containing time, a device’s tag set, and multiple numeric metrics (in this case, 10).
PostgreSQL and Timescale start off with relatively even insert performance (~115K rows per second). As data volumes approach 100 million rows, PostgreSQL’s insert rate begins to rapidly decline. At 200 million rows the insert rate in PostgreSQL is an average of 30K rows per second and only gets worse; at 1 billion rows, it’s averaging 5K rows per second.
On the other hand, TimescaleDB sustains an average insert rate of 111K rows per second through 1 billion rows of data–a 20x improvement. We would have benchmarked further, but we’d be waiting on the Elephant for quite awhile: PostgreSQL took almost 40 hours to insert 1 billion rows of data, while TimescaleDB took less than 3 hours.
Whenever a new row of data is inserted into PostgreSQL, the database needs to update the indexes (e.g., B-trees) for each of the table’s indexed columns. Once the indexes are too large to fit in memory — which we find typically happens when your table is in the 10s of millions of rows, depending on your allocated memory — this requires swapping one or more pages in from disk. It is this disk swapping that creates the insert bottleneck. Throwing more memory at the problem only delays the inevitable.
TimescaleDB solves this through its heavily utilization and automation of time-space partitioning, even when running on a single machine. Essentially, all writes to recent time intervals are only to tables that remain in memory. This results in a consistent 20x insert performance improvement over PostgreSQL when inserting data at scale.
Queries: 1.2x to 14,000x faster time-based queries
Next, we looked at query performance.
In general, migrating to TimescaleDB from PostgreSQL impacts queries in one of three ways:
- Most simple queries (e.g., indexed lookups) that typically take <20ms, will be a few milliseconds slower on TimescaleDB, owing to the slightly larger planning time overhead.
- More complex queries that use time-based filtering or aggregations will be anywhere from 1.2x to 5x faster on TimescaleDB.
- Finally, queries where we can leverage time-ordering will be significantly faster, anywhere from 450x to more than 14,000x faster in our tests.
Note: All performance numbers shown below are from 1000 “warm” runs (to exclude the effects of disk caching, etc.) of each query type.
On single-disk machines, many simple queries that just perform indexed lookups or table scans will be 1–3 milliseconds slower on TimescaleDB. For example, in our 100M row table with indexed time, hostname, and cpu usage information, the following query (#1) will take less than 3ms for each database, but an extra 1ms on TimescaleDB:
Query 1 — A simple query
SELECT date_trunc('minute', time) AS minute, MAX(usage_user) FROM cpu WHERE hostname = 'host_731' AND time >= '2016-01-01 02:17:08.646325 -7:00' AND time < '2016-01-01 03:17:08.646325 -7:00' GROUP BY minute ORDER BY minute ASC;
Similarly, even if we increase the amount of time the query is looking up (query #2), TimescaleDB lags by a few milliseconds:
Query 2 — A simpler query over larger period of time
SELECT date_trunc('minute', time) AS minute, MAX(usage_user) FROM cpu WHERE hostname = 'host_731' AND time >= '2016-01-01 07:47:52' AND time < '2016-01-01 19:47:52' GROUP BY minute ORDER BY minute ASC
TimescaleDB incurs a penalty of a few milliseconds on planning time, due to its use of multiple tables (chunks) instead of one big table. While this is something we plan on improving in the future, the vast majority of our users so far have been content to give up the extra few milliseconds in exchange for the benefits on other, more complex time-based queries (below).
Time-based queries often achieve 1.2x-5x superior performance in TimescaleDB, such as this one below that gets high usage of a device when CPU is over 90 over a specific time-period:
Query 3 — Time-based filter
SELECT * FROM cpu WHERE usage_user > 90.0 AND time >= '2016-01-01 00:00:00' AND time < '2016-01-02 00:00:00'
Further, larger queries involving time-based aggregations (GROUP BYs), which are quite common in time-oriented analysis, will be even faster in TimescaleDB.
The following is a query (#4) that touches 33M rows and is 5x faster in TimescaleDB when the entire (hyper)table is 100M rows, and around 2x faster at 1B rows:
Query 4 — Time-based aggregation
SELECT date_trunc('hour', TIME) AS hour, hostname, avg(usage_user) AS mean_usage_user FROM cpu WHERE TIME >= '2016-01-01 00:00:00' AND TIME < '2016-01-02 00:00:00' GROUP BY hour, hostname ORDER BY hour
Thanks to TimescaleDB’s time-space partitioning, our dataset is naturally partitioned into chunks, each of which has its own indexes. This has several benefits when processing a query with a time predicate or a time-based aggregate. Using constraint exclusion on the various chunks, TimescaleDB can scan less data, allowing the database to choose more suitable plans: e.g., perform calculations fully in memory vs. spilling to disk, utilize smaller indexes (so can walk across fewer index values), use better algorithms like HashAggregates vs. GroupAggregates, etc.
Time-ordering based queries
Finally, queries that can reason specifically about time ordering can be much more performant in TimescaleDB, from 450x to more than 14,000x faster in our tests.
For the following query (#5), TimescaleDB introduces a time-based “merge append” optimization to minimize the number of groups which must be processed (given its knowledge that time is already ordered).
Query 5: Order by limit by using merge append
SELECT date_trunc('minute', time) AS minute, max(usage_user) FROM cpu WHERE time < '2016-01-01 19:47:52' GROUP BY minute ORDER BY minute DESC LIMIT 5
For our 100M row table, this results in query latency that is over 450x faster than PostgreSQL. And in another run of the same query on 1 billion rows of data, we saw an over 14,000x improvement! (This surprised even me and I ran the benchmarks.)
These queries benefit from optimizations introduced in TimescaleDB that help to speed up query response times where time ordering is known. For example, TimescaleDB optimizes PostgreSQL’s native merge append to bring these significant efficiencies to ordered time-based aggregates. Such aggregates appear quite regularly for time-series analysis (e.g., query #5), which involve GROUP BYs, ORDER BYs, and LIMITs. As such, even without a strict time-range specified by the user, the database will only process those minimal set of chunks and data needed to answer this query.
More optimizations for these types of complex, time-oriented queries are in the works, and we expect to see similar improvements in the future.
Data retention: 2000x faster deletes
Time-series data often builds up very quickly, necessitating data retention policies, such as “only store raw data for one week.” TimescaleDB provides data management tools that make implementing data retention policies easy and significantly more performant than PostgreSQL.
In fact, it is common to couple data retention policies with the use of aggregations, so one might keep two hypertables: one with raw data, the other with data rolled up into minutely or hourly (etc) aggregates. Then, one could define different retention policies on the two hypertables, e.g., storing the aggregated data for longer.
TimescaleDB allows efficient deletion of old data at the chunk level, rather than at the row level, via its drop_chunks() functionality.
SELECT drop_chunks(interval ‘7 days’, ‘machine_readings’);
In the data retention benchmark below:
- Chunks are sized to 12 hours
- 100 million rows contain 6 total chunks
- 1 billion rows contain 60 total chunks
- 5 chunks were dropped from TimescaleDB and 5 12 hour intervals of data were deleted from PostgreSQL.
- The data retention mechanisms used for each are TimescaleDB’s drop_chunks() vs. the DELETE command in PostgreSQL.
- In this scenario, we used SELECT drop_chunks(‘2016–01–01T12:00:00Z’::TIMESTAMPTZ, ‘cpu_ts’); and increased it every 12 hours for each run.
TimescaleDB’s drop_chunks deletes all chunks from the hypertable that only include data older than the specified duration. Because chunks are individual tables, the delete results in simply deleting a file from the file system, and is thus very fast, completing in 10s of milliseconds. Meanwhile, PostgreSQL takes on the order of minutes to complete, since it must delete individual rows of data within a table.
TimescaleDB’s approach also avoids fragmentation in the underlying database files, which in turn avoids the need for vacuuming that can be additionally expensive (i.e., time-consuming) in very large tables.
Ultimately, that is why TimescaleDB is over 2000x faster than PostgreSQL when it comes to deleting old data for data retention.
Ease of use: New functions for time-series manipulation
Lastly, TimescaleDB includes a number of time-oriented features that aren't found in traditional RDBMS (e.g., PostgreSQL). These include more passive query optimizations (e.g., the time-based merge append in query #5 above) but also special time-oriented functions that one might desire when working with time-series data yet find completely lacking in SQL.
In other words, TimescaleDB optimizes SQL for easier, more effective time-series manipulation.
Two of our most used functions are:
(1) time_bucket, a more powerful version of the standard date_trunc function that allows for arbitrary time intervals (e.g., 5 minutes, 6 hours, etc.), as well as flexible groupings and offsets, instead of just second, minute, hour, etc. and
(2) last / first aggregates (our bookend functions), which allow you to get the value of one column as ordered by another.
For further reading, see Time-oriented Analytics.
To illustrate the gains in simplicity and ease of use that these functions realize for SQL users, let’s compare various workarounds for time_bucket with an actual example of time_bucket.
Stack Overflow is often a good gauge of a developer’s wants in a particular technology, so instead of suggesting one workaround over another, let’s just look there:
- Postgresql SQL GROUP BY time interval with arbitrary accuracy (down to milli seconds)
- SQL to group time intervals by arbitrary time period
- What is the fastest way to truncate timestamps to 5 minutes in Postgres?
Here is one of those Stack Overflow workarounds for time-bucketing:
-- You can generate a table of "buckets" by adding intervals created by generate_series(). -- This SQL statement will generate a table of five-minute buckets for the first day (the value of min(measured_at)) in your data. select (select min(measured_at)::date from measurements) + ( n || ' minutes')::interval start_time, (select min(measured_at)::date from measurements) + ((n+5) || ' minutes')::interval end_time from generate_series(0, (24*60), 5) n -- Wrap that statement in a common table expression, and you can join and group on it as if it were a base table. with five_min_intervals as ( select (select min(measured_at)::date from measurements) + ( n || ' minutes')::interval start_time, (select min(measured_at)::date from measurements) + ((n+5) || ' minutes')::interval end_time from generate_series(0, (24*60), 5) n ) select f.start_time, f.end_time, avg(m.val) avg_val from measurements m right join five_min_intervals f on m.measured_at >= f.start_time and m.measured_at < f.end_time group by f.start_time, f.end_time order by f.start_time
Now, compare that to TimescaleDB’s time_bucket function:
SELECT time_bucket('5 minutes', time) five_min, avg(cpu) FROM metrics GROUP BY five_min ORDER BY five_min DESC LIMIT 10;
The simplicity is apparent and in practice it reduces an enormous amount of mental friction (and saves development time!).
For the same comparison against last / first functions, see these two links:
In short, SQL is a powerful language that is widely used for good reason, yet in many ways it is sub-optimal in its current form for time-series data manipulation. TimescaleDB is introducing new time-oriented SQL functions so that any user of SQL can work with time-series data without having to abandon a reliable database and mature ecosystem they know and love for an obtuse query language, painful data management or data integrity issues.
PostgreSQL is a venerable and powerful database, yet it also lacks both functionality and performance when it comes to working with time-series data. Even so, many people use PostgreSQL to store time-series data or they use it alongside their NoSQL time-series database to store metadata.
Our goal at TimescaleDB is to build the best database for time-series-heavy applications, without complicating the stack. These applications are emerging in more and more places (e.g., IoT, logistics, finance, events), yet are insufficiently handled by existing solutions.
Up against PostgreSQL, TimescaleDB achieves 20x faster inserts at scale, 1.2x-14,000x faster time-based queries, 2000x faster deletes, and offers streamlined time-series functionality. So, if you are storing time-series data in PostgreSQL, there is little reason not to install TimescaleDB. And it’s quite easy to install, even right in your existing PostgreSQL instance.