Category: All posts
Feb 16, 2024
Posted by
Ryan Booz
Compared to PostgreSQL alone, TimescaleDB can dramatically improve query performance by 1,000x or more, reduce storage utilization by 90 %, and provide features essential for time-series and analytical applications. Some of these features even benefit non-time-series data–increasing query performance just by loading the extension.
PostgreSQL is today’s most advanced and most popular open-source relational database. We believe this as much today as we did five years ago when we chose PostgreSQL as the foundation of TimescaleDB because of its longevity, extensibility, and rock-solid architecture.
By loading the TimescaleDB extension into a PostgreSQL database, you can effectively “supercharge” PostgreSQL, empowering it to excel for both time-series workloads and classic transactional ones.
This article highlights how TimescaleDB improves PostgreSQL query performance at scale, increases storage efficiency (thus lowering costs), and provides developers with the tools necessary for building modern, innovative, and cost-effective time-series applications—all while retaining access to the full Postgres feature set and ecosystem.
(To show our work, this article also presents the benchmarks that compare query performance and data ingestion for one billion rows of time-series data between PostgreSQL 14.4 and TimescaleDB 2.7.2. For PostgreSQL, we benchmarked both using a single-table and declarative partitioning)
With orders of magnitude better performance at scale, TimescaleDB enables developers to build on top of PostgreSQL and “future-proof” their applications.
The core concept in TimescaleDB is the notion of the “hypertable”: seamless partitioning of data while presenting the abstraction of a single, virtual table across all your data.
This partitioning enables faster queries by quickly excluding irrelevant data, as well as enabling enhancements to the query planner and execution process. In this way, a hypertable looks and feels just like a normal PostgreSQL table but enables a lot more.
For example, one recent query planner improvement excludes data more efficiently for relative now()
-based queries (e.g., WHERE time >= now()-’1 week’::interval
). To be even more specific, these relative time predicates are constified at planning time to ignore chunks that don't have data to satisfy the query. Furthermore, as the number of partitions increases, planning times can be reduced by 100x or more over vanilla PostgreSQL for the same number of partitions.
When hypertables are compressed, the amount of data that queries need to read is reduced, leading to dramatic increases in performance of 1000x or more. For more information (including a discussion of this bar chart), keep reading the benchmark below.
Other enhancements in TimescaleDB apply to both hypertables and normal PostgreSQL tables, e..g, SkipScan, which dramatically improves DISTINCT queries on any PostgreSQL table with a matching B-tree index regardless of whether you have time-series data or not.
Today, nearly every time-series application reaches for rolling aggregations to query and analyze data more efficiently. The raw data could be saved per second, minute, or hour (and a plethora of other permutations in between), but what most applications display are time-based aggregates.
What's more, most time-series data applications are append-only, which means that aggregate queries return the same values over and over based on the unchanged raw data. It's much more efficient to store the results of the aggregate query and use those for analytic reporting and analysis most of the time.
Often, developers try materialized views in vanilla PostgreSQL to help, however, they have two main problems with fast-changing time-series data:
In contrast, TimescaleDB’s continuous aggregates solve both of these problems. They are updated automatically on the schedule you configure, they can have data retention policies applied separately from the underlying hypertable, and they only update the portions of new data that have been modified since the last materialization was run.
When we compare using a continuous aggregate to querying the data directly, customers often see queries that might take minutes or even hours drop to milliseconds. When that query is powering a dashboard or a web page, this can be the difference between snappy and unusable.
The number one driver of cost for modern time-series applications is storage. Even when storage is cheap, time-series data piles up quickly. Timescale provides two methods to reduce the amount of data being stored, compression and downsampling using continuous aggregates.
The Timescale hypertable is data heavily partitioned into many, many smaller partitions called “chunks.” Timescale provides native columnar compression on this per-chunk basis.
As we show in the benchmark results (and as we see often in production databases), compression reduced disk consumption by over 90% compared to the same data in vanilla PostgreSQL.
Even better, TimescaleDB doesn't change anything about the PostgreSQL storage system to achieve this level of compression. Instead, TimescaleDB utilizes PostgreSQL storage features, namely TOAST, to transition historical data from row-store to column-store, a key component for querying long-term aggregates over individual columns.
To demonstrate the effectiveness of compression, here’s a comparison of the total size of the CPU table and indexes in TimescaleDB and in PostgreSQL.
With the proper compression policy in place, hypertable chunks will be compressed automatically once all data in the chunk has aged beyond the specified time interval.
In practice, this means that a hypertable can store data as row-oriented for newer data and column-oriented for older data simultaneously. Having the data stored as both row and column store also matches the typical query patterns of time-series applications to help improve overall query performance—again, something we see in the benchmark results.
This reduces the storage footprint and improves query performance even further for many time-series aggregate queries. Compression is also automatic: users set a compression horizon, and then data is automatically compressed as it ages.
This also means that users can save significant costs using cloud services that provide separation of compute and storage—such as Timescale—so that larger machines aren’t needed just for more storage.
With TimescaleDB, automated data retention is achieved with one SQL command:
SELECT add_retention_policy('cpu', INTERVAL '7 days');
There's no further setup or extra extensions to install or configure. Each day any partitions older than 7 days will be dropped automatically. If you were to implement this in vanilla PostgreSQL you’d need to use DELETE to remove records, which is a very costly operation as it needs to scan for the data to remove. Even if you were using PostgreSQL declarative partitioning, you’d still need to automate the process yourself, wasting precious developer time, adding additional requirements, and implementing bespoke code that needs to be supported moving forward.
One can also combine continuous aggregates and data retention policies to downsample data and then drop the raw measurements, thus saving even more data storage.
Using this architecture, you can retain higher-level rollup values for a longer period of time, even after the raw data has been dropped from the database. This allows multiple different levels of granularity to be stored in the database, and provides even more ways to control storage costs.
TimescaleDB includes more features that speed up development time. This includes a library of over 100 hyperfunctions, which make complex time-series analysis easy using SQL, such as count approximations, statistical aggregates, and more. TimescaleDB also includes a built-in, multi-purpose job scheduling engine for setting up automated workflows.
Timescale hyperfunctions make data analysis in SQL easy. This library includes time-weighted averages, last observation carried forward, and downsampling with LTTP or ASAP algorithms, time_bucket(), and time_bucket_gapfill().
As an example, one could get the average temperature every day for each device over the last seven days, carrying forward the last value for missing readings with the following SQL.
SELECT
time_bucket_gapfill('1 day', time) AS day,
device_id,
avg(temperature) AS value,
locf(avg(temperature))
FROM metrics
WHERE time > now () - INTERVAL '1 week'
GROUP BY day, device_id
ORDER BY day;
For more information on the extensive list of hyperfunctions in TimescaleDB, please visit our API documentation.
TimescaleDB provides the ability to schedule the execution of custom stored procedures with user-defined actions. This feature provides access to the same job scheduler that TimescaleDB uses to run all of the native automation jobs for compression, continuous aggregates, data retention, and more.
This provides a similar functionality as a third-party scheduler likepg_cron
without needing to maintain multiple PostgreSQL extensions or databases.
We see users doing all sorts of neat stuff with user-defined actions, from calculating complex SLAs to sending event emails based on data correctness to polling tables.
Notably, because TimescaleDB is packaged as a PostgreSQL extension, it achieves these results without forking or breaking PostgreSQL.
Postgres is popular at the moment, but a lot of that popularity is with ‘Postgres compatible’ products which might look like Postgres, or talk like Postgres, or query somewhat like Postgres - but aren’t Postgres under the hood (and are sometimes closed-source).
TimescaleDB is just PostgreSQL. One can install other extensions, make full use of the type system, and benefit from the incredibly diverse Postgres ecosystem.
Any product that can connect to PostgreSQL can query time-series data stored with TimescaleDB using the same SQL it normally would. While we provide helper functions for working with data, we do not restrict the SQL features one can use. Once in the database, users can combine time series and business data as necessary.
PostgreSQL is not a new database: it has years of production deployments under its belt. High availability, backup and restore, and load-balancing are all solved problems. As we mentioned earlier, we chose Postgres because it was reliable, and TimescaleDB inherits that reliability.
This section provides details about how we tested TimescaleDB against vanilla PostgreSQL. Feel free to download the Time-Series Benchmarking Suite and run it for yourself. If you'd like to get started with TimescaleDB quickly, you can use Timescale, which lets you sign up for a free 30-day trial.
For this benchmark, all tests were run on the same m5.2xlarge EC2 instance in AWS us-east-1 with the following configuration and software versions.
We also deliberately chose to use EBS (elastic block storage) volumes rather than attached SSDs. While benchmark performance would certainly improve with SSDs, the baseline performance using EBS is illustrative of what many self-hosted users could expect while saving some expenses by using elastic storage.
We ran only one PostgreSQL cluster on the EC2 database instance. The TimescaleDB extension was loaded via shared_preload_libraries
but not installed into the PostgreSQL-only database.
To set sane defaults for the PostgreSQL cluster, we ran timescaledb-tune
and setsynchronous_commit=off
in postgresql.conf. This is a common performance configuration for write-heavy workloads while still maintaining transactional, logged integrity. All configuration changes applied to both PostgreSQL and TimescaleDB benchmarks alike.
As we mentioned earlier, for this benchmark, we used the Time-Series Benchmarking Suite and generated data for 4,000 devices, recording metrics every 10 seconds, for one month. This generated just over one billion rows of data. Because TimescaleDB is a PostgreSQL extension, we could use the same data file and ingestion process, ensuring identical data in each database.
TimescaleDB uses an abstraction called hypertables which splits large tables into smaller chunks, increasing performance and greatly easing management of large amounts of time-series data.
We also enabled native compression on TimescaleDB. We compressed everything but the most recent chunk of data, leaving it uncompressed. This configuration is a commonly recommended one where raw, uncompressed data is kept for recent time periods and older data is compressed, enabling greater query efficiency. The parameters we used to enable compression are as follows: we segmented by the tags_id
columns and ordered by time descending and usage_user
columns.
All benchmark results were performed on a single PostgreSQL table and on an empty TimescaleDB hypertable created with four-hour chunks.
(And for those thinking that we also need to compare TimescaleDB with PostgreSQL Declarative Partitioning, please read on to the end; we discuss that as well.)
For this benchmark, we inserted one billion rows of data and then ran a set of queries 100 times each against the respective database. The data, indexes, and queries are exactly the same for both databases. The only difference is that the TimescaleDB queries use the time_bucket()
function for doing arbitrary interval bucketing, whereas the PostgreSQL queries use the new date_bin()
function, introduced in PostgreSQL 13.
The results are clear and consistently reproducible. For one billion rows of data spanning one month of time (with four-hour partitions), TimescaleDB consistently outperformed a vanilla PostgreSQL database running 100 queries at a time.
There are two main reasons for Timescale's consistent query performance.
In PostgreSQL (and many other databases), table data is stored in an 8 Kb page (sometimes called a block). If a query has to read 1,000 pages to satisfy it, it reads ~8 Mb of data. If some of that data had to be retrieved from disk, then the query will usually be slower than if all of the data was found in memory (the reserved space known as shared buffers in PostgreSQL, if you’re looking for some insight into PostgreSQL caching we have a blog on that).
With TimescaleDB compression, queries that return the same results have to read significantly fewer pages of data (this is both because of the actual compression and because it can return single columns rather than whole rows). For all of our benchmarking queries, this also translates into higher concurrency for the benchmark duration.
Stated another way, compression typically impacts fetching historical data most because TimescaleDB can query individual columns rather than entire rows. Because less I/O is occurring for each query, TimescaleDB can handle more queries with a lower standard deviation than vanilla PostgreSQL.
Let's look at two examples of how this plays out between the two databases using two queries above, cpu-max-all-1
and single-groupby-1-1-12
.
single-groupby-1-1-12
We selected one of the queries from the benchmark and ran it on both databases. Recall that each database has the exact same data and indexes on uncompressed data. TimescaleDB has the advantage of being able to segment and order compressed data in a way that's beneficial to typical application queries.
EXPLAIN (ANALYZE,BUFFERS)
SELECT time_bucket('1 minute', time) AS minute,
max(usage_user) as max_usage_user
FROM cpu
WHERE tags_id IN (
SELECT id FROM tags WHERE hostname IN ('host_249')
)
AND time >= '2022-08-03 06:16:22.646325 +0000'
AND time < '2022-08-03 18:16:22.646325 +0000'
GROUP BY minute ORDER BY minute;
When we run the EXPLAIN
on this query and ask for BUFFERS
to be returned, we start to get a hint of what's happening.
Two things quickly jump out when I view these results. First, the execution times are significantly lower than the benchmarking results above. Individually, these queries execute pretty fast, but PostgreSQL has to read approximately 27x more data to satisfy the query. When 16 workers request data across the time range, PostgreSQL has to do a lot more I/O, which consumes resources. TimescaleDB can simply handle a higher concurrency for the same workload. We can see this in the full benchmarking output, too.
cpu-max-all-1
Again we can clearly see the impact of compression on the ability for TimescaleDB to handle a higher concurrent load when compared to vanilla PostgreSQL for time-series queries.
EXPLAIN (ANALYZE, buffers)
SELECT
time_bucket('3600 seconds', time) AS hour,
max(usage_user) AS max_usage_user,
max(usage_system) AS max_usage_system,
max(usage_idle) AS max_usage_idle,
max(usage_nice) AS max_usage_nice,
max(usage_iowait) AS max_usage_iowait,
max(usage_irq) AS max_usage_irq,
max(usage_softirq) AS max_usage_softirq,
max(usage_steal) AS max_usage_steal,
max(usage_guest) AS max_usage_guest,
max(usage_guest_nice) AS max_usage_guest_nice
FROM cpu
WHERE
tags_id IN (
SELECT id FROM tags WHERE hostname IN ('host_249')
)
AND time >= '2022-08-08 18:16:22.646325 +0000'
AND time < '2022-08-09 02:16:22.646325 +0000'
GROUP BY HOUR
ORDER BY HOUR;
With compression, TimescaleDB does significantly less work to retrieve the same data, resulting in faster queries and higher query concurrency.
TimescaleDB hypertables require a time column to partition the data. Because time is an essential (and known) part of each row and chunk, TimescaleDB can intelligently improve how the query is planned and executed to take advantage of the time component of the data.
For example, let's query for the maximum CPU usage for each minute for the last 10 minutes.
EXPLAIN (ANALYZE,BUFFERS)
SELECT time_bucket('1 minute', time) AS minute,
max(usage_user)
FROM cpu
WHERE time > '2022-08-14 07:12:17.568901 +0000'
GROUP BY minute
ORDER BY minute DESC
LIMIT 10;
Because TimescaleDB understands that this query is aggregating on time and the result is ordered by the time column (something each chunk is already ordering by in an index), it can use the ChunkAppend custom execution node. In contrast, PostgreSQL plans five workers to scan all partitions before sorting the results and finally doing a GroupAggregate
on the time column.
TimescaleDB scans fewer data and doesn't need to spend time re-sorting the data that it knows is already sorted in the chunk. For time-series data with a known order and constraints, TimescaleDB works better for most queries than vanilla PostgreSQL.
Intriguingly, ingest performance for both TimescaleDB and PostgreSQL are nearly identical, a dramatic improvement for PostgreSQL given the results five years ago with PostgreSQL 9.6. However, TimscaleDB still consistently finished with an average rate of 3,000 to 4,000 rows/second higher than a single PostgreSQL table.
This shows that while vast improvements have been made in PostgreSQL, Timescale hypertables also continue to perform exceptionally well. As well as the rate, the other characteristics of ingest performance are nearly identical between TimescaleDB and PostgreSQL. Modifying the batch size for the number of rows to insert at a time impacts each database the same: small batch sizes or a few hundred rows significantly hinder ingest performance, while batch sizes of 10,000 to 15,000 rows seem to be about optimal for this dataset.
In the benchmarks above, we tested TimescaleDB against a single PostgreSQL table simply because that’s the default option that most people end up using. PostgreSQL also has support for native declarative partitioning, which has also been maturing over the past few years.
For the sake of completeness, we also tested TimescaleDB against native declarative partitioning. As the graphic below shows, TimescaleDB is still 1,000x faster for some queries, with strong performance gains still showing across the board. Ingest performance was similar between TimescaleDB and declarative partitioning.
In fact, if anything, the takeaway from these tests was that while declarative partitioning has matured, the gap between using a single table and declarative partitioning has shrunk.
Using declarative partitioning is also harder. One needs to manually pre-create partitions, ensure there are no data gaps, ensure no data is inserted outside of your partition ranges, and create more partitions as time moves on.
In contrast, with TimescaleDB, one does not need any of this. Instead, a single create_hypertable
command is used to convert a standard table into a hypertable, and TimescaleDB takes care of the rest.
TimescaleDB harnesses the power of the extension framework to supercharge PostgreSQL for time-series and analytical applications. With additional features like compression and continuous aggregates, TimescaleDB provides not only the most performant way of using time-series data in PostgreSQL but also the best developer experience.
When compared to traditional PostgreSQL, TimescaleDB enables 1,000x faster time-series queries, compresses data by 90 %, and provides access to advanced time-series analysis tools and operational features specifically designed to ease data management. TimescaleDB also provides benefits for other types of queries with features like SkipScan—just by installing the extension.
In short, TimescaleDB extends PostgreSQL to enable developers to continue to use the database they love for time series, perform better at scale, spend less, and stream data analysis and operations.
If you’re looking to expand your database scalability, try our hosted service, Timescale. You will get the PostgreSQL you know and love with extra features for time series (continuous aggregation, compression, automatic retention policies, hyperfunctions). Plus, a platform with automated backups, high availability, automatic upgrades, and much more. You can use it for free for 30 days; no credit card required.