Achieve automated materialized views using TimescaleDB continuous aggregates.
Time-series data provides significant value to organizations because it enables them to analyze important real-time and historical metrics. However, data is valuable only if it’s easy to access. That’s where being able to build dashboards that run repetitive analytical queries becomes a force multiplier for organizations looking to expose their time-series data across teams.
In this post, we will discuss why you should use TimescaleDB continuous aggregates to speed up queries that aggregate over time. Additionally, we will walk through a sample application that leverages this feature specific to TimescaleDB.
TL;DR Check out this Grafana dashboard that shows TimescaleDB continuous aggregates in action.
Why continuous aggregates?
Powering real-time dashboards is one of the top use cases for TimescaleDB. Dashboards are unique in that they are often pre-built and run repetitive queries. However, any database powering dashboards also needs to be able to support ad-hoc queries since this increases the agility of creating, updating, and adjusting dashboards.
PostgreSQL excels at these ad-hoc queries by offering the flexible SQL interface. However, as data sizes increase, PostgreSQL tends to fall over. That’s where TimescaleDB comes in!
With automated partitioning, TimescaleDB can optimize analytical time-series queries so that less data is read from disk. We’ve also built continuous aggregates, which further reduce disk throughput and compute requirements when running historical aggregate queries. By combining the inherent flexibility of SQL with the advanced capabilities of TimescaleDB, you can perform the ad-hoc queries needed to define your dashboards, as well as efficiently run repetitive analytical queries to power ongoing live dashboards.
Should I be using continuous aggregates?
We built continuous aggregates to significantly speed up repetitive queries that calculate aggregates over periods of time. It is not really meant for point queries, e.g. queries that search for values at a single or short period of time. Essentially, a continuous aggregates populates a materialized view that stores aggregates on a scheduled basis. By storing aggregates, it minimizes the computation required to run an aggregate query and the amount of data that needs to be read off disk.
TimescaleDB continuous aggregates do have some specific characteristics that you should be aware of as you decide if you want to use this feature.
- Continuous aggregates are scheduled. You define how up-to-date you want your aggregates to be. If you want the aggregates to be updated at the same time a value is written into the original hypertable, you’ll end up incurring write amplification at the time of insert. This will result in lower inserts per second. We typically recommend users looking for fast insert performance to allow aggregates to lag behind the bleeding insert edge.
- Continuous aggregates support out-of-order inserts. Some systems that implement continuous aggregates actually ignore out of order inserts. In TimescaleDB, when the scheduled job runs to update the continuous aggregate, we also check for any out of order inserts and update those associated aggregates.
- The continuous aggregate itself is actually a hypertable. You don’t have to worry about how to scale that particular table.
Walking through a sample application leveraging continuous aggregates
Let’s walk through how continuous aggregates work be exploring a sample application. We first introduced this application in our time_bucket() analytics post, where we showed how you could build flexible graphs using TimescaleDB’s time_bucket() function.
The sample application is written in Python. It essentially scrapes the Open AQ (air quality) API, parses the results, and stores all measurements collected from air quality sensors for all cities in Great Britain. You can check out the code here.
The Grafana dashboard located here describes step-by-step how to set up multiple continuous aggregates. To give you a quick sense of why continuous aggregates are really useful, here’s a query that we use in the Air Quality example:
SELECT time_bucket('1 day', time) as bucket, parameter_id, avg(value) as avg, max(value) as max, min(value) as min FROM measurements GROUP BY bucket, parameter_id;
You’ll notice that we are querying across all time and bucketing things by 1 day intervals. The computations of bucketing things by 1 day, as well as reading all that data off disk is high. The table at the point at which I’m running this query has ~700k rows. You can imagine how this query could get really slow as you add more data.
Below is an example query plan. Notice how much work is required to compute this query!
To speed this up, I wrote a continuous aggregate.
Now, let’s try to query the same thing, but query the measurements_daily table directly. The query plan is greatly simplified in that it is no longer scanning as much data.
Advanced configurations and toggles
When we designed continuous aggregates, we wanted users to be able to toggle and configure the characteristics that they wanted to achieve from continuous aggregates. You most likely won’t need to touch a lot of the toggles, but here are a couple that you should definitely be aware of.
timescaledb.refresh_interval: This configuration specifies how often you want a continuous aggregate to run. Note that when a continuous aggregate is run, it only runs on new data that it hasn’t observed before. This includes any data that came in out of order that invalidates older continuous aggregates. The value you choose here impacts how hard your background workers are working in order to calculate continuous aggregates.
timescaledb.refresh_lag: This configuration specifies how far behind a continuous aggregate runs compared to the insert bleeding edge. This impacts write amplification. If you set the refresh_lag to the negative bucket width, the continuous aggregate will immediately calculate and update the underlying materialized view whenever new data is received. However, this also means that you are executing multiple writes on insert, which impacts insert performance.
timescaledb.max_interval_per_job: Some of our users using continuous aggregates for the first time notice that their aggregates never populate. This is a situation where you want to take a look at this value - it controls how long a query can run in the background before we kill it. If you take a huge table and try to create a continuous aggregate on it for the first time, it’s likely that you’ll hit this barrier.
The best way to check if your configurations make sense is to use
timescaledb_information.continuous_aggregate_stats. These stats will show you important statistics about your continuous aggregates.
As you can see, continuous aggregates can be an extremely useful function when you are looking to further reduce disk throughput and compute requirements when running historical aggregate queries.
Need help along the way? You can always reach out to us on our community Slack channel.