A different and (often) better way to downsample your Prometheus metrics


Observability is the ability to measure a system’s state based on the data it generates. To be effective, observability tools first have to be able to ingest data about the system from a wide variety of sources, typically in the form of metrics, traces, logs and metadata. Second, they must offer powerful, flexible and fast capabilities to analyze and correlate all that data to understand the health and performance of the system and identify issues and areas for improvement.

Promscale is an easy-to-use observability backend, built on top of TimescaleDB. Our vision is to enable engineers to store all their observability data - metrics, traces, logs and metadata - in a single mature and scalable store, and to analyze that data through a unified and complete SQL interface. Earlier this month, as part of our #AlwaysBeLaunching series of monthly launches, we launched support for traces in Promscale, enabling developers to interrogate their trace data and unlock new insights to help them identify problems and potential optimizations in their microservices environments in a way not possible with other open-source tracing tools. Promscale also supports storing and querying (in SQL and PromQL) metric data from Prometheus, the de facto monitoring standard for modern cloud-native environments. Today, we’re excited to introduce a better option for downsampling Prometheus metrics, enabling developers to do accurate and flexible trend analysis on those metrics over long periods of time with high performance and reduced storage costs. Prometheus downsampling leverages continuous aggregates, one of the most popular and powerful features of TimescaleDB.

Metric monitoring is a key pillar of any observability stack used to operate microservice-based systems running on Kubernetes. Those systems are made up of many individual components and are very dynamic, with frequent deployments, vertical and horizontal autoscaling of pods, and automatic provisioning of new nodes in the cluster. Prometheus is a great fit for those environments thanks to its straightforward auto-discovery of components to monitor and its support for dimensional metrics (i.e., metrics with labels).

One thing is certainly true of Prometheus metric data: there is a lot of it. Each component in a Kubernetes-based system (and there are many of them) emits lots of metrics that are collected regularly (every 1 minute by default) and stored by Prometheus. As an example, just the node exporter, which is used to monitor hosts, emits hundreds of different metric data points (aka samples) every collection period. Hundreds of thousands (even millions) of samples per second are fairly common in production environments, and that volume is expensive to store for long periods of time. Yet, as data ages, individual samples become less important and we care more about general trends and aggregates. For example, we may want real-time, high-resolution access to a metric like the number of API requests customers make to our application, so that a sudden change can alert us to a potential problem or the need to scale our systems. We may also want to use the same metric to understand how adoption of our API is growing over time, in which case we need to query data over long periods (a year, for example) where resolution matters much less (one data point per day per customer would be enough). How do we make the adoption query fast and reduce storage costs by keeping the data only at the resolution we need? We use downsampling.

What is downsampling?

Downsampling reduces the rate of a signal and, as a result, both the resolution and the size of the data. The main reasons to do this are cost and performance: storing the data becomes cheaper and querying it becomes faster as the size of the data decreases.

The easiest form of downsampling is to collect fewer data points. With Prometheus this could be achieved by increasing the metric scraping interval (i.e., decreasing how often Prometheus collects metrics) at the cost of less visibility into metric changes between scrapes. However, as explained above, this is not typically what we want. In observability, the value of data diminishes with its age. We want very high resolution for our more recent data while it’s perfectly fine for old data to have much lower granularity, so it’s cheaper to store and faster to query.

A more sophisticated form of downsampling is to summarize and aggregate individual data points, often by bucketing data by time (i.e., hours, days, weeks, etc.). Summarizing data in this way reduces the amount of data that needs to be processed and stored. Therefore it improves the performance of queries for aggregate statistics over longer time spans and allows users to keep information about key features of their data for longer at a reasonable cost. In observability, the individual high-resolution samples are also kept but typically for a much shorter period of time since they are mainly used for troubleshooting issues right after they occur.

Downsampling with Prometheus

In the Prometheus ecosystem, downsampling is usually done through recording rules. These rules operate on a fairly simple mechanism: on a regular, scheduled basis, the rules engine runs a set of user-configured queries on the data that came in since the rule was last run and writes the query results to another configured metric. So if we have a metric called api_requests_total, we can define a recording rule for a new metric called customer:api_requests:rate1day and configure the rules engine to calculate the daily rate of API requests every hour and write the result to the new metric. The rules file would look like the following:

groups:
  - name: daily_stats
    interval: 1h
    rules:
    - record: customer:api_requests:rate1day
      expr: sum by (customer) (rate(api_requests_total[1d]))

When it comes to querying the data we will run a PromQL query against the new aggregated metric. For example to see the evolution in the number of API calls per day by customer we could run the following query:

customer:api_requests:rate1day

Most of the Prometheus ecosystem, including Promscale, supports downsampling using recording rules. While recording rules provide an easy-to-use mechanism to speed up long-term queries, they have some important limitations:

  1. Data is delayed by the time needed to aggregate it. If the resolution of our recording rule is 1 hour, queries on that metric will not include data since the last aggregation (anywhere between 0 and 60 minutes).
  2. They don’t necessarily help with reducing storage costs. Promscale does provide the ability to configure a default retention and then per metric retention overrides but many other storage systems for Prometheus don’t provide the ability to configure different retention policies for different metrics. For example, neither Thanos nor Cortex offer that capability.
  3. They compute results at one particular resolution; if we later want to aggregate into larger buckets, we can end up with inaccurate results (e.g., an average of averages, or re-aggregated histograms).
  4. By default, they are only applied to data points received after the recording rule is created and require an additional manual step to backfill data.
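The third limitation is easy to demonstrate. Here is a small self-contained sketch in plain PostgreSQL (the sample values are hypothetical): re-averaging pre-computed bucket averages diverges from the true average whenever buckets hold different numbers of samples.

```sql
-- Bucket 'a' holds one sample (10.0); bucket 'b' holds three samples (all 0.0).
WITH samples(bucket, value) AS (
    VALUES ('a', 10.0), ('b', 0.0), ('b', 0.0), ('b', 0.0)
), bucket_avgs AS (
    SELECT bucket, avg(value) AS bucket_avg
    FROM samples
    GROUP BY bucket
)
SELECT
    (SELECT avg(value) FROM samples)          AS true_avg,    -- 2.5
    (SELECT avg(bucket_avg) FROM bucket_avgs) AS avg_of_avgs; -- 5.0
```

The average of averages weights each bucket equally instead of each sample, which is why a pre-aggregated metric generally cannot be re-aggregated accurately with recording rules.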

Introducing Downsampling with Promscale

Today we are announcing the beta release of an additional downsampling method in Promscale, called continuous aggregation, that is more timely and accurate than recording rules in many circumstances. Combined, these two methods cover the majority of downsampling use cases.

Read on to learn more about continuous aggregates in Promscale, how to set them up, how to query them and how to decide when to use continuous aggregates and when to use Prometheus recording rules.

To get started right away, jump ahead to the Get started with Promscale section at the end of this post.

And, if these are the types of challenges you’d like to help solve, we are hiring (see all roles)!

Benefits of continuous aggregates

Promscale continuous aggregates leverage a TimescaleDB feature of the same name to have the database manage data materialization and downsampling for us. This mechanism improves on some aspects of recording rules but is not always appropriate. Combined, recording rules and continuous aggregates cover a large portion of the use cases we have seen.

Continuous aggregates address the following limitations of recording rules:

  • Timeliness. With recording rules, users only see the results of the query once the rules engine has run the materialization but not as soon as data comes in. This might not be such a big deal for 5-minute aggregates (although it could be) but for hourly or daily aggregates it could be a significant limitation. Continuous aggregates have a feature called real-time aggregates where the database automatically combines the materialized results with a query over the newest not-yet-materialized data to give us an accurate up-to-the-second view of our data.
  • Rollups. Downsampling is defined for particular time-bucket granularities (e.g. 5 minutes). But, when performing analysis, we may want to look at longer aggregates (e.g.  1 hour). With recording rules this is sometimes possible (a minimum of many minimums is the same as the minimum of the samples) but often it isn’t (the median of many medians is not the same as the median of the underlying samples). Continuous aggregates solve this by storing the intermediate state of an aggregate in the materialization, making further rollups possible. Read more about the way we define aggregates in our previous blog post.
  • Query flexibility for retrospective analysis. Once a query for a recording rule is defined, the resulting metric is sufficient to answer only that one query. However, when using continuous aggregates, we can use multi-purpose aggregates. For instance, Timescale’s toolkit extension has aggregates that support percentile queries on any percentile, and statistical aggregates supporting multiple summary aggregates. The aggregates that we define when we configure the materialization are much more flexible in what data we can derive at query time.
  • Backfilling. Prometheus recording rules only downsample data collected after the recording rule is created. The Prometheus community created a tool to backfill data, but it requires an additional manual step and has a number of limitations that make it complex to use on a regular basis or to automate. Continuous aggregates automatically downsample all available data, including past data, so we can start benefiting from the performance improvements of the aggregated metric as soon as it is created.

Downsampling using continuous aggregates in Promscale

To downsample data, first we need to have some raw data. In Promscale, each metric is stored in a hypertable which contains the following columns:

  1. time column, which stores the timestamp of the reading.
  2. value column, which stores the sample reading as a float.
  3. series_id column, which stores a foreign key to the table that defines the series (label set) of the reading.

This corresponds to the Prometheus data model.

Let's imagine we have a metric called node_memory_MemFree. We can create a continuous aggregate to derive summary statistics (min, max, average) about the readings on an hourly basis. To do so, we run the following query against the underlying TimescaleDB database, using any tool that can connect to PostgreSQL and execute queries, such as psql.

CREATE MATERIALIZED VIEW node_memfree_1hour
WITH (timescaledb.continuous) AS
SELECT
    timezone('UTC',
      time_bucket('1 hour', time) AT TIME ZONE 'UTC' + '1 hour')
        as time,
    series_id,
    min(value) as min,
    max(value) as max,
    avg(value) as avg
FROM prom_data.node_memory_MemFree
GROUP BY time_bucket('1 hour', time), series_id;

Note: we add 1 hour to time_bucket to match the PromQL semantics of representing a bucket with the timestamp at the end of the bucket instead of the start of the bucket.
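TimescaleDB will not keep the materialization up to date on its own unless we give it a refresh policy. As a sketch (the offsets below are illustrative choices, not prescriptions), an hourly refresh policy could look like:

```sql
-- Refresh the materialization every hour, covering the window from
-- 3 hours ago up to 1 hour ago; newer data is served by real-time
-- aggregation instead.
SELECT add_continuous_aggregate_policy('node_memfree_1hour',
    start_offset      => INTERVAL '3 hours',
    end_offset        => INTERVAL '1 hour',
    schedule_interval => INTERVAL '1 hour');
```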

For more information on continuous aggregates and all their options, refer to the documentation. This continuous aggregate can now be queried via SQL and we can also make it available to PromQL queries (as a reminder, Promscale is 100% PromQL compliant). To do this we have to register it with Promscale as a PromQL metric view:

SELECT register_metric_view('public', 'node_memfree_1hour');

The first argument is the PostgreSQL schema that contains the continuous aggregate (created in the "public" schema by default) and the second is the name of the view. The name of the view becomes the name of the new metric.

Now, we can treat this data as a regular metric in both SQL and PromQL.

Querying the data

Promscale currently offers two distinct ways of querying data: PromQL and SQL. In this blog post we will primarily use PromQL, but we will also show the equivalent SQL queries you would use in a dashboarding tool like Grafana (for more on using Grafana with TimescaleDB and Promscale, see our Guide to Grafana YouTube series). Note that the raw query results from the PromQL and SQL queries would be formatted differently, but they would display the same when using a Time series chart in Grafana.

The new aggregated metric is queried like any other Prometheus metric in Promscale.

PromQL:

node_memfree_1hour{__column__="avg"}

SQL:

SELECT time, jsonb(labels) as metric, avg
FROM node_memfree_1hour m
INNER JOIN prom_series.node_memory_MemFree s 
    ON (m.series_id=s.series_id)
ORDER BY time asc;

Note: typically you will want to query a certain time window rather than return all the data. In PromQL the time window is defined as part of the query_range API call, not in the query itself, while in SQL it is specified in the query via a WHERE clause. For example, the SQL for querying the last 24 hours of data would be

SELECT time, jsonb(labels) as metric, avg
FROM node_memfree_1hour m
INNER JOIN prom_series.node_memory_MemFree s 
    ON (m.series_id=s.series_id)
WHERE time > NOW() - INTERVAL '24 hours'
ORDER BY time asc;

When using SQL in Grafana, in order for the query to use the time window selected in the time picker, the WHERE clause would be WHERE $__timeFilter(time).

In PromQL, the special __column__ label in the query specifies which column of the view will be returned. As we saw when defining the continuous aggregate, we can define multiple statistical aggregates in the same continuous aggregate as different columns. The __column__ selector specifies which one of those columns to return. By default, the column named value is used.

This PromQL query returns all the series with their average aggregates in hourly time buckets.

{
   "status" : "success",
   "data" : {
      "resultType" : "matrix",
      "result" : [
         {
            "metric" : {
               "__name__" : "node_memfree_1hour",
               "__schema__" : "public",
               "__column__" : "avg",
               "node" : "prometheus",
               "instance" : "xyz"
            },
            "values" : [
               [ 1628146800, "15.98" ],
               [ 1628150400, "16.02" ],
               [ 1628154000, "16.05" ]
            ]
         },
         ...

The results return the hourly average of the node_memory_MemFree metric. As we can see, the results contain some special labels: the __schema__ label gives us the schema of the metric view and the __column__ label gives the column of the continuous aggregate we are querying.

Note that, thanks to real-time aggregates, the results will return the latest data as well, even if it has not yet been materialized.
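If you ever want the opposite trade-off, querying only what has already been materialized (cheaper queries, staler results), real-time aggregation can be toggled on the view itself. Note that the default for this setting can differ across TimescaleDB versions:

```sql
-- Serve only materialized buckets from this view; set the option back
-- to false to restore real-time aggregation.
ALTER MATERIALIZED VIEW node_memfree_1hour
    SET (timescaledb.materialized_only = true);
```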

Taking advantage of TimescaleDB hyperfunctions

TimescaleDB hyperfunctions, a series of SQL functions within TimescaleDB that make it easier to manipulate and analyze time-series data in PostgreSQL with fewer lines of code, provide several advanced aggregates that may be of special interest to Promscale users. For example, there is a hyperfunction to calculate approximate percentiles over time. We may want to see the 1st percentile of free memory to find nodes that underutilize memory (if the 1st percentile is high, the machine has a lot of free memory most of the time). To do this, we could get the 1st percentile of free memory by defining an aggregate like:

CREATE MATERIALIZED VIEW node_memfree_30m_aggregate
WITH (timescaledb.continuous)
AS SELECT
    time_bucket('30 min', time) as bucket,
    series_id,
    percentile_agg(value) as pct_agg
FROM prom_data.node_memory_MemFree
GROUP BY time_bucket('30 min', time), series_id;

The aggregate contains a sketch that can be used to answer the percentile query for any percentile.  For example, to create a view showing the first and fifth percentile:

CREATE VIEW node_memfree_30m AS 
SELECT
    bucket + '30 min' as time,
    series_id,
    approx_percentile(0.01, pct_agg ) as first,
    approx_percentile(0.05, pct_agg ) as fifth 
FROM node_memfree_30m_aggregate;

SELECT register_metric_view('public', 'node_memfree_30m');

We can now perform queries such as

PromQL:

node_memfree_30m{__column__="first"}

SQL:

SELECT time, jsonb(labels) as metric, first
FROM node_memfree_30m m
INNER JOIN prom_series.node_memory_MemFree s 
    ON (m.series_id=s.series_id)
ORDER BY time asc;

You might ask: why add the complexity of creating two views for the same aggregate? The answer is that this allows us to change the accessors we expose to PromQL without having to recalculate the materialization. For example, we could add a median (50th percentile) column by changing the node_memfree_30m view without changing the node_memfree_30m_aggregate materialization. In addition, we can derive more coarse-grained aggregations from fine-grained materializations: for example, we can create a 1-hour view based on the 30-minute materialization and have the results be accurate.
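As a sketch of that flexibility (the 1-hour view name below is ours, chosen for illustration), we could extend the accessor view with a median and roll the same sketches up into coarser buckets:

```sql
-- Recreate the accessor view with an extra median column; the
-- materialization (node_memfree_30m_aggregate) is left untouched.
CREATE OR REPLACE VIEW node_memfree_30m AS
SELECT
    bucket + '30 min' as time,
    series_id,
    approx_percentile(0.01, pct_agg) as first,
    approx_percentile(0.05, pct_agg) as fifth,
    approx_percentile(0.5,  pct_agg) as median
FROM node_memfree_30m_aggregate;

-- Derive an accurate 1-hour view from the 30-minute materialization
-- by combining the percentile sketches with rollup().
CREATE VIEW node_memfree_1h AS
SELECT
    time_bucket('1 hour', bucket) + '1 hour' as time,
    series_id,
    approx_percentile(0.01, rollup(pct_agg)) as first
FROM node_memfree_30m_aggregate
GROUP BY time_bucket('1 hour', bucket), series_id;
```

Because the materialization stores the intermediate sketch rather than final percentile values, both views read from the same stored data.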

For more information about two-step aggregation, and why we use it, see our blog post on the topic.

Aggregating counters

TimescaleDB hyperfunctions also contain a counter aggregate hyperfunction that is able to store information for Prometheus-style resetting counters. This aggregate is able to derive all the counter-specific Prometheus functions: rate, irate, increase and resets:

CREATE MATERIALIZED VIEW cpu_usage_30m_aggregate
WITH (timescaledb.continuous)
AS SELECT
    time_bucket('30 min', time) as bucket,
    series_id,
    counter_agg(time, value, time_bucket_range('30 min', time)) as cnt_agg
FROM prom_data.cpu_usage
GROUP BY time_bucket('30 min', time), series_id;

CREATE OR REPLACE VIEW cpu_usage_30m AS
SELECT
    bucket + '30 min' as time,
    series_id,
    extrapolated_delta(cnt_agg, method =>'prometheus') as increase,
    extrapolated_rate(cnt_agg, method => 'prometheus') as rate,
    irate_right(cnt_agg) as irate,
    num_resets(cnt_agg)::float as resets
FROM cpu_usage_30m_aggregate;

SELECT prom_api.register_metric_view('public', 'cpu_usage_30m');

We can now perform queries such as

PromQL:

cpu_usage_30m{__column__="rate"}

SQL:

SELECT time, jsonb(labels) as metric, rate
FROM cpu_usage_30m m
INNER JOIN prom_series.cpu_usage s ON (m.series_id=s.series_id)
ORDER BY time asc;

To make it easier to create and manage continuous aggregates, we plan to create more simplified interfaces in the next few releases and are actively looking for feedback from the community in this GitHub discussion.

Data retention for downsampled data

We now have a new metric which is derived from existing raw metric data points. By default, it will use the same data retention policy as our other metrics, which might not be what we want. Usually, downsampled data is kept for longer since it allows for long term analysis without incurring the storage and performance costs of raw data. Therefore, we want to keep access to the aggregates of the data even after our raw metric data has been dropped.

To enable this, all we need to do is increase the retention period for our new metric. We do this by setting a retention period like we would do on any other metric in the system.

SELECT set_metric_retention_period('public', 'cpu_usage_30m', INTERVAL '365 days');

This will increase the retention period of our continuous aggregate to a full year, even if the underlying metric data on which it was based has been deleted after it reached its retention period.
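Conversely, since the aggregated metric now preserves the long-term trend, we can shorten the retention of the raw metric it was built from. A sketch (the 30-day window is an illustrative choice):

```sql
-- Keep raw cpu_usage samples for only 30 days; the aggregated metric
-- keeps the trend for a full year.
SELECT set_metric_retention_period('cpu_usage', INTERVAL '30 days');
```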

Considerations

Before using Promscale continuous aggregates there are a few considerations to take into account.

First, if the __column__ label matcher is not specified, it defaults to value, meaning the query will look for a column named value. If that column does not exist, we get an empty result (since nothing in the system will match). To take advantage of this, consider giving the column you want as the default the name value when creating the continuous aggregate. In our node_memfree_1hour example, we could have used the following continuous aggregate instead:

CREATE MATERIALIZED VIEW node_memfree_1hour
WITH (timescaledb.continuous) AS
SELECT
    timezone('UTC',
      time_bucket('1 hour', time) AT TIME ZONE 'UTC' + '1 hour')
        as time,
    series_id,
    min(value) as min,
    max(value) as max,
    avg(value) as value
FROM prom_data.node_memory_MemFree
GROUP BY time_bucket('1 hour', time), series_id;

With this configuration, getting the value of the average of node_memfree_1hour with PromQL would simply be

node_memfree_1hour

So no need to pass a __column__ selector.

Second, both the __schema__ and __column__ label matchers support only exact matching; no regex or other multi-value matchers are allowed. Also, metric views are excluded from queries that match multiple metrics (i.e., matching on metric names with a regex).

{__name__=~"node_mem.*"} # this valid PromQL query will not match our previously created metric view

Finally, if we ingest a new metric with the same name as a registered metric view, it will result in the creation of a new metric with the same name but a different schema (all ingested metrics are automatically added to the prom_data schema, while the new aggregated metric would be in the public schema by default). This would likely cause confusion when querying a metric by its name, since by default Promscale will query the metric in the prom_data schema (we could specify a different __schema__ label in our query). To avoid this, make sure you give your continuous aggregate views names that raw ingested metrics will not have, like node_memfree_1hour.

Conclusion

In this blog post, we have looked at two ways to downsample Prometheus metrics: Prometheus recording rules and Promscale continuous aggregates, which offer additional capabilities.

Thankfully, Promscale supports downsampling and custom retention policies with both recording rules and continuous aggregates so you can choose the right solution for your needs.

There are a few things to take into account when deciding on a downsampling solution:

  • Access to recent data. If the materialization will be used in operational or real-time dashboards, prefer continuous aggregates because of the real-time aggregate feature.
  • Size of the time-bucket. Continuous aggregates materialize the intermediate, not the final form, so querying the data is a bit more expensive than with recording rules. Thus, continuous aggregates are better when aggregating more data points together (1 hour or more of data), while recording rules are better for small buckets.
  • Number of metrics in materialization. Currently, continuous aggregates can only be defined on a single metric. If you need to materialize queries on more than one metric, use recording rules. However, you should also consider whether joining the materialized metrics (the result of the materialization instead of the raw input) may be a better approach.
  • Query flexibility. If you know the exact queries that will be run on the materialization, recording rules may be more efficient. However, if you want flexibility, continuous aggregates can answer more queries based on the materialized data.
  • Access to old data. If you need old data points to also be aggregated as soon as downsampling for a metric is configured, continuous aggregates would be a better choice, especially if this is something you think you will be doing often, since recording rules require additional steps to backfill data.

Promscale continuous aggregates are currently in beta and we are actively looking for feedback from the community to help us make them better. Share your feedback in this GitHub discussion.

Get started with Promscale

If you’re new to Promscale and want to get started with continuous aggregates today:

  • Install Promscale today via Helm Charts, Docker, and others. Follow the instructions in our GitHub repository. As a reminder, Promscale is open-source and completely free to use. (GitHub ⭐️  welcome and appreciated! 🙏.)
  • See our docs to set up downsampling with continuous aggregates on your Promscale instance today.
  • Check out our Getting Started with Promscale tutorial for more on how Promscale works with Prometheus, installation instructions, sample PromQL and SQL queries, and more.
  • Watch Promscale 101 YouTube playlist for step-by-step demos and best practices.

Whether you’re new to Promscale or an existing community member, we’d love to hear from you! Join TimescaleDB Slack, where you’ll find 7K+ developers and Timescale engineers active in all channels. (The dedicated #promscale channel has 2.5K+ members, so it’s a great place to connect with like-minded community members, ask questions, get advice, and more).
