Introducing hyperfunctions: new SQL functions to simplify working with time-series data in PostgreSQL

Introducing hyperfunctions: new SQL functions to simplify working with time-series data in PostgreSQL

Today, we’re excited to launch TimescaleDB hyperfunctions, a series of SQL functions within TimescaleDB that make it easier to manipulate and analyze time-series data in PostgreSQL with fewer lines of code. You can use hyperfunctions to calculate percentile approximations of data, compute time-weighted averages, downsample and smooth data, and perform faster COUNT DISTINCT queries using approximations. Moreover, hyperfunctions are “easy” to use: you call a hyperfunction using the same SQL syntax you know and love.

At Timescale, our mission is to enable every software developer to store, analyze, and build on top of their time-series data, so that they can measure what matters in their world: IoT devices, IT systems, marketing analytics, user behavior, financial metrics, and more. (For example, we’ve built a free multi-node, petabyte-scale, time-series database; a multi-cloud, fully-managed service for time-series data; and Promscale, an open-source analytics platform for Prometheus monitoring data.)

We made the decision early in the design of TimescaleDB to build on top of PostgreSQL. We believed then, as we do now, that building on the world’s fastest-growing database would have numerous benefits for our customers. Perhaps the biggest of these advantages is in developer productivity. Developers can use the tools and frameworks they know and love and bring all their skills and expertise with SQL with them.

SQL is a powerful language and we believe that by adding a specialized set of functions for time-series analysis, we can make it even better.

Today, there are nearly three million active TimescaleDB databases running mission-critical time-series workloads across industries. Time-series data comes at you fast, sometimes generating millions of data points per second. In order to measure everything that matters, you need to capture all of the data you possibly can. Because of the volume and rate of information, time-series data can be complex to query and analyze.

As we interviewed customers and learned how they analyze and manipulate time-series data, we noticed several common queries begin to take shape. Often, these queries were difficult to compose in standard SQL. TimescaleDB hyperfunctions are a series of SQL functions to address the most common, and often most difficult, queries developers write today. We made the decision to take the hard path ourselves so that we could give developers an easier path.

Hyperfunctions included in this initial release

Today, we’re releasing several hyperfunctions, including:

  • Time-Weighted Average allows you to take the average over an irregularly spaced dataset that only includes changepoints.
  • Percentile-Approximation brings percentile analysis to more workflows. When used with continuous aggregates, you can compute percentiles over any time range of your dataset in near real-time and use them for baselining and normalizing incoming data. For maximum control, we provide implementations of two different approximation algorithms:
    • Uddsketch gives formal guarantees to the accuracy of approximate percentiles, in exchange for always returning a range of possible values.
    • T-Digest gives fuzzier guarantees which allow it to be more precise at the extremes of the distribution.
  • Hyperloglog enables faster approximate COUNT DISTINCT, making it easier to track how the cardinality of your data changes over time.
  • Counter Aggregate enables working with counters in an ergonomic SQL-native manner.
  • ASAP Smoothing smooths datasets to bring out the most important features when graphed.
  • Largest Triangle Three Buckets Downsampling reduces the number of elements in a dataset while retaining important features when graphed.
  • Stats-agg makes using rolling, cumulative and normalized statistics as easy as their standard counterparts.

Note that Hyperfunctions work on TimescaleDB hypertables, as well as regular PostgreSQL tables.

New SQL functions, not new SQL syntax

We made the decision to create new SQL functions for each of the time-series analysis and manipulation capabilities above. This stands in contrast to other efforts which aim to improve the developer experience by introducing new SQL syntax.

While introducing new syntax with new keywords and new constructs may have been easier from an implementation perspective, we made the deliberate decision not to do so since we believe that it actually leads to a worse experience for the end-user.

New SQL syntax means that existing drivers, libraries, and tools may no longer work. This can leave developers with more problems than solutions as their favorite tools, libraries, or drivers may not support the new syntax, or may require time-consuming modifications to do so.

On the other hand, new SQL functions mean that your query will run in every visualization tool, database admin tool, or data analysis tool. We have the freedom to create custom functions, aggregates, and procedures that help developers better understand and work with their data, and ensure all their drivers and interfaces still work as expected.

Hyperfunctions are written in Rust

Rust was our language of choice for developing the new hyperfunctions. We chose it for its superior productivity, community, and the pgx software development kit. We felt Rust was a more friendly language for a project like ours and would encourage more community contributions.

The inherent safety of Rust means we could focus more time on feature development rather than worrying about how the code is written. The extensive Rust community (💗 crates.io), along with excellent package-management tools, means we can use off-the-shelf solutions for common problems, leaving us more time to focus on the uncommon ones.

On the topic of community, we found the Rust community to be one of the friendliest on the internet, and its commitment to open source, open communication, and good documentation make it an utter joy to work with. Libraries such as Serde and quickcheck make common tasks a breeze and lets us focus on the code that’s novel to our project, instead of writing boilerplate that's already been written by thousands of others.

We’d like to shout out ZomboDB’s pgx, an SDK for building Postgres extensions using Rust. Pgx provides tools to generate extension scripts from Rust files and bind Rust functions to Postgres functions, as well as tools to set up, run, and test PostgreSQL instances. (For us, it’s been an amazing tool and experience with incredible benefits – we estimate that pgx has reduced our workload by at least one-third!.)

Next steps

In the rest of this post, we detail why we chose to build new SQL functions (not new SQL syntax), and explore each hyperfunction and its example usage.

But if you’d like to get started with hyperfunctions right away, the easiest way to do so is with a fully managed TimescaleDB service. Try it for free (no credit card required) for 30 days. Hyperfunctions are pre-loaded on each new database service on Timescale Forge, so after you’ve created a new service, you’re all set to use them!

If you prefer to manage your own database instances, you can download and install the timescaledb_toolkit extension on GitHub for free, after which you’ll be able to use all the hyperfunctions listed above.

Finally, we love building in public. You can view our upcoming roadmap on GitHub for a list of proposed features, as well as features we’re currently implementing and those that are available to use today.

We also welcome feedback from the community (it helps us prioritize the features users really want). To contribute feedback, comment on an open issue or in a discussion thread in GitHub.

To learn more about hyperfunctions, please continue reading.

Building new SQL functions instead of reinventing syntax

SQL is the third most popular programming language in the world. It’s the language known and loved by many software developers, data scientists, and business analysts the world over, and it's a big reason we chose to build TimescaleDB on top of PostgreSQL in the first place.

Similarly, we choose to make our APIs user-friendly without breaking full SQL compatibility. This means we can create custom functions, aggregates, and procedures but no new syntax - and all the drivers and interfaces can still work. You get the peace of mind that your query will run in every visualization tool, database admin tool, or data analysis tool that speaks SQL.

SQL is powerful and it’s even Turing complete, so you can technically do anything with it. But that doesn’t mean you’d want to 😉. Our hyperfunctions are made to make complex analysis and time-series manipulation in SQL simpler, without undermining the guarantees of full SQL compatibility. We’ve spent a large amount of our time on design; prototyping and just writing out different names and calling conventions for clarity and ease of use.

Our guiding philosophy is to make simple things easy and complex things possible. We enable things that feel like they should be straightforward, like using a single function call to calculate a time-weighted average of a single item over a time period. We also enable operations that would otherwise be prohibitively expensive (in terms of complexity to write) or would previously take too long to respond to be useful. For example, calculating a rolling time-weighted average of each item normalized to the monthly average of the whole group of things.

For example, we’ve implemented a default for percentile approximation called percentile_agg that should work for most users, while also exposing the lower level UDDsketch and tdigest implementations for users who want to have more control and get into the weeds.

Another advantage of using SQL functions rather than new syntax is that we bring your code closer to your data, rather than forcing you to take your data to your code. Simply put, you can now perform more sophisticated analysis and manipulation operations on your data right inside your database, rather than creating data pipelines to funnel data into Python or other analysis libraries to conduct analysis there.

We want to make the more complex analysis simpler and easier in the database not just because we want to build a good product, but also because it’s far, far more efficient to do your analysis as close to the data as possible, and then get aggregated or other simpler results that get passed back to the user.

This is because the network transmission step is often the slowest and most expensive part of many calculations, and because the serialization and deserialization overhead can be very large as you get to large datasets. So by making these functions and all sorts of analysis simpler to perform in the database, nearer to the data, developers save time and money.

Moreover, while you could perform some of the complex analysis enabled by hyperfunctions in other languages inside the database (e.g., programs in Python or R), hyperfunctions now enable you to perform such sophisticated time-series analysis and manipulation in SQL right in your query statements, making them more accessible.

Hyperfunctions released today

Hyperfunctions refactor some of the most gnarly SQL queries for time-series data into concise, elegant functions that feel natural to any developer that knows SQL. Let’s walk through the hyperfunctions we’re releasing today and the ones that will be available soon.

Back in January, when we launched our initial hyperfunctions release, we asked for feedback and input from the community. We want this to be a community-driven project, so for our 1.0 release, we’ve prioritized several features requested by community members. We’ll have a brief overview here, with a technical deep dive into each family of functions in a series of separate blog posts in the coming weeks.

Time-weighted averages

Time-series averages can be complicated to calculate; generally, you need to determine how long each value has been recorded in order to know how much to weigh them. While doing this in native SQL is possible, it is extremely error-prone and unwieldy. More damingly, the SQL needed would not work in every context. In particular, it would not work in TimescaleDB’s automatically refreshing materialized views, continuous aggregates, so users who wanted to calculate time-weighted averages over multiple time intervals would be forced to rescan the entire dataset for each average so calculated. Our time-weighted average hyperfunction removes this complexity and can be used in continuous aggregates to make multi-interval time-weighted averages as cheap as summing a few sub-averages.

Here’s an example of using time-weighted averages for an IoT use case, specifically to find the average temperature in a set of freezers over time. (Notice how it takes sixteen lines of complex code to find the time-weighted average using regular SQL, compared just five lines of code with SELECT statements when using the TimescaleDB hyperfunction):

Time-weighted average using TimescaleDB hyperfunction

SELECT freezer_id, 
	avg(temperature), 
	average(time_weight('Linear', ts, temperature)) as time_weighted_average 
FROM freezer_temps
GROUP BY freezer_id;
 freezer_id |  avg  | time_weighted_average 
------------+-------+-----------------------
          1 | 10.35 |     6.802777777777778

Time-weighted average using regular SQL

WITH setup AS (
	SELECT lag(temperature) OVER (PARTITION BY freezer_id ORDER BY ts) as prev_temp, 
		extract('epoch' FROM ts) as ts_e, 
		extract('epoch' FROM lag(ts) OVER (PARTITION BY freezer_id ORDER BY ts)) as prev_ts_e, 
		* 
	FROM  freezer_temps), 
nextstep AS (
	SELECT CASE WHEN prev_temp is NULL THEN NULL 
		ELSE (prev_temp + temperature) / 2 * (ts_e - prev_ts_e) END as weighted_sum, 
		* 
	FROM setup)
SELECT freezer_id, 
	avg(temperature),
	sum(weighted_sum) / (max(ts_e) - min(ts_e)) as time_weighted_average 
FROM nextstep
GROUP BY freezer_id;
 freezer_id |  avg  | time_weighted_average 
------------+-------+-----------------------
          1 | 10.35 |     6.802777777777778

Percentile approximation (UDDsketch & TDigest)

Aggregate statistics are useful when you know the underlying distribution of your data, but for other cases, they can be misleading. For cases where they don’t work, and for more exploratory analyses looking at the ground truth, percentiles are useful.

As useful as it is, percentile analysis comes with one major downside: it needs to store the entire dataset in memory. This means that such analysis is only feasible for relatively small datasets, and even then can take longer than ideal to calculate.

The approximate-percentile hyperfunctions we’ve implemented suffer from neither of these problems: they take constant storage, and, when combined with automatically refreshing materialized views, they can produce results nearly instantaneously. This performance improvement opens up opportunities to use percentile analysis for use cases and datasets where it was previously unfeasible.

Here’s an example of using percentile approximation for a DevOps use case, where we alert on response times that are over the 95th percentile:

WITH “95th percentile” as (
    SELECT approx_percentile(0.95, percentile_agg(response_time)) as threshold
    FROM response_times
)
SELECT count(*) 
FROM response_times 
AND response_time > “95th percentile”.threshold;

See our hyperfunctions docs to get started today. In the coming weeks, we will be releasing a series of blog posts which detail each of the hyperfunctions released today, in the context of using them to solve a real-world problem.

Hyperfunctions in public preview

In addition to the hyperfunctions released today, we’re making several hyperfunctions available for public preview. These include hyperfunctions for downsampling, smoothing, approximate count-distinct, working with counters, and working with more advanced forms of averaging. All of these are available for trial today through our experimental schema, and, with your feedback, will be made available for production usage soon.

Here’s a tour through each hyperfunction and why we created them:

Graph Downsampling & Smoothing

We have two algorithms implemented to help downsample your data for better, faster graphing:

The first graphing algorithm for downsampling is Largest triangle three bucket (LTTB). LTTB limits the number of points you need to send to your graphing engine while maintaining visual acuity. This means that you don’t try to plot 200,000 points on a graph that’s only 2000 pixels wide, which is inefficient in terms of network and rendering costs.

Given an original dataset which looks like the graph below:

Raw data

We can downsample it to just 34 points with the following query using the LTTB hyperfunction:

SELECT toolkit_experimental.lttb(time, val, 34)

The above query yields the following graph, which retains the periodic pattern of the original graph, with just 34 points of data.

Raw data

The second graphing algorithm for downsampling is Automatic smoothing for attention prioritization (ASAP smoothing). ASAP Smoothing uses optimal moving averages to smooth a graph to remove noise and make sure that trends are obvious to the user, while not over-smoothing and removing all the signals as well. This leads to vastly improved readability.

For example, the graph below displays 250 years of monthly temperature readings from England (raw data can be found here):

Raw data

We can run the following query using the ASAP smoothing hyperfunction:

SELECT toolkit_experimental.asap_smooth(month, value, 800) FROM temperatures

The result is the graph below, which is much less noisy than the original and one where users can more easily spot trends.

Smoothed data

Counter Aggregates

Metrics generally come in a few different varieties, which many systems have come to call gauges and counters. A gauge is a typical metric that can vary up or down, something like temperature or percent utilization. A counter is meant to be monotonically increasing. So it keeps track of, say, the total number of visitors to a website. The main difference in processing counters and gauges is that a decrease in the value of a counter (compared to its previous value in the time-series) is interpreted as a reset.  TimescaleDB’s counter aggregate hyperfunctions enable a simple and optimized analysis of these counters.

For example, despite a dataset being stored like:

data
------
  10
  20
   0
   5
  15

We can calculate the delta (along with various other statistics) over this monotonically-increasing counter with the following query using the counter aggregate hyperfunction:

SELECT toolkit_experimental.delta(
    toolkit_experimental.counter_agg(ts, val))
FROM foo;
 delta
------
  40

Hyperloglog for Approximate Count Distinct

We’ve implemented a version of the hyperloglog algorithm to do approximate count distinct queries over data in a more efficient and parallelizable fashion. For existing TimescaleDB users, you’d be happy to hear that they work in continuous aggregates, which are automatically refreshing materialized views, as well as on multi-node deployments of TimescaleDB.

Statistical Aggregates

Calculating rolling averages and other statistical aggregates over tumbling windows is very difficult in standard SQL because to do it accurately you’d need to separate out the different components (ie for average, count and sum) and then calculate it yourself. Our statistical aggregates allow you to simply do this, with simple rollup.

To follow the progress and contribute to improving these (and future) hyperfunctions, you can view our roadmap on GitHub. Our development process is heavily influenced by community feedback, so your comments on issues and discussion threads will help determine which features get prioritized, and when they’re stabilized for release.

Next Steps

Try hyperfunctions today with a fully-managed Timescale Forge service (no credit card required, free for 30 days). Hyperfunctions are pre-loaded on each new database service on Timescale Forge, so after you’ve created a new service, you’re all set to use them!

If you prefer to manage your own database instances, you can download and install the timescaledb_toolkit extension on GitHub for free, after which you’ll be able to use all the hyperfunctions listed above.

We love building in public. You can view our upcoming roadmap on GitHub for a list of proposed features, as well as features we’re currently implementing and those that are available to use today. We also welcome feedback from the community (it helps us prioritize the features users really want). To contribute feedback, comment on an open issue or in a discussion thread in GitHub.

This post was a collaboration between
13 min read
Always Be Launching
Contributors

Related posts

TimescaleDB - Timeseries database for PostgreSQL

Explore TimescaleDB

Learn more about how TimescaleDB works, compare versions, and get technical guidance and tutorials.

Go to docs Go to products