How to create (lots!) of sample time-series data with PostgreSQL generate_series()
As the makers of TimescaleDB, we often need to quickly create lots of sample time-series data to demonstrate a new database feature, run a benchmark, or talk about use cases internally. We also see users in our community Slack asking how they can test a feature or decide if their data is a good fit for TimescaleDB.
Although using real data from your current application would be great (and ideal), knowing how to quickly create a representative time-series dataset using varying cardinalities and different lengths of time, is a helpful - and advantageous - skill to have.
Fortunately PostgreSQL provides a built-in function to help us create this sample data using the SQL that we already know and love - no external tools required.
In this three-part blog series, we'll go through a few ways to use the
generate_series() function to create large datasets in PostgreSQL, including:
- What PostgreSQL
generate_series()is and how to use it for basic data generation (similar to our tutorial to Simulate IoT Data)
- How to create more realistic looking time-series data with custom PostgreSQL functions
- Ways to create complex time-series data, using additional PostgreSQL math functions and JOINs.
By the end of this series, you'll be ready to test almost any PostgreSQL or TimescaleDB feature, create quick datasets for general testing, meetup presentations, demos, and more!
Intro to PostgreSQL generate_series()
generate_series() is a built-in PostgreSQL function that makes it easy to create ordered tables of numbers or dates. The PostgreSQL documentation calls it a Set Returning Function because it can return more than one row.
The function is simple and easy to get started with, taking at least two required arguments to specify the start and stop parameters for the generated data.
SELECT * FROM generate_series(1,5);
This will produce output that looks like this:
generate_series| ---------------| 1| 2| 3| 4| 5|
From this first example, we can see that
generate_series() returns sequential numbers between a start parameter and stop parameter. When used to generate numeric data,
generate_series() will increment the values by 1. However, there is an optional third parameter that can be used to specify the increment length, known as the step parameter.
For example, if we wanted to generate rows that counted by two from 0 to 10, we could use this SQL instead:
SELECT * from generate_series(0,10,2); generate_series ----------------- 0 2 4 6 8 10
PostgreSQL generate_series() with dates
generate_series() to produce a range of dates is equally straightforward. The only difference is that the third parameter, the step INTERVAL used to increment the date, is required for date generation, as shown here:
SELECT * from generate_series( '2021-01-01', '2021-01-02', INTERVAL '1 hour' ); generate_series ------------------------ 2021-01-01 00:00:00+00 2021-01-01 01:00:00+00 2021-01-01 02:00:00+00 2021-01-01 03:00:00+00 2021-01-01 04:00:00+00 2021-01-01 05:00:00+00 2021-01-01 06:00:00+00 2021-01-01 07:00:00+00 2021-01-01 08:00:00+00 2021-01-01 09:00:00+00 2021-01-01 10:00:00+00 2021-01-01 11:00:00+00 2021-01-01 12:00:00+00 2021-01-01 13:00:00+00 2021-01-01 14:00:00+00 2021-01-01 15:00:00+00 2021-01-01 16:00:00+00 2021-01-01 17:00:00+00 2021-01-01 18:00:00+00 2021-01-01 19:00:00+00 2021-01-01 20:00:00+00 2021-01-01 21:00:00+00 2021-01-01 22:00:00+00 2021-01-01 23:00:00+00 2021-01-02 00:00:00+00 (25 rows)
Notice that the returned dates are inclusive of the start and stop values, just as we saw with the numeric example before. The reason we got 25 rows (representing 25 hours rather than 24 as you might expect) is that the stop value can be reached using the equal one-hour INTERVAL (the step parameter). As long as the INTERVAL can increment evenly up to the stop date, it will be included.
However, if the step interval resulted in the stop value being skipped over, it will not be included in your output. For example, if we modify the step INTERVAL above to '1 hour 25 minutes', the result only returns 17 rows, the last of which is before the stop value.
SELECT * from generate_series( '2021-01-01','2021-01-02', INTERVAL '1 hour 25 minutes' ); generate_series ------------------------ 2021-01-01 00:00:00+00 2021-01-01 01:25:00+00 2021-01-01 02:50:00+00 2021-01-01 04:15:00+00 2021-01-01 05:40:00+00 2021-01-01 07:05:00+00 2021-01-01 08:30:00+00 2021-01-01 09:55:00+00 2021-01-01 11:20:00+00 2021-01-01 12:45:00+00 2021-01-01 14:10:00+00 2021-01-01 15:35:00+00 2021-01-01 17:00:00+00 2021-01-01 18:25:00+00 2021-01-01 19:50:00+00 2021-01-01 21:15:00+00 2021-01-01 22:40:00+00 (17 rows)
How to generate time-series data
Now that we understand how to use
generate_series(), how do we create some time-series data to insert into TimescaleDB for testing and visualization?
To do this, we utilize a standard feature of relational databases and SQL. Recall that
generate_series() is a Set Returning Function that returns a "table" of data (a set) just as if we had selected it from a table. Therefore, just like when we select data from a regular table with SQL, we can add more columns of data using other functions or static values.
You may have done this at some point with a string of text that needed to be repeated for each returned row.
SELECT 'Hello Timescale!' as myStr, * FROM generate_series(1,5); myStr | generate_series ------------------+----------------- Hello Timescale! | 1 Hello Timescale! | 2 Hello Timescale! | 3 Hello Timescale! | 4 Hello Timescale! | 5 (5 rows)
In this example, we simply added data to the rows being returned from
generate_series(). For every row it returned, we added a column with some static text.
But, these added columns don't have to be static data! Using the built-in
random() function (for example), we can generate data that starts to look a little more like data we'd see when monitoring computer CPU values (i.e., realistic data to use for our demos and tests!).
SELECT random()*100 as CPU, * FROM generate_series(1,5); cpu | generate_series --------------------+----------------- 48.905450626783775 | 1 71.94031820213382 | 2 25.210553719011486 | 3 19.24163308357194 | 4 8.434915599133674 | 5 (5 rows)
In this example,
generate_series() produced 5 rows of data and for every row, PostgreSQL also executed the
random() function to produce a value. With a little creativity, this can become a pretty efficient method for generating lots of data quickly.
Performance considerations when generating sample data
Before getting too deep into the many ways we can use various functions with
generate_series() to create more realistic data, let's tackle one potential downside with functions before your mind starts thinking about a myriad of ways to produce interesting, sample data.
Functions are a powerful database tool, allowing complex code to be hidden behind a standard interface. In the example above, I don't have to know how
random() produces a value. I just have to call it in my SQL statement (without arguments) and a random value is returned for every row.
Unfortunately, functions can slow down your query and use lots of resources if you're not careful. This is true any time you call functions in SQL, not just when creating sample data.
The reason for this inefficiency is that scalar functions are executed once for each column and row in which they are used. So, if you produce a set of 1,000 rows with
generate_series() and then add on 5 additional columns with data generated by functions, PostgreSQL has to effectively process 5,001 function calls, once to generate the initial series of data which is returned as a set, and 5 times that for each row because of the additional 5 columns.
The main issue is that PostgreSQL has no idea how to determine how much "work" it will take to generate the result from each function in the query. Something like
random() required very minimal effort, so calling it 5,000 or even 1 million times won't break the bank for most machines. However, if the function you're calling generates lots of temporary data internally before producing a result, be aware that it could perform more slowly and consume more resources on your database server.
The takeaway here is that nothing comes for free, especially when functions are called once for every column, for every row. Most of the time the data you generate will easily complete in a handful of seconds for tens of millions of rows. But, if you notice a query that generates data taking a long time, consider finding alternative ways to generate data for the columns that require more resources.
There is a second caveat to be aware of when using a tool like
generate_series(), at least as demonstrated throughout this series. All data is created and inserted as a single transaction. You can put in more time and effort to create functions and methods for breaking up the work, but understand if you create (SELECT) one large set of data with 100 million rows, it's likely to consume a lot of memory and CPU on the server while the process occurs.
This doesn't mean it's broken or any less valid of a method for generating data. However, as we demonstrate additional ways to create more realistic data in parts 2 and 3 of this series, recognize that the more data you create, the more resources you'll need from the server without managing batches and transactions with a more advanced setup.
Alright, let's get back to the fun!
Tips for quickly increasing sample dataset scale
So far we've looked at how
generate_series() works and how you can use functions to add additional dynamic content to the output. But how do we quickly get to the scale of tens of millions of rows (or more)?
This is where the concept of a Cartesian product comes into play with databases. A Cartesian product (otherwise known as a CROSS JOIN) takes two or more result sets (rows from multiple tables or functions) and produces a new result set that contains rows equal to the count of the first set multiplied by the count of the second set.
In doing so, the database outputs every row from the first table with the value of the first row from the second table. It does this over and over until all rows in both tables have been iterated. (and yes, if you join more than two tables this way, the process just keeps multiplying, tables processed left to right)
Let's look at a small example using two
generate_series() in the same select statement.
SELECT * from generate_series(1,10) a, generate_series(1,2) b; a |b| --+-+ 1|1| 2|1| 3|1| 4|1| 5|1| 6|1| 7|1| 8|1| 9|1| 10|1| 1|2| 2|2| 3|2| 4|2| 5|2| 6|2| 7|2| 8|2| 9|2| 10|2|
As you can see, PostgreSQL generated 10 rows for the first series and 2 rows for the second series. After processing and iterating over each table consecutively, the query produced a total of 20 rows (10 x 2). For generating lots of data quickly, this is about as easy as it gets!
The same process applies when using
generate_series() to return a set of dates. If you (effectively) CROSS JOIN multiple sets (numbers or dates), the total number of rows in the final set will be a product of all sets. Again, for generating sample time-series data, you'd be hard-pressed to find an easier method!
In this example, we generate 12 timestamps an hour apart, a random value representing CPU usage, and then a second series of four values that represent IDs for fake devices. This should produce 48 rows (eg. 12 timestamps x 4 device IDs = 48 rows).
SELECT time, device_id, random()*100 as cpu_usage FROM generate_series( '2021-01-01 00:00:00', '2021-01-01 11:00:00', INTERVAL '1 hour' ) as time, generate_series(1,4) device_id; time |device_id|cpu_usage | -------------------+---------+-------------------+ 2021-01-01 00:00:00| 1|0.35415126479989567| 2021-01-01 01:00:00| 1| 14.013393572770028| 2021-01-01 02:00:00| 1| 88.5015939122006| 2021-01-01 03:00:00| 1| 97.49037810105996| 2021-01-01 04:00:00| 1| 50.22781125586846| 2021-01-01 05:00:00| 1| 77.93431470586931| 2021-01-01 06:00:00| 1| 45.73481750582076| 2021-01-01 07:00:00| 1| 70.7999843735724| 2021-01-01 08:00:00| 1| 4.72949831884506| 2021-01-01 09:00:00| 1| 85.29122113229981| 2021-01-01 10:00:00| 1| 14.539664281598874| 2021-01-01 11:00:00| 1| 45.95244258556228| 2021-01-01 00:00:00| 2| 46.41196423062297| 2021-01-01 01:00:00| 2| 74.39903569177027| 2021-01-01 02:00:00| 2| 85.44087332221935| 2021-01-01 03:00:00| 2| 4.329394730750735| 2021-01-01 04:00:00| 2| 54.645873866589056| 2021-01-01 05:00:00| 2| 6.544334492894777| 2021-01-01 06:00:00| 2| 39.05071228953645| 2021-01-01 07:00:00| 2| 71.07264365438404| 2021-01-01 08:00:00| 2| 72.4732704336219| 2021-01-01 09:00:00| 2| 34.533280927542975| 2021-01-01 10:00:00| 2| 26.764760864598003| 2021-01-01 11:00:00| 2| 62.32048879645227| 2021-01-01 00:00:00| 3| 63.01888063314749| 2021-01-01 01:00:00| 3| 21.70606884856987| 2021-01-01 02:00:00| 3| 32.47610779097485| 2021-01-01 03:00:00| 3| 47.565982341726354| 2021-01-01 04:00:00| 3| 64.34867263419619| 2021-01-01 05:00:00| 3| 57.74424991855476| 2021-01-01 06:00:00| 3| 55.593286571750156| 2021-01-01 07:00:00| 3| 36.92650110894995| 2021-01-01 08:00:00| 3| 53.166926049881624| 2021-01-01 09:00:00| 3| 10.009505806123897| 2021-01-01 10:00:00| 3| 58.067700285561585| 2021-01-01 11:00:00| 3| 81.58883725078034| 2021-01-01 00:00:00| 4| 78.1768041898232| 2021-01-01 01:00:00| 4| 84.51505102850199| 2021-01-01 02:00:00| 4| 24.029611792753514| 2021-01-01 03:00:00| 4| 17.08996115345549| 2021-01-01 04:00:00| 4| 29.642690955760997| 2021-01-01 05:00:00| 4| 90.83844806413275| 2021-01-01 06:00:00| 4| 6.5019080489854275| 2021-01-01 07:00:00| 4| 32.336484070672| 2021-01-01 08:00:00| 4| 55.595524107963| 2021-01-01 09:00:00| 4| 97.5442141375293| 2021-01-01 10:00:00| 4| 37.0741925805568| 2021-01-01 11:00:00| 4| 19.093927249791776|
Choosing a date range
Now it's time to put all of the features and concepts together. We've seen how to use
generate_series() to create a sample table of data (both numbers and dates), add static and dynamic content to each row, and finally how to join multiple sets of data together to create a deterministic number of rows to create test data.
The final piece of the puzzle is to figure out how to create date ranges that are more dynamic using date math. Once again, PostgreSQL makes this straightforward because date math is handled automatically.
When generating sample data you usually have an idea of the duration of time you want to generate - 1 month, 6 months, or a year - for instance. But calculating the exact start and end timestamps for your use case can get pretty tedious.
Instead, it's often easier to use
now() or a static ending timestamp (ie. '2021-06-01 00:00:00'), and then using date math to get the starting timestamp based on your chosen interval. This makes it very easy to generate data for different durations simply by changing the interval. And to be clear, you can go the other direction (picking a start timestamp and adding time), but that can often lead to data in the future.
Let's look at three examples to demonstrate ways of creating dynamic date ranges.
Create 6 months of data with one-hour intervals, ending now()
SELECT time, device_id, random()*100 as cpu_usage FROM generate_series( now() - INTERVAL '6 months', now(), INTERVAL '1 hour' ) as time, generate_series(1,4) device_id;
Notice that we use the PostgreSQL function
now() to automatically choose the ending timestamp, and then use date math (- INTERVAL '6 months') to let PostgreSQL find the starting timestamp for us. With almost no effort we can easily generate weeks, months, or years of time-series data by changing the INTERVAL we subtract from
Create 1 year of data with one-hour intervals, ending on a timestamp
SELECT time, device_id, random()*100 as cpu_usage FROM generate_series( '2021-08-01 00:00:00' - INTERVAL '6 months', '2021-08-01 00:00:00', INTERVAL '1 hour' ) as time, generate_series(1,4) device_id;
This example is the same as the first, but here we specify the timestamp that we want to end on. PostgreSQL can still do the date math for us (subtracting 6 months in this example), but we get control over the exact ending timestamp to use. This is particularly useful when you want the actual time portion of the timestamp to begin and end on even, rounded hours, minutes, or seconds. When we use
now(), the timestamp can produce (what feels like) random timestamp causing all of the time-series data to have timestamps that are increments of
Create 1 year of data with one-hour intervals, beginning on a timestamp
For this last example, we're going to specify the start timestamp instead. This can be useful when you need to test a specific fiscal period or maybe some logic that deals with the turning of each year. In this case, maybe you just need a month of data but it needs to cross over from one year to the next. In that case, trying to figure out the math of how far back to start and how far to go forward might be more complicated than you want to worry about.
SELECT time, device_id, random()*100 as cpu_usage FROM generate_series( '2020-12-15 00:00:00', '2020-12-15 00:00:00' + INTERVAL '2 months', INTERVAL '1 hour' ) as time, generate_series(1,4) device_id;
Each of these examples uses date math to help you find the appropriate start and end timestamps so that you can easily adjust to create more or less data while still fitting the range profile that you need.
Speaking of calculating how many rows you want to generate… 😉
Calculating total rows
We now have all the tools we need to create datasets of almost any size. For time-series data specifically, it's a straightforward calculation to figure out how many rows your query will generate.
Total Rows = Readings per hour * Total hours * Number of "things" being tracked
Using this formula, we can quickly determine how many rows will be created by changing any combination of the total range (start/end timestamps), how many readings per hour for each item, or the total number of items. Let's look at a couple of examples to get an idea of how quickly you could create many rows of time-series data for your testing scenario.
|Range of readings||Length of interval||Number of "devices"||Total rows|
|1 year||10 minutes||100||5,256,000|
|6 months||5 minutes||1,000||52,560,000|
As you can see, the numbers start to add up very quickly.
In this first post, we've demonstrated that it's pretty easy to generate lots of data using the PostgreSQL
generate_series() function. We also learned that when you select multiple sets (using
generate_series() or selecting from tables and functions), PostgreSQL will produce what's known as a Cartesian product, the selection of all rows from all tables - the product of all rows. With this knowledge, we can quickly create large and diverse datasets to test various features of PostgreSQL and TimescaleDB.
For more on how to create, find, and use demo data in tutorials, blog posts, and more, please watch my Creating sample data video:
In part 2 of this series, we'll demonstrate how to use custom user-defined functions to create more realistic-looking data to use for testing, including generated text, numbers constrained by a range, and even fake JSON data.
If you have questions about using
generate_series() or have any questions about TimescaleDB, please join our community Slack channel where you'll find an active community and a handful of the Timescale team most days.
If you want to try creating larger sets of sample time-series data using
generate_series() and see how the exciting features of TimescaleDB work, sign up for a free 30-day trial or install and manage it on your instances. (You can also learn more by following one of our many tutorials.)
See what's possible when you join relational data in PostgreSQL with time-series superpowers!