A primer on time-series data, what it is, where to store it, and how to analyze it to gain powerful insights.
(Note: this post was originally published in November 2018, and republished in December 2020 with updated graphs, new trends, and relevant technical information.)
Here’s a riddle: what do self-driving Teslas, autonomous Wall Street trading algorithms, smart homes, transportation networks that fulfill lightning-fast same-day deliveries, and tracking the daily COVID-19 statistics and air quality in your community have in common?
For one, they are signs that our world is changing at warp speed, thanks to our ability to capture and analyze more and more data faster than ever before.
However, if you look closely, you’ll notice that each of these applications requires a special kind of data:
- Self-driving cars continuously collect data about how their environment is changing, adjusting based on weather conditions, potholes, and countless other variables.
- Autonomous trading algorithms continuously collect data on how the markets are changing to optimize returns, both in the short and long-term. (Read how financial applications like TransferWise and an automated crypto trading bot collect, store, and analyze data.)
- Our smart homes monitor what’s going on inside of them to regulate temperature, identify intruders, and respond to our every beck-and-call (“Alexa, play some relaxing music”).
- Our retail industry monitors how their assets move with such precision and efficiency that cheap same-day delivery is a luxury that many of us take for granted.
But, far from stock trends, self-driving cars, and knowing the exact minute your next online purchase will arrive, 2020 has provided the most personal example of how time-series data collection and analysis affects our daily lives.
For the first time in history, worldwide interest in time-series data has peaked in the most unexpected way. COVID-19 and the global pandemic have made billions of people across the globe relentless consumers of time-series data, demanding accurate and timely information about the daily trend of various COVID-19 statistics.
Having access to detailed, feature rich time-series data has become one of the most valuable commodities in our information-hungry world. Businesses, governments, schools, and communities, large and small, are finding invaluable ways to mine value from analyzing time-series data. (You can read how some real-world teams, like those tracking real-time flight data or building platforms for sustainable farming mine their time-series metrics in our Developer Q&A series)
Software developer usage patterns already reflect the same trend. In fact, over the past two years, time-series databases (TSDBs) have steadily remained the fastest growing category of databases:
As the developers of an open-source time-series database, my team and I are often asked about this trend and how it should factor into your decisions about which database to select. Specifically, does it really matter if you start with a database specialized for time-series data – or can you easily transition to one later?
To answer those questions, let me start with a more in-depth description of what time-series data is and how you might benefit from using a time-series database, and leave you with a few ways to start exploring time-series data and performing your own analysis.
What is time-series data?
Time-series data is a sequence of data points collected over time intervals, giving us the ability to track changes over time. Time-series data can track changes over milliseconds, days, or even years.
In the past, our view of time-series data was more static; the daily highs and lows in temperature, the opening and closing value of the stock market, or even the daily or cumulative hospitalizations due to COVID-19.
Unfortunately, these totals missed the nuances of how the underlying changes over time contributed to these static values.
Let’s consider a few examples.
If I send you $10, a traditional bank database would debit my account and credit your account. Then, if you send me $10, the same process happens in reverse. At the end of this process, our bank balances would look the same, so the bank might think, “Oh, nothing changed this month.” But, with a time-series database, the bank would see, “Hey, these two people keep sending each other $10, there’s likely a deeper relationship here.” Tracking this nuance, our month-ending account balance takes on greater meaning.
Next, think about an environmental value like mean daily temperature (MDT), the average of the high and low temperature, for consecutive days at a location. Over the last few decades, MDT has been used as a primary variable to calculate buildings’ energy efficiency. In any given week, MDT might only vary slightly from day-to-day in a location, but the contributing environmental factors could be changing drastically over that same period. Instead, knowing how the temperature changed each hour throughout the day, coupled with precipitation, cloud cover, and wind speed during that time, could dramatically improve your ability to model and optimize energy efficiency for your properties.
Likewise, while knowing the total number of COVID-19 hospitalizations per day in your community is valuable, that number alone isn’t very descriptive. For instance, the hospital might disclose daily numbers that show 20 hospitalizations on Monday and increase slightly throughout the week to total 23 hospitalizations on Friday. At first glance, it looks like a 15% increase in hospitalizations this week – but if we add detail to each of those records (and increase the frequency at which we collect them), we might see that it was a net increase of 3 patients, but in reality there were 10 people discharged and 13 admitted, an increase of 65% for new admissions over the last 5 days.
Tracking each aspect of patient data over time (e.g., patient age, admitted or discharged, days to recovery, etc.) helps us understand how we arrive at the daily counts, allowing us to better analyze trends, accurately report totals, and take action. In the case of total COVID-19 hospitalizations, the details behind this analysis impact public policy in the cities and towns where we live.
These examples illustrate how modern time-series data is different from what we’ve known in the past. Time-series data analysis goes far deeper than a pie chart or Excel workbook with columns of summarized totals.
This detailed data doesn’t just include time as a metric, but as a primary component that helps to analyze our data and derive meaningful insights.
And, there are many other kinds of time-series data, but regardless of the scenario or use case, all time-series datasets have 3 things in common:
- The data that arrives is almost always recorded as a new entry
- The data typically arrives in time order
- Time is a primary axis (time-intervals can be either regular or irregular)
In other words, time-series data workloads are generally “append-only.” While they may need to correct erroneous data after the fact, or handle delayed or out-of-order data, these are exceptions, not the norm.
But, I already track a timestamp
You may ask: How is this different than just having a time-field in a dataset? Well, it depends: how does your dataset track changes? By updating the current entry, or by inserting a new one?
When you collect a new reading for sensor_x, do you overwrite your previous reading, or do you create a brand new reading in a separate row? While both methods will provide the current state of the system, you can only analyze the changes in state over time if you insert a new reading each time.
Simply put: time-series datasets track changes to the overall system as INSERTs, not UPDATEs.
This practice of recording each and every change to the system as a new, different row is what makes time-series data so powerful. It allows us to measure and analyze change: what has changed in the past, what is changing in the present, and what can we forecast changes may look like in the future.
In short, here’s how I like to define time-series data: a collection of values that represents how a system/process/behavior changes over time.
This is more than just an academic distinction. By centering our definition around “change,” we can identify time-series datasets that we aren’t collecting today and identify opportunities to start collecting that data now, so that we can harness its value later. All too often, people have time-series data but don’t realize it.
Time-series data hiding in plain sight?
Can you think of some common examples of time-series data in your day-to-day work? Are there reports or analysis you’ve been asked to help create, but lacked the data fidelity to do so?
Imagine you maintain a web application. Every time a user logs in, you may just update a “last_login” timestamp for that user in a single row in your “users” table. But, what if you treated each login as a separate event, and collected them over time? With that kind of time-series data you could analyze historical login activity, see how usage is (in-/de-)creasing over time, bucket users by how often they access the app, and more.
Another example has become vital to every IT group around the world: operational metrics for servers, networks, applications, environments, and more. This kind of time-series metric data is crucial to keeping the services that we rely on, running without interruption. By tracking the changes in each metric, IT departments can quickly identify problems, plan for capacity increases during upcoming events, and diagnose if an application update resulted in changed user behavior, for better or worse. (See how LAIKA uses Timescale to track resource consumption and plan for future needs.)
These examples illustrate a key point: preserving the inherent time-series nature of our data allows us preserve valuable information about how that data changes over time. You may also notice that both of these examples describe a common type of time-series data known as event data.
Of course, storing data at this resolution comes with an obvious problem: you end up with a lot of data, rather fast. So that’s the catch: being able to analyze increased amounts of time-series data is more valuable than ever, but it piles up very quickly.
Having a lot of data creates a different set of problems, both when recording it and when trying to query it in a performant way, which is why people are turning to time-series databases in greater numbers than ever before. The world is demanding that we make better data-driven decisions, faster. The static snapshots found in traditional data won’t cut it. To satisfy the demand, you need to be collecting data at the highest fidelity possible – and that’s what time-series data provides: the dynamic movie of what’s happening across your system (whether it’s your software, your physical power plant, your game, or customers inside your application).
Why do I need a time-series database?
You might ask: Why can’t I just use a “normal” (i.e., non-time-series) database?
The truth is that you can, and some people do. But, there’s at least two reasons why TSDBs are the fastest growing category of databases today: scale and usability.
Scale: Time-series data accumulates very quickly, and normal databases are not designed to handle that scale (at least not in an automated way). Traditionally, relational databases fare poorly with very large datasets, while NoSQL databases are better at scale (although a relational database fine-tuned for time-series data can actually perform better, as we’ve shown in benchmarks versus InfluxDB, versus Cassandra, and versus MongoDB). In contrast, time-series databases - whether they’re relational or NoSQL-based - introduce efficiencies that are only possible when you treat time as a first-class citizen. These efficiencies allow them to offer massive scale, from performance improvements, including higher ingest rates and faster queries at scale (although some support more queries than others) to better data compression.
Usability: TSDBs also typically include built-in functions and operations common to time-series data analysis, such as data retention policies, continuous queries, flexible time aggregations, etc. Even if you’re just starting to collect this type of data and scale is not a concern at the moment, these features can still provide a better user experience and make data analysis tasks easier. Having built-in functions and features to analyze trends readily available at the data-layer often leads you to discover opportunities you didn’t know existed, no matter how big or small your dataset
This is why developers are increasingly adopting time-series databases and using them for a variety of use cases:
- Monitoring software systems: Virtual machines, containers, services, applications
- Monitoring physical systems: Equipment, machinery, connected devices, the environment, our homes, our bodies
- Asset tracking applications: Vehicles, trucks, physical containers, pallets
- Financial trading systems: Classic securities, newer cryptocurrencies
- Eventing applications: Tracking user/customer interaction data
- Business intelligence tools: Tracking key metrics and the overall health of the business
- (and more)
Once you begin to see more of the information your applications store as time-series data, you still have to pick a time-series database that best fits your data model, write/read pattern, and developer skill sets. Although NoSQL time-series database options have prevailed for the past decade as the storage medium of choice, more and more developers are seeing the downside to storing time-series data separately from business data (most time-series databases don’t provide good support for relational data). In fact, this poor developer experience was one of the driving factors in why we started Timescale. Keeping all of your data in one system can drastically reduce application development time – and the speed at which you can make key decisions.
Nowhere is this more evident than with the rise of numerous self-service business intelligence tools like Tableau, Power BI, and yes, even Excel. When precious time-series data is kept separate from business data, users struggle to make timely, business critical observations. Instead, users find that they need to rely on these third-party tools to mash-up data into something meaningful. There are many valid and good reasons to use these powerful tools, but being able to quickly query your time-series data alongside meaningful metadata information shouldn’t be one of them. SQL has been built and honed over decades to provide efficient ways of generating these valuable aggregations and analysis.
The bottom line: knowing where your time-series data is and where you store it can have a dramatic impact on your future success.
Is all data time-series data?
For the past decade or so, we have lived in the era of “Big Data,” to the point where it’s almost reached buzzword status; organizations of all sizes and types collect massive amounts of information about our world and apply computational resources to make sense of it.
Even though this era started with modest computing technology, our ability to capture, store, and analyze data has improved at an exponential pace, thanks to major macro-trends: Moore’s law, Kryder’s law, cloud computing, an entire industry of “big data” technologies.
We are no longer content to just observe the state of the world. Now, we need to measure how our world changes over time, down to sub-second intervals. Our “Big Data” datasets are now being dwarfed by another type of data, one that relies heavily on time to preserve information about the change that is happening.
Does all data start off as time-series data? Recall the earlier web application example: we had time-series data, but didn’t realize it: tracking user activity that would help you analyze engagement. Or think of any “normal” dataset. Say, the current accounts and balances at a major retail bank. Or the source code for a software project. Or the text for this article.
Typically we choose to store the latest state of the system, but instead, what if we stored every change and computed the latest state at query time? Isn’t a “normal” dataset just a view on top of an inherently time-series dataset (cached for performance reasons)? Don’t banks have transaction ledgers? (And aren’t blockchains just distributed, immutable time-series logs?) Doesn’t a software project have version control (e.g., git commits)? Doesn’t this article have revision history? (Undo. Redo.)
Put differently: Don’t all databases have logs?
We recognize that many applications may never require time-series data (and would be better served by a “current-state view”). But as we continue along the exponential curve of technological progress, it would seem that these “current-state views” become less necessary. Instead we’re finding that storing more and more data in its time-series form often helps us to understand it better.
So is all data time-series data? I’ve yet to find a good counter example. If you’ve got one, I’m open to hearing it. Regardless, one thing is clear: time-series data already surrounds us. It’s time we put it to use.
Mining for treasure with time-series analysis
Hopefully by now your wheels are turning and you’ve started to identify applications or areas in your business that have time-series data just waiting for you to do something with it. So, now what?
This is when the fun (and real work) begins. It’s also when you’ll really see why time-series databases are essential tools.
Let’s look at an example based on the fictional web application we’ve referenced throughout this post. As we discussed, until now we’ve only tracked the last time a user logged in as a field in the “users” table and always update the previously stored value with the new login information. While this allows us to query how many people have logged in over a week or a month, we’re unable to analyze how often they log in, for how long, or drill into any other aspects that might tell us more about our users’ experience or their usage patterns.
We can quickly improve upon this by tracking information about every login, not just the most recent one. To do this, we’ll start logging the timestamp of each login and the type of device used to access our application (e.g., phone, tablet, desktop). This small change - tracking just one more property about the user login experience - provides immediate value, allowing us to answer questions like, “what kind of devices are most frequently used (by individual users and across all users)?” and “what time of day are users the most active?”. From there, we can better inform the features we prioritize - such as mobile-specific capabilities -, the times we display certain promotional messages, and beyond.
To track this new data, we add a new table called “user_logins” that references our “users” table. Here’s an example of what the data might look like:
With the updated data model and these new user details logged, we can start to query the data for insights. As mentioned earlier, time-series databases like TimescaleDB help with this kind of information in two crucial ways:
First, as your application scales and data volume grows, your database is built to handle and ingest the relentless stream of data inherent to time-series workloads, mitigating any negative performance impacts or lags.
Second, they provide specialized functions that make it easier - and faster - to query aspects of your data in meaningful ways where time is a primary component.
To demonstrate some of those specialized time-series analysis capabilities, let’s look at a few example functions that TimescaleDB adds to the SQL language – and how we can use them to better analyze our users’ usage behavioral patterns. (For more examples, see our advanced analytical functions documentation.)
In each example, we’re still relying on standard SQL patterns, a language that many developers are familiar with, and augmenting it for time-series use cases. WHERE clauses still work, and we can still aggregate data easily with GROUP BY clauses. But now, rather than having to parse out specific parts of the dates in order to group the data appropriately (for instance), we can use a function like time_bucket() to easily aggregate data across almost any interval.
And, as a bonus, it also makes the query easier to read!
Query #1: How many logins per day for the last month?
SELECT time_bucket(‘1 day’, login_timestamp) as one_day COUNT(*) total_logins FROM user_logins WHERE login_timestamp > now() - INTERVAL ‘1 month’ GROUP BY one_day ORDER BY one_day;
This first example is the “Hello, World!” of time-series queries, using the time_bucket() function to automatically group and aggregate our time-series data to help us get a quick view of total daily logins (‘1 day’ in the function above) for the last month (‘WHERE login_timestamp > now() - INTERVAL ‘1 month’). Notice that time-series queries allow you to specifically query intervals of time rather than breaking down dates into each component (month, day, year, hour, etc.) to do a similar aggregation without these specialized functions.
Query #2: What was the last login time of each user and what type of device did they use?
SELECT user_id, first_name || ‘ ‘ || last_name AS full_name, last(login_timestamp, login_timestamp) AS last_login, last(device_type, login_timestamp) AS last_device_type FROM user_logins ul INNER JOIN users u on ul.user_id = u.user_id WHERE login_timestamp > now() - INTERVAL ‘1 month’ GROUP BY user_id, full_name ORDER BY user_id;
In this more complex example, we use another specialized function, last(), to query useful information about our users, specifically the most recent value of a specific set of data. Without a specialized function like last(), we would need to write a query with something like a LATERAL JOIN or a correlated subquery. But, with our handy built-in specialized function, we’re able to get this type of valuable information in a straightforward (and often very quick) way.
Query #3: For the last week, which 6 hour periods saw the most log-ins from users on tablet devices?
SELECT time_bucket(‘6 hours’, login_timestamp, timestamptz ‘2020-01-01 08:00:00’) as device_bucket, device_type, count(*) AS logins_by_device, FROM user_logins WHERE login_timestamp > now() - INTERVAL ‘1 week AND device_type = ‘tablet’ GROUP BY device_bucket, device_type ORDER BY logins_by_device desc;
In this final example query, we demonstrate how functions like `time_bucket()` aren’t bound to common intervals (‘1 hour’, ‘1 day’, ‘1 week’, etc.), but can be used for INTERVAL grouping. And, more notably, we can combine these functions with parameters that allow us to refine our results to a specific subset. In this case, we asked TimescaleDB to return results in 6 hour buckets, aligning the first bucket to 8 AM UTC, and only return log-ins from tablet-based sessions.
These examples just scratch the surface; you have infinite flexibility in how your data can be queried and modeled.
In summary, logging just two additional details about user logins - device type and timestamps for every log-in, not just the latest - quickly transforms our ability to understand how our web application is used – and how time-series databases like TimescaleDB help us analyze and make sense of data, so we can make decisions faster.
Now, it’s your turn: resources to get started
If you’re convinced you need a time-series database, or just want to try it out for yourself, spin up a fully-managed TimescaleDB instance - free for 30 days.
From there, follow our intro tutorial to configure your database and execute your first query, then tackle more advanced time-series analysis with our cryptocurrency and time-series forecasting tutorials.
Have questions or want to learn more? Join our Slack community, where you’ll find myself, Timescale engineers, and community members active in all channels.