A primer on time-series data, and why you may not want to use a “normal” database to store it.
Here’s a riddle: what do self-driving Teslas, autonomous Wall Street trading algorithms, smart homes, transportation networks that fulfill lightning-fast same-day deliveries, and an open-data-publishing NYPD have in common?
For one, they are signs that our world is changing at warp speed, thanks to our ability to capture and analyze more and more data in faster and faster ways than before.
However, if you look closely, you’ll notice that each of these applications requires a special kind of data:
- Self-driving cars continuously collect data about how their local environment is changing around them.
- Autonomous trading algorithms continuously collect data on how the markets are changing.
- Our smart homes monitor what’s going on inside of them to regulate temperature, identify intruders, and respond to our beck and call (“Alexa, play some relaxing music”).
- Our retail industry monitors how its assets are moving with such precision and efficiency that cheap same-day delivery is a luxury that many of us take for granted.
- The NYPD tracks its vehicles to allow us to hold it more accountable (e.g., for analyzing 911 response times).
These applications rely on a form of data that measures how things change over time, where time isn’t just a metric, but a primary axis. This is time-series data, and it’s starting to play a larger role in our world.
Software developer usage patterns already reflect this. In fact, over the past 24 months time-series databases (TSDBs) have steadily remained the fastest growing category of databases:
As the developers of an open source time-series database, my team and I are often asked about this trend. So I’ll start with a more in-depth description of time-series data and then jump into when you would need a time-series database.
What is time-series data?
Some think of “time-series data” as a sequence of data points, measuring the same thing over time, stored in time order. That’s true, but it just scratches the surface.
Others may think of a series of numeric values, each paired with a timestamp, defined by a name and a set of labeled dimensions (or “tags”). This is perhaps one way to model time-series data, but not a definition of the data itself.
Here’s a basic illustration. Imagine sensors collecting data from three settings: a city, farm, and factory. In this example, each of these sources periodically sends new readings, creating a series of measurements collected over time.
Here’s another example, with real data from the City of New York, showing taxicab rides for the first few seconds of 2018. As you can see, each row is a “measurement” collected at a specific time:
There are many other kinds of time-series data. To name a few: DevOps monitoring data, mobile/web application event streams, industrial machine data, scientific measurements.
These datasets primarily have 3 things in common:
- The data that arrives is almost always recorded as a new entry
- The data typically arrives in time order
- Time is a primary axis (time-intervals can be either regular or irregular)
In other words, time-series data workloads are generally “append-only.” While they may need to correct erroneous data after the fact, or handle delayed or out-of-order data, these are exceptions, not the norm.
You may ask: How is this different from just having a time field in a dataset? Well, it depends: how does your dataset track changes? By updating the current entry, or by inserting a new one?
When you collect a new reading for sensor_x, do you overwrite your previous reading, or do you create a brand new reading in a separate row? While both methods will provide you the current state of the system, only by writing the new reading in a separate row will you be able to track all states of the system over time.
Simply put: time-series datasets track changes to the overall system as INSERTs, not UPDATEs.
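To make the INSERT-versus-UPDATE distinction concrete, here’s a minimal sketch using Python’s built-in sqlite3 module. The table names, column names, and readings are made up for illustration; the point is only that the overwritten table ends up with one row, while the append-only table keeps every state of sensor_x.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Current-state model: one row per sensor, overwritten on every reading.
conn.execute(
    "CREATE TABLE sensor_state (sensor TEXT PRIMARY KEY, ts TEXT, temp REAL)"
)
# Time-series model: one row per reading, never overwritten.
conn.execute("CREATE TABLE sensor_readings (sensor TEXT, ts TEXT, temp REAL)")

# Hypothetical readings, invented for this example.
readings = [
    ("sensor_x", "2018-01-01T00:00:00", 20.5),
    ("sensor_x", "2018-01-01T00:01:00", 20.7),
    ("sensor_x", "2018-01-01T00:02:00", 21.1),
]

for sensor, ts, temp in readings:
    # UPDATE-style write: the previous reading for this sensor is replaced.
    conn.execute(
        "INSERT OR REPLACE INTO sensor_state (sensor, ts, temp) VALUES (?, ?, ?)",
        (sensor, ts, temp),
    )
    # INSERT-style write: every reading is kept as its own row.
    conn.execute(
        "INSERT INTO sensor_readings (sensor, ts, temp) VALUES (?, ?, ?)",
        (sensor, ts, temp),
    )

state_rows = conn.execute("SELECT ts, temp FROM sensor_state").fetchall()
history_rows = conn.execute(
    "SELECT ts, temp FROM sensor_readings ORDER BY ts"
).fetchall()

# The current state can always be recovered from the history,
# but the history cannot be recovered from the current state.
latest_from_history = history_rows[-1]
```

Both tables can answer “what is sensor_x reading right now?”, but only sensor_readings can answer “how did it get there?”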
This practice of recording each and every change to the system as a new, different row is what makes time-series data so powerful. It allows us to measure change: analyze how something changed in the past, monitor how something is changing in the present, predict how it may change in the future.
Put simply, here’s how I like to define time-series data: data that collectively represents how a system/process/behavior changes over time.
This is more than just an academic distinction. By centering our definition around “change”, we can start to identify time-series datasets that we aren’t collecting today, but that we should be collecting down the line. In fact, often people have time-series data but don’t realize it.
Imagine you maintain a web application. Every time a user logs in, you may just update a “last_login” timestamp for that user in a single row in your “users” table. But what if you treated each login as a separate event, and collected them over time? Then you could: track historical login activity, see how usage is (in-/de-)creasing over time, bucket users by how often they access the app, and more.
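Here’s a rough sketch of that idea, again with Python’s built-in sqlite3 module. The login_events table, its columns, and the sample logins are assumptions made up for this example, not a schema from any particular application.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One row per login event, instead of a single "last_login" column per user.
conn.execute("CREATE TABLE login_events (user_id INTEGER, logged_in_at TEXT)")

# Hypothetical login events, invented for this example.
events = [
    (1, "2018-01-01 09:00:00"),
    (2, "2018-01-01 09:30:00"),
    (1, "2018-01-02 08:45:00"),
    (1, "2018-01-03 10:15:00"),
    (2, "2018-01-03 11:00:00"),
]
conn.executemany("INSERT INTO login_events VALUES (?, ?)", events)

# The old "last_login" value is just one query over the event history.
last_login = dict(
    conn.execute(
        "SELECT user_id, MAX(logged_in_at) FROM login_events GROUP BY user_id"
    ).fetchall()
)

# Usage over time: number of logins per calendar day.
logins_per_day = conn.execute(
    "SELECT date(logged_in_at), COUNT(*) FROM login_events "
    "GROUP BY date(logged_in_at) ORDER BY date(logged_in_at)"
).fetchall()

# Activity per user, which could feed a "how often do they log in?" bucketing.
logins_per_user = dict(
    conn.execute(
        "SELECT user_id, COUNT(*) FROM login_events GROUP BY user_id"
    ).fetchall()
)
```

Nothing is lost by switching models: last_login is still available, and the per-day and per-user counts come for free from the same rows.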
This example illustrates a key point: by preserving the inherent time-series nature of our data, we are able to preserve valuable information on how that data changes over time. Another point: event data is also time-series data.
Of course, storing data at this resolution comes with an obvious problem: you end up with a lot of data, rather fast. So that’s the catch: time-series data piles up very quickly.
Having a lot of data creates problems both when recording it and when querying it in a performant way, which is why people are now turning to time-series databases.
Why do I need a time-series database?
You might ask: Why can’t I just use a “normal” (i.e., non-time-series) database?
The truth is that you can, and some people do. Yet why are TSDBs the fastest growing category of databases today? Two reasons: (1) scale and (2) usability.
Scale: Time-series data accumulates very quickly. (For example, a single connected car will collect 4,000 GB of data per day.) Normal databases are not designed to handle that scale. Relational databases fare poorly with very large datasets; NoSQL databases fare better at scale, but can still be outperformed by a database fine-tuned for time-series data. In contrast, time-series databases (which can be based on relational or NoSQL databases) handle scale by introducing efficiencies that are only possible when you treat time as a first-class citizen. These efficiencies result in performance improvements, including higher ingest rates, faster queries at scale (although some support more queries than others), and better data compression.
Usability: TSDBs also typically include functions and operations common to time-series data analysis such as data retention policies, continuous queries, flexible time aggregations, etc. Even if scale is not a concern at the moment (e.g., if you are just starting to collect data), these features can still provide a better user experience and make your life easier.
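To give a feel for what two of those features do, here’s a small pure-Python sketch of a time aggregation (averaging values into fixed-width time buckets) and a retention policy (dropping data older than a window). The function names and sample points are hypothetical, and this is not how any particular TSDB implements them; a real TSDB provides these as built-in operations so you don’t have to write them yourself.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def bucket_start(ts: datetime, width: timedelta) -> datetime:
    """Round a timestamp down to the start of its fixed-width bucket."""
    epoch = datetime(1970, 1, 1)
    return ts - (ts - epoch) % width


def average_by_bucket(points, width):
    """Average (timestamp, value) pairs within each time bucket."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[bucket_start(ts, width)].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}


def apply_retention(points, now, keep):
    """Keep only the points that fall inside the retention window."""
    cutoff = now - keep
    return [(ts, value) for ts, value in points if ts >= cutoff]


# Hypothetical, irregularly spaced readings.
points = [
    (datetime(2018, 1, 1, 0, 0, 10), 1.0),
    (datetime(2018, 1, 1, 0, 3, 0), 3.0),
    (datetime(2018, 1, 1, 0, 7, 30), 5.0),
    (datetime(2018, 1, 1, 0, 12, 0), 7.0),
]

# Five-minute averages over the raw readings.
five_minute_avg = average_by_bucket(points, timedelta(minutes=5))

# Retention: keep only the last ten minutes of data as of 00:15.
recent = apply_retention(
    points, now=datetime(2018, 1, 1, 0, 15), keep=timedelta(minutes=10)
)
```

Changing the bucket width from five minutes to an hour or a day is a one-argument change, which is the kind of flexibility the built-in versions aim to offer.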
This is why developers are increasingly adopting time-series databases and using them for a variety of use cases:
- Monitoring software systems: Virtual machines, containers, services, applications
- Monitoring physical systems: Equipment, machinery, connected devices, the environment, our homes, our bodies
- Asset tracking applications: Vehicles, trucks, physical containers, pallets
- Financial trading systems: Classic securities, newer cryptocurrencies
- Eventing applications: Tracking user/customer interaction data
- Business intelligence tools: Tracking key metrics and the overall health of the business
- (and more)
Even then, you’ll need to pick a time-series database that best fits your data model and write/read patterns.
A parting thought: Is all data time-series data?
For the past decade or so, we have lived in the era of “Big Data”, collecting massive amounts of information about our world and applying computational resources to make sense of it.
Even though this era started with modest computing technology, our ability to capture, store, and analyze data has improved at an exponential pace, thanks to major macro-trends: Moore’s law, Kryder’s law, cloud computing, an entire industry of “big data” technologies.
Now we need more. We are no longer content to just observe the state of the world, but we now want to measure how our world changes over time, down to sub-second intervals. Our “big data” datasets are now being dwarfed by another type of data, one that relies heavily on time to preserve information about the change that is happening.
Does all data start off as time-series data? Recall the earlier web application example: we had time-series data but didn’t realize it. Or think of any “normal” dataset. Say, the current accounts and balances at a major retail bank. Or the source code for a software project. Or the text for this article.
Typically we choose to store the latest state of the system, but instead, what if we stored every change and computed the latest state at query time? Isn’t a “normal” dataset just a view on top of an inherently time-series dataset (cached for performance reasons)? Don’t banks have transaction ledgers? (And aren’t blockchains just distributed, immutable time-series logs?) Wouldn’t a software project have version control (e.g., git commits)? Doesn’t this article have revision history? (Undo. Redo.)
Put differently: Don’t all databases have logs?
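As a toy illustration of “store every change, compute the latest state at query time,” here’s a sketch of the bank example in Python. The ledger entries, account names, and the balances_as_of function are all invented for this example.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical append-only ledger: (timestamp, account, signed amount).
ledger = [
    (datetime(2018, 1, 1, 9, 0), "alice", 100.0),
    (datetime(2018, 1, 1, 9, 5), "bob", 50.0),
    (datetime(2018, 1, 2, 14, 0), "alice", -30.0),
    (datetime(2018, 1, 3, 11, 0), "bob", 20.0),
]


def balances_as_of(entries, as_of):
    """Fold every change up to `as_of` into a current-state view."""
    balances = defaultdict(float)
    for ts, account, amount in entries:
        if ts <= as_of:
            balances[account] += amount
    return dict(balances)


# The "normal" dataset: current balances, derived from the full history.
current = balances_as_of(ledger, datetime(2018, 1, 31))

# A view that a current-state table alone could not reproduce:
# balances as they stood at the end of January 1.
on_jan_1 = balances_as_of(ledger, datetime(2018, 1, 1, 23, 59))
```

The current balances are one particular view over the ledger; any earlier point in time is just a different argument to the same function.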
We recognize that many applications may never require time-series data (and would be better served by a “current-state view”). But as we continue along the exponential curve of technological progress, it would seem that these “current-state views” become less necessary, and that by storing more and more data in its time-series form, we may be able to understand it better.
So is all data time-series data? I’ve yet to find a good counterexample. If you’ve got one, I’m open to hearing it. Regardless, one thing is clear: time-series data already surrounds us. It’s time we put it to use.