A look into what causes high-cardinality, and how time-series databases handle this problem.
If you are working with a database, especially with time-series data, then you have likely faced the challenge of handling high-cardinality data.
In particular, time-series high-cardinality is a common problem in industrial IoT (e.g., manufacturing, oil & gas, utilities), as well as some monitoring and event data workloads.
High-cardinality is also a topic that we (as developers of a time-series database) end up discussing a lot, and it continues to surprise us how much confusion there is around it. Given that, we wanted to share our experience with this problem.
But before we begin, let’s clear up one common point of confusion: high-cardinality has only become such a big issue in the time-series world because of the limitations of some popular time-series databases. In reality, high-cardinality data is actually a solved problem, if one chooses the right database.
For example, here is how TimescaleDB and InfluxDB perform on ingesting data as the cardinality of your dataset grows:
(And to avoid any doubt, yes we are the creators of TimescaleDB, but in this post we aim to be impartial and honest. For example, our InfluxDB benchmarks actually outperform what InfluxData posts on their own website.)
Later in this post we’ll explain why the two databases differ so much, and how we generated these results.
But first, let’s talk about high-cardinality.
What is high-cardinality?
Broadly defined, cardinality refers to the number of values in a set. Sometimes the cardinality of your set is small (low cardinality), and other times it can be large (high cardinality). For example, there are quite a few (delicious) M&Ms in our image above, but the cardinality of that dataset is quite small (6):
In the world of databases, cardinality refers to the number of unique values contained in a particular column, or field, of a database.
However, with time-series data, things get a bit more complex.
Time-series data tends to be paired with metadata (sometimes called “tags”) that describes that data. Often that primary time-series data or the metadata is indexed for faster query performance, so that you can quickly find the values that match all of the specified tags. The cardinality of a time-series dataset is typically defined by the cross-product of the cardinality of each individual indexed column. So if there are 6 colors of M&Ms, but also 5 types of M&Ms (plain, peanut, almond, pretzel, and crispy), then there our cardinality is now 6x5 = 30 total options for M&Ms. Having the right indexes would then allow us to efficiently find all blue, crispy M&Ms (which are objectively the best).
If you have multiple indexed columns, each with a large number of unique values, then the cardinality of that cross-product can get really large. That’s what software developers typically mean when they talk about a time-series dataset with “high-cardinality.”
Let’s look at an example.
High-cardinality example: Industrial IoT
Imagine an IoT scenario where there are large, heavy pieces of equipment mining rock, crushing rock, and sorting rock in a certain quarry.
Say there are 10,000 pieces of equipment, each with 100 sensors, running 10 different firmware versions, spread across 100 sites:
timestamp | temper | mem_f | equipm | senso | firmwar | sit | (lat,long) | ature | ree | ent_id | r_id | e_version | e_id | --------------------+-------+--------+-------------------+------+----------- 2019-04-04 | 85.2 | 10.2 | 1 | 98 | 1.0 | 4 | (x,y) 09:00:00 | | | | | | | 2019-04-04 | 68.8 | 16.0 | 72 | 12 | 1.1 | 20 | (x1,y1) 09:00:00 | | | | | | | 2019-04-04 | 100.0 | 0.0 | 34 | 58 | 2.1 | 55 | (x2,y2) 09:00:00 | | | | | | | 2019-04-04 | 84.8 | 9.8 | 12 | 75 | 1.4 | 81 | (x3,y3) 09:00:00 | | | | | | | 2019-04-04 | 68.7 | 16.0 | 89 | 4 | 2.1 | 13 | (x4,y4) 09:00:00 | | | | | | | ... | | | | | | |
The maximum cardinality of this dataset then becomes 1 billion [10,000 x 100 x 10 x 100].
Now imagine that the equipment can move as well, and we’d like to store the precise GPS location (lat, long), and use that as indexed metadata to query by. Because (lat, long) is a continuous field (as opposed to a discrete field like equipment_id), by indexing on location the max cardinality of this dataset is now infinitely large (unbounded).
How time-series databases InfluxDB and TimescaleDB handle high-cardinality
Given how common high-cardinality datasets are within time-series, let’s take a look at how two time-series databases, InfluxDB and TimescaleDB, handle this issue.
InfluxDB is a NoSQL database for which its creators have chosen to rebuild everything from scratch. In contrast, TimescaleDB is a SQL database for which its creators (namely, your authors) have chosen to embrace and build on top of PostgreSQL and proven data structures, and then further extend it for time-series problems.
First, here is a comparison on how the two databases perform on inserts as the cardinality of the dataset increases.
Note: for the below comparison we used the following setup:
- TimescaleDB version 1.2.2, InfluxDB version 1.7.6
- 1 remote client machine, 1 database server, both in the same cloud datacenter
- AWS EC2 instance: i3.xlarge (4 vCPU, 30GB memory)
- 4 1-TB disks in a raid0 configuration (EXT4 filesystem)
- Both databases were given all available memory
- Dataset: 100–1,000,000 simulated devices generated 1–10 CPU metrics every 10 seconds for ~100M reading intervals, ~1B metrics (1 month interval for 100 devices; 3 days for 4,000; 3 hours for 100,000; 3 minutes for 1,000,000), generated with the Time Series Benchmark Suite (TSBS).
- Schemas used for TimescaleDB (1) and InfluxDB (2)
- 10K batch size was used for both on inserts
- For TimescaleDB, we set the chunk size depending on the data volume, aiming for 10-15 chunks (more here)
- For InfluxDB, we enabled the TSI (time series index)
(1) TimescaleDB schema: Table cpu(time timestamp, tags_id integer, usage_user double, usage_system double, usage_idle double, usage_nice double, usage_iowait double, usage_irq double, usage_softirq double, usage_steal double, usage_guest double, usage_guest_nice double, additional_tags jsonb); Index on (tags_id, time) and (time, tags_id); Table tags(id integer, hostname text, region text, datacenter text, rack text, os text, arch text, team text, service text, service_version text, service_environment text) Unique index on all the columns together
(2) InfluxDB schema: Field Keys(usage_guest integer, usage_guest_nice integer, usage_idle integer, usage_iowait integer, usage_irq integer, usage_nice integer, usage_softirq integer, usage_steal integer, usage_system integer, usage_user integer), Tag Keys(arch, datacenter, hostname, os, rack, region, service, service_environment, service_version, team)
Note: a more detailed overall comparison of these two databases can be found here.
As you can see, at low cardinality both databases are comparable (although TimescaleDB outperforms by 30%). But as cardinality increases, the difference is quite remarkable, as the insert performance for TimescaleDB degrades far more slowly than that of InfluxDB, which falls rather precipitously. At high cardinalities, TimescaleDB outperforms InfluxDB by over 11x.
But why does this happen? Let’s take a closer look at how these two databases are being developed.
B-trees vs. the TSI: Two different approaches for handling high-cardinality
We can trace the difference in high-cardinality performance to fundamentally different engineering decisions in InfluxDB vs. TimescaleDB.
InfluxDB and the TSI
Since high-cardinality has been a well-known challenge for InfluxDB, their team has been working on something they call the “Time Series Index” (TSI) to address this problem.
Consistent with their approach in other areas, the InfluxDB TSI is a home-grown log-structured merge tree based system comprised of various data structures, including hashmaps and bitsets. This includes: an in-memory log (“LogFile”) that gets periodically flushed to disk when it exceeds a threshold (5MB) and compacted to an on-disk memory-mapped index (“IndexFile”); a file (“SeriesFile”) that contains a set of all series keys across the entire database. (Described here in their documentation.)
The performance of the TSI relies on the interactions of all of these data structures. However, because the TSI is custom-built, understanding how it performs under various high-cardinality workloads becomes difficult to understand.
The design decisions behind the TSI also leads to a few limitations with performance implications:
- That total cardinality limit, according to the InfluxDB documentation, is around 30 million (although based on the graph above, InfluxDB starts to perform poorly well before that), or far below what is often required in time-series use cases like IoT (including our example above).
- InfluxDB indexes tags but not fields, which means that certain queries can not perform better than full scans. So, using our earlier IoT dataset as an example, if one wanted to search for all rows where there was no free memory (e.g, something like,
SELECT * FROM sensor_data WHERE mem_free = 0), one could not do better than a full linear scan (i.e., O(n) time) to identify the relevant data points.
- The set of columns included in the index is completely fixed and immutable. Changing what columns in your data are indexed (tagged) and what things are not requires a full rewrite of your data.
- InfluxDB is only able to index discrete, and not continuous, values due to its reliance on hashmaps. For example, to search all rows where temperature was greater than 90 degrees (e.g., something like
SELECT * FROM sensor_data WHERE temperature > 90), one would again have to fully scan the entire dataset.
- Your cardinality on InfluxDB is affected by your cardinality across all time, even if some fields/values are no longer present in your dataset. This is because the SeriesFile stores all series keys across the entire dataset.
TimescaleDB and B-trees
In contrast, TimescaleDB is a relational database that relies on a proven data structure for indexing data: the B-tree. This decision leads to its ability to scale to high cardinalities.
First, TimescaleDB partitions your data by time, with one B-tree mapping time-segments to the appropriate partition (“chunk”). All of this partitioning happens behind the scenes and is hidden from the user, who is able to access a virtual table (“hypertable”) that spans all of their data across all partitions.
Next, TimescaleDB allows for the creation of multiple indexes across your dataset (e.g., for equipment_id, sensor_id, firmware_version, site_id). These indexes are then created on every chunk, by default in the form of a B-tree. (One can also create indexes using any of the built-in PostgreSQL index types: Hash, GiST, SP-GiST, GIN, and BRIN.)
This approach has a few benefits for high-cardinality datasets:
- The simpler approach leads to a clearer understanding of how the database performs. As long as the indexes and data for the dataset we want to query fit inside memory, which is something that can be tuned, cardinality becomes a non-issue.
- In addition, since the secondary indexes are scoped at the chunk level, the indexes themselves only get as large as the cardinality of the dataset for that range of time.
- You have control over which columns to index, including the ability to create compound indexes over multiple columns. You can also add or delete indexes anytime you want, for example if your query workloads change. Unlike in InfluxDB, changing your indexing structure in TimescaleDB does not require you to rewrite the entire history of your data.
- You can create indexes on discrete and continuous fields, particularly because B-trees work well for a comparison using any of the following operators: <, <=, =, >=, >, BETWEEN, IN, IS NULL, IS NOT NULL. Our example queries from above (
SELECT * FROM sensor_data WHERE mem_free = 0and
SELECT * FROM sensor_data WHERE temperature > 90) will run in logarithmic, or O(log n), time.
- The other supported index types can come in handy in other scenarios, e.g., GIST indexes for “nearest neighbor” searches.
Where to go from here
As demonstrated above, you can easily have a time-series dataset with a high cardinality (e.g., in IoT). However, this doesn’t have to be a problem as long as you choose a time-series database that is designed to handle high cardinality.
If this is a problem that you are wrestling with, then please give TimescaleDB a try and let us know how we can help. If you are already using InfluxDB and want to test how TimescaleDB performs with your workload, we’ve built a tool to help you do that.