When Boring is Awesome: Building a scalable time-series database on PostgreSQL
(Update: Follow the discussion on this Hacker News thread.)
TimescaleDB is engineered up from PostgreSQL (packaged as an extension) and yet scales out horizontally, which means it supports normal SQL and all of the features you expect from a relational database: JOINs, secondary indexes, complex predicates and aggregates, window functions, CTEs, etc.
- Looks, feels, speaks just like PostgreSQL. Including a normal SQL (not “SQL-like”) query interface. Existing clients, connectors, and BI tools (e.g., Tableau) work out of the box. Designed to be stupidly easy to use. If you know SQL (and especially PostgreSQL), you already know TimescaleDB.
- Scalable. Everyone “knows” that RDBMSs like PostgreSQL do not scale well. We solve that problem, introducing horizontal scale-out, automatic space/time partitioning, and distributed query optimizations for time-series data. Our latest benchmarks consistently show a greater than 15x improvement on inserts versus vanilla PostgreSQL on a dataset of 250 million rows, achieving constant insert throughput as the database grows (135K writes per second per node, where each write is a row of 10 metrics). We expect the insert rate to stay in that ballpark as the dataset grows further.
- Reliable. Even though TimescaleDB is new, we benefit from 20+ years of work in PostgreSQL reliability and tooling. We stand on the shoulders of giants.
(For more technical details, please refer to our documentation.)
In this age of new and shiny open source data projects, something as old as PostgreSQL can seem boring. But sometimes boring is awesome, especially when it’s your database. TimescaleDB is designed to just work, not wake you up at 3am.
If you have any kind of time-series data, and you like SQL/PostgreSQL, then please give TimescaleDB a whirl and let us know how it goes. We appreciate any feedback (and we’re pretty friendly folks).
You can install TimescaleDB via Homebrew, Docker, or from source. More information on GitHub.
But isn’t there already a glut of time-series databases? Did we really have to build yet another one?
(Read on, padawan…)
Why build yet another time-series database?
Seems like time-series databases (i.e., databases optimized for data captured over time, for example, sensor data, financial data, DevOps data, etc.) are in vogue these days.
There have been a number of blog posts on the subject over the past few years (including these gems by Baron Schwartz (2014) and Jason Moiron (2015)), and a plethora of new open source time-series databases, each with their own trade-offs. (Here’s one list and its methodology.)
We can’t read minds, but we imagine that the developers behind each of those projects built their own time-series database because traditional RDBMS (e.g., PostgreSQL, MySQL) didn’t scale for their needs.
That was the same problem we faced a year ago, when we needed a database to store sensor data for the IoT platform we were building at the time. We loved PostgreSQL, but “knew” that it inherently wouldn’t scale for our needs.
We tested some of the options in the list above, and saw that they all scaled pretty well, but sacrificed query power in exchange, and failed to support a number of key SQL capabilities.
So we had a choice: scalability or query power. That made us sad. We needed both.
In particular, we needed:
- Scalable ingest, reads, and deletes. Some of our customers collected data at kilohertz, so we knew our dataset would get large very quickly. We needed inserts, reads, and deletes (in particular, bulk deletes) to perform well, even at scale. (On the other hand, we didn’t need to optimize for updates, as we expected to perform those rarely.)
- Performant complex queries. We needed support for complex predicates (e.g., readings where temperature and cpu are above certain thresholds), non-time based aggregates over a fixed window (e.g., total number of errors by firmware version), multiple aggregates (e.g., group by device type and time), flexible ordering (e.g., top 10 devices by usage), JOINs (e.g., join sensor data with device metadata), etc., at latencies that satisfied our APIs and dashboards.
- Multiple data type options. We wanted floats, integers, strings, booleans, arrays, JSON blobs. Support for geospatial data types as a bonus.
- An easy to use query language. We didn’t want to have to learn a completely new query language and build brand new connectors. We wanted something like SQL. Actually, we wanted SQL.
- A reliable and easy to operate database. We didn’t want to get woken up at 3am because our database crashed (and we sure didn’t want to lose data). We wanted something boring, something that would just work, not something fancy and experimental. We also wanted something with rich tooling and an abundant software ecosystem, so we wouldn’t always be waiting for (or needing to write) the next connector or integration for other systems we were using.
When we looked at this list, we realized that we needed something like PostgreSQL. In fact, if we could only solve the PostgreSQL scalability problem, we’d have the perfect time-series database: scalable, easy to use, and reliable. (PostgreSQL even has support for geospatial data types and queries, via PostGIS.)
Could this be possible? Being a group of computer science PhDs and academics (including one tenured Professor), we decided to find out for ourselves, and determined that the nature of time-series workloads lend themselves to a new database architecture that could offer both scale and SQL.
And then we built it.
And then we benchmarked it, and found that our database outperformed PostgreSQL by more than 15x on inserts on large datasets. In particular, we found that as the dataset size grows, the insert rate for PostgreSQL drops off dramatically, while our insert rate remains high:
- Each row contains 10 metrics and a timestamp.
- Each batch contains 10,000 rows written at once across any partitions (similar to what one would expect in production, e.g., when consuming data off of a message bus like Kafka).
- This is on an Azure standard DS4 v2 machine (8 cores), SSD (premium LRS storage).
Finally, scale and SQL. This made us happy.
(So why build yet another time-series database? Because we had to.)
Really, scale and normal SQL? (Demo)
Get ready for the world’s most boring database demo, because TimescaleDB’s query language is just normal SQL:
This obviously is just a sample. For the full documentation on what kinds of queries we support, please refer here.
What’s going on behind the scenes
“So what’s the big deal? This looks just like normal PostgreSQL…”
There are actually 5 key things happening behind the scenes:
- Automatic space-time partitioning: We take advantage of two major attributes of time-series workloads: that all data has a primary key and a timestamp, and that inserts are largely append-only (writes to most recent interval, infrequent updates). This allows us to automatically partition incoming data for a given table by time and space (primary key) into 2D “chunks” (stored internally as PostgreSQL tables), and create new chunks on demand as necessary, all done transparently to the user.
- Right-sized data chunks: Our engine ensures that chunks are right sized and time-interval aligned to ensure that the multiple B-trees for a table’s indexes can reside in memory during inserts to avoid thrashing. This also allows us to delete data by dropping entire chunks, rather than needing to delete individual rows, thus avoiding expensive vacuuming operations.
- Distributed query optimizations: When queries arrive, we avoid querying extraneous chunks via constraint exclusion analysis, and then employ different techniques to parallelize the query across the remaining chunks efficiently.
- The “hypertable” abstraction: All this complexity is hidden from the user behind an abstraction we call a “hypertable”, which provides the illusion of a single table across all space and time (despite the 2D chunking). Inserts are written to the hypertable; the database automatically partitions the data, writes to the appropriate chunk, and creates new chunks if necessary. Similarly, queries are run against the hypertable; the database automatically runs the distributed query optimizations, determines the minimally necessary set of chunks to query, pushes down any further optimizations to these chunks, and returns the appropriate data.
- Tight integration with PostgreSQL: All this occurs tightly integrated with the PostgreSQL query parser, which allows the database to support the entire spectrum of PostgreSQL commands and then run its own query planner and optimizations.
For example, each of the queries above is running against a hypertable, allowing the database to hide the complexity of the system from the user.
But this just scratches the surface. For more on our technical architecture, take a look at our documentation.
Why this matters: time-series data is sprouting up everywhere
“Fine”, you might say, “you guys built something that only works for your weirdo IoT backend.”
That’s what we thought too. But as we made the rounds talking about our IoT platform, people would respond: “We’re building our own IoT platform, so we can’t use yours. But, tell us more about this time-series database you built?”
Then, they’d add, “You know, forget IoT, we have a lot of time-series data in general. Could your database help there too?”
And strangely, we heard the same thing from our friends in other industries: they had a growing amount of time-series data and needed something better than existing databases.
Eureka. Our time-series database was solving a bigger problem.
We realized that time-series data, which used to be this niche thing within finance and DevOps, was sprouting up everywhere. We realized that fundamental shifts in computing — more sources of data, fatter pipes, cheaper storage — were creating new currents of time-series data streams. And that analyzing these new datasets across time was powerful, enabling us to monitor the present, understand historical trends, troubleshoot the past, predict the future.
We also noticed that even traditional time-series data applications were becoming more complex: e.g., in DevOps, needing to correlate application performance across microservices; in finance, needing to monitor payment transactions and other customer interaction data in real-time.
There was also another trend at work: the resurgence of SQL. Recent posts like these from Percona (March 27, 2017), Baron Schwartz (March 19, 2017), and Paris Kasidiaris (March 13, 2017) capture the sentiment well. The pendulum is swinging back towards boring SQL. In fact, “NoSQL” databases seem to be rebranding themselves to mean “not only SQL”, rather than outright rejecting SQL.
We realized that our database, which sat at the intersection of the “rise of time-series data” and the “resurgence of SQL”, might actually be useful to other people.
That’s why we decided last fall to change directions and go all in on the database. After several months of heads-down work, we just open sourced it last month under the Apache 2 license.
When you might want to consider alternatives
That said, TimescaleDB can’t solve everyone’s problems. In particular, there are 3 time-series scenarios where there may be better alternatives:
- Simple read requirements: When most of your query patterns are simple in nature (e.g., simple key-value lookups, or rollup of a single metric over time).
- Low available storage: When resource constraints place storage at a premium, and heavy compression is required, even at the cost of query power and ease of use. (Although this is an area of active development, and we expect TimescaleDB to improve.)
- Sparse and/or unstructured data: When your time-series data is especially sparse and/or generally unstructured. (But even if your data is partially structured, TimescaleDB includes a JSONB field type for the unstructured part(s). This allows you to maintain indexes on the structured parts of your data combined with the flexibility of unstructured storage.)
TimescaleDB is the first open source time-series database that offers normal SQL at scale. It acts like a relational database yet scales linearly for time-series data.
TimescaleDB is in active development by a team of PhDs based in New York City, Stockholm, and Los Angeles. A single-node version is currently available for download. A clustered version is in the works.
We scratched our own itch, and hope it now scratches yours. If it does, or you think it might and want to learn more, we’d love to hear from you at email@example.com.