How to Store Time-Series Data in MongoDB and Why That’s a Bad Idea

Up to 2.6x the insert performance, 54x faster queries, and a simpler implementation when using TimescaleDB vs. MongoDB for time-series data.

Time-series data has exploded in popularity, and the value of tracking and analyzing how things change over time has become evident in every industry: DevOps and IT monitoring, industrial manufacturing, financial trading and risk management, sensor data, ad tech, application eventing, smart home systems, autonomous vehicles, and more.

But 2020 has provided us with the most personal example of how time-series data collection and analysis affects our daily lives, with billions of people across the globe becoming relentless consumers of time-series data, demanding accurate and timely information about the daily trend of various COVID-19 statistics.

The biggest challenge with storing time-series data? Scale, both in collecting and storing it. Data accumulates quickly and requires a database that can keep up with a relentless stream of writes from the systems you care about. We’ve shown previously that SQL and relational databases can reach petabyte-scale and beyond, but many developers’ first inclination is still to reach for a NoSQL database when scale is a requirement (perhaps due to the breakthroughs made by NoSQL databases in the early 2000s).

Enter MongoDB as a time-series solution. MongoDB is among the best-known NoSQL databases, emerging at the end of the last decade to become the face of NoSQL and the foundation of a nearly $21 billion company (as of writing). MongoDB became popular as a simple document store for quickly prototyping and easily scaling web apps.

But, over the years, developers have started using MongoDB to fill all sorts of database needs across various domains, including using MongoDB to store and analyze time-series data at scale.

Here are a few examples of posts we’ve found on the topic of storing time-series data in MongoDB, with sources ranging from the official MongoDB blog to popular technical how-to sites, like Dev.to and Quora:

Part 2 of a series on storing time-series data in MongoDB on the official MongoDB blog.
Post on dev.to about how to implement time-series data in MongoDB.
Post on Quora, the popular Q&A platform, about the best way to store time-series data in MongoDB.

But is MongoDB really the right solution for time-series data? We decided to evaluate it for ourselves, with the obvious caveat that we are the creators of a competing product. Yet, you’ll see we try to keep our analysis as fair as possible, trying multiple approaches to storing time-series data in MongoDB.

Our conclusion is that while MongoDB’s JSON-like document store may make it a jack-of-all-trades type of database and perhaps a master of some (e.g., web applications), time series is not one of them. You’re better off going with a purpose-built time-series database for both performance and ease of use.

Our Evaluation: TimescaleDB vs MongoDB

For this analysis, we evaluated MongoDB vs. TimescaleDB, the leading open-source relational database for time-series data (and our own product). Engineered atop PostgreSQL, TimescaleDB is designed for fast ingest, complex queries, and ease of use, with powerful tools for analysis, retention, and management of time-series data. Given its PostgreSQL foundation, it inherits the rock-solid reliability, tooling, and vast ecosystem of Postgres, as well as SQL, the commonly known, well-documented query language that’s popular for data analysis.

We evaluated two methods of using MongoDB as a time-series database:

  1. “Mongo-naive”: a naive, document-per-event method.
  2. “Mongo-recommended”: a method recommended by MongoDB users and MongoDB itself that aggregates events into hourly documents.

We compared MongoDB and TimescaleDB across several dimensions:

  • Insert (write) performance
  • Query (read) performance
  • Disk storage footprint
  • Implementation cost and code maintenance

Here’s a summary of our results. We go into much more detail about our methodology later in the post:

The first method, which we’ll call “Mongo-naive” throughout this post, has fast writes and is extremely simple to implement but offers dismal query performance, even on simple analytical queries.

The second method, which we’ll call “Mongo-recommended” throughout this post, demonstrates 550-6,800 % of the query performance of Mongo-naive (method 1). This performance difference makes it clear why it is the recommended method for storing time-series data in MongoDB, despite having slower ingest performance (1.5x slower) and higher disk usage (16 % more) than Mongo-naive.

However, this method has worse write performance, higher storage footprint, and higher implementation complexity than both Mongo-naive and TimescaleDB. Mongo-recommended also fails to deliver good query performance on more complex time-series queries compared to TimescaleDB.

More specifically, compared to MongoDB, TimescaleDB exhibits:

  • 169 % (vs. method 1) to 260 % (vs. method 2) of MongoDB’s write performance
  • 10x smaller disk storage footprint than both MongoDB methods, using 9 % of the storage footprint of Mongo-naive and 8 % of the storage footprint of Mongo-recommended.
  • 200 % to 5,400 % of MongoDB’s query performance, performing considerably faster on complex queries, which are commonly used to analyze and monitor devices for DevOps and IoT use cases
  • Simpler implementation and code maintenance, especially when compared to method 2, the method MongoDB itself recommends.

Want to learn more about TimescaleDB?

Create a free account to get started with a fully managed TimescaleDB instance (100 % free for 30 days).

Want to host TimescaleDB yourself? Visit our GitHub to learn more about options, get installation instructions, and more (and, as always, ⭐️ are appreciated!)

(P.S. If you’re looking for more comparisons of database performance for time-series data, read our studies comparing Amazon Timestream vs. TimescaleDB and InfluxDB vs. TimescaleDB.)

In the remainder of this post, we’ll outline the methodology and results of a detailed set of benchmarks comparing TimescaleDB vs. MongoDB across inserts, queries, and ease of use.

MongoDB Time-Series Methods

As mentioned, we tested two methods for storing time-series data in MongoDB, and before diving into write and read performance numbers, let’s take a moment to examine each method in a bit more detail.

We’ll use the common time-series scenario of DevOps metrics (in this case, storing CPU metrics) to demonstrate how to store time-series data in MongoDB.

Method 1: Document per event (a.k.a. “Mongo-naive”)

The first method, which we’ll refer to as “Mongo-naive,” is straightforward: each time-series reading is stored as a document. This is the first approach that would come to mind for most developers when trying to store time-series data in MongoDB, hence the name "naive".

So, for our given use case (see setup below) of monitoring CPU metrics, the JSON document looks like this:

{
    "measurement" : "cpu",
    "timestamp_ns" : NumberLong("1451606420000000000"),
    "fields" : {
        "usage_user" : 98.15664166391042,
        "usage_guest_nice" : 85.53066998716453,
        ...
    },
    "tags" : {
        "hostname" : "host_2019",
        "datacenter" : "us-east-1b",
        ...
    }
}

Conceptually and in implementation, this method is very simple, so it seems like a tempting route to go: batch all measurements that occur at the same time into one document along with their associated tags and store them as one document.

Indeed, this approach yields very good write performance, and it is fairly easy to implement (in our tests, we saw noticeably better results than InfluxDB reports in their comparison). However, we’ll see later that, even with indexes, the query performance with this method leaves much to be desired.
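To make the naive approach concrete, here is a minimal Python sketch of assembling such a per-event document (our own illustration: the field names follow the example above, while the `make_event_doc` helper and the pymongo calls mentioned in the comment are assumptions, not benchmark code):

```python
def make_event_doc(measurement, timestamp_ns, fields, tags):
    """Build one per-event document, mirroring the Mongo-naive layout above."""
    return {
        "measurement": measurement,
        "timestamp_ns": timestamp_ns,  # stored as a 64-bit integer by the driver
        "fields": fields,              # metric name -> float reading
        "tags": tags,                  # hostname, datacenter, ...
    }

doc = make_event_doc(
    "cpu",
    1451606420000000000,
    {"usage_user": 98.15664166391042, "usage_guest_nice": 85.53066998716453},
    {"hostname": "host_2019", "datacenter": "us-east-1b"},
)
# With pymongo, each reading would then be written via collection.insert_one(doc),
# or collection.insert_many(...) for batches.
```

There is no client-side bookkeeping at all, which is exactly why this method is so easy to implement and so fast to write.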

Method 2: Aggregated document per hour (a.k.a. “Mongo-recommended”)

The second method is one that MongoDB users and the company itself recommend when it comes to time series, which we call “Mongo-recommended.” The engineering idea behind this method is clever: create a document for each device, for each hour, which contains a 60 by 60 matrix representing every second of every minute in that hour. This document is then updated each time a new reading comes in, rather than doing a new document insert:

{
  "measurement" : "cpu",
  "doc_id" : "host_15_20160101_00",  // for quick update lookup
  "key_id" : "20160101_00",  // YYYYMMDD_hh
  "tags" : {
    "hostname" : "host_6",
    "datacenter" : "ap-southeast-1b",
    ...
  },
  "events" : [
    [
      {
        "timestamp_ns" : NumberLong("1451606420000000000"),
        "usage_user" : 98.15664166391042,
        "usage_guest_nice" : 85.53066998716453,
        ...
      },
      ... // (59 elements elided)
    ],
    ... // (59 elements elided)
  ]
}

This method makes it possible to do some efficient filtering at query time but comes with a more cumbersome implementation and decreased (albeit not terrible) write performance. For example, to manage writes efficiently, the client must keep a cache of which documents have already been created so that a more costly “upsert” (i.e., insert if it doesn’t exist; otherwise, update) pattern is not needed.
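To make the cache-and-update flow concrete, here is a minimal Python sketch (our own illustration: the key formats follow the document example above, but the `seen_docs` cache, the pre-allocated matrix, and the helper names are hypothetical, not the benchmark's actual client code):

```python
from datetime import datetime, timezone

def hour_keys(hostname, timestamp_ns):
    """Derive the per-hour document keys used in the example above."""
    dt = datetime.fromtimestamp(timestamp_ns / 1e9, tz=timezone.utc)
    key_id = dt.strftime("%Y%m%d_%H")   # e.g. "20160101_00" (YYYYMMDD_hh)
    doc_id = f"{hostname}_{key_id}"     # e.g. "host_6_20160101_00"
    return doc_id, key_id, dt.minute, dt.second

seen_docs = set()  # client-side cache of documents already created

def write_reading(hostname, timestamp_ns, fields):
    """Return the operation to issue: a fresh insert or an in-place $set."""
    doc_id, key_id, minute, second = hour_keys(hostname, timestamp_ns)
    if doc_id not in seen_docs:
        seen_docs.add(doc_id)
        # First reading this hour: insert a pre-allocated 60x60 events matrix.
        return ("insert", {
            "doc_id": doc_id, "key_id": key_id,
            "events": [[None] * 60 for _ in range(60)],
        })
    # Otherwise, set one cell of the matrix, avoiding a costly upsert.
    return ("update", {"filter": {"doc_id": doc_id},
                       "update": {"$set": {f"events.{minute}.{second}": fields}}})

op, payload = write_reading("host_6", 1451606420000000000, {"usage_user": 98.2})
```

The cache means the common case is a targeted update of a single matrix cell; only the first reading of each device-hour pays the cost of creating a new, larger document.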

Further, while queries for this method are typically more performant, we found that designing the query in the first place requires more effort than method 1, especially when reasoning about which aggregate documents can be filtered/pruned.

Finally, this approach limits the granularity of your data. In particular, if you wanted to support millisecond precision, you would have to change the design to aggregate on a minute basis, as the max document size in MongoDB (16MB) does not lend itself to further nesting.

Going beyond millisecond precision is probably infeasible, as MongoDB’s document size limitation means that you would probably need to create a document per second for each device, and the nesting required would make the query construction process extremely complex.
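A rough back-of-envelope check shows why (our own arithmetic; the ~100 bytes per encoded reading is an assumed figure, not a measured one):

```python
MAX_DOC_BYTES = 16 * 1024 * 1024   # MongoDB's per-document limit (16 MB)
BYTES_PER_READING = 100            # assumed rough size of one encoded reading

# One document per device per hour, at second granularity: 60 * 60 readings.
per_hour_seconds = 60 * 60 * BYTES_PER_READING        # ~360 KB -> fits easily

# The same per-hour document at millisecond granularity: 1000x more readings.
per_hour_millis = 60 * 60 * 1000 * BYTES_PER_READING  # ~360 MB -> far too big

# Hence the redesign: one document per device per *minute* of milliseconds.
per_minute_millis = 60 * 1000 * BYTES_PER_READING     # ~6 MB -> fits

assert per_hour_seconds < MAX_DOC_BYTES
assert per_hour_millis > MAX_DOC_BYTES
assert per_minute_millis < MAX_DOC_BYTES
```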

Because the data in our evaluation was at the granularity of seconds, not milliseconds, and given the query performance we saw (as detailed in the next section), we ultimately decided that this was the best method for comparison against TimescaleDB. (We include Mongo-naive write and read performance numbers to show how we reached this conclusion.)

The Horse Race: TimescaleDB vs. MongoDB

Once again, we’ll use the common time-series scenario of DevOps metrics to benchmark the performance of the two MongoDB methods and TimescaleDB.

Note: We've released all the code and data used for the benchmarks below as part of the open-source Time Series Benchmark Suite (TSBS).

Machine configuration

For comparing both insert and read latency performance, we used the following setup:

  • Version: TimescaleDB version 1.7.1 community edition, with PostgreSQL 12, MongoDB 4.2.8 community edition. These were the latest production releases of both databases as of July 2020, at the time of testing.
  • 1 remote client machine, 1 database server, both in the same cloud data center.
  • Instance size: both client and database server ran on DigitalOcean virtual machines (droplets) with 32 vCPU and 192 GB of memory each.
  • OS: Both server and client machines ran Ubuntu 18.04.3
  • Disk Size: 4 TB of remote SSD storage in a RAID 0 configuration (EXT4 filesystem), plus 800 GB of local SSD storage.
  • Deployment method: database servers were deployed using Docker images, using images pulled from the official Docker hubs of MongoDB and Timescale, respectively.
  • Memory: both databases were given all available memory.

Benchmarking dataset

To benchmark both write and read performance, we used the following dataset:

  • Dataset: 4,000 simulated devices generated 10 CPU metrics every 10 seconds for 4 full days (~100M+ reading intervals, ~1B+ metrics)
  • Dataset generated with the Time Series Benchmarking Suite, using the cpu-only use case
Details of the DevOps dataset used to benchmark both write (ingest) and read (query) performance for MongoDB and TimescaleDB.
  • Batch size: Inserts were made using a batch size of 10,000

Additional database configurations:

  • For TimescaleDB, we set the chunk size to 12 hours, resulting in 8 total chunks for the 4-day dataset (learn more about chunks and chunk time here).
  • We also enabled native compression on TimescaleDB. We compressed all data older than 12 hours, resulting in 7 compressed chunks and 1 (data from the last 12 hours) uncompressed chunk. This configuration enables greater query efficiency and the ability to handle out-of-order data since raw data is kept for recent time periods and older data is compressed (see our compression docs for more). Here are the compression parameters used: we segmented by the tags_id and hostname columns and ordered by time descending and usage_user columns.

Write Performance and Disk Usage

Insert rate comparison: The simplicity of Mongo-naive led it to outperform Mongo-recommended by 54%, but TimescaleDB outperforms both methods - achieving 169% (vs. Mongo-naive) and 260% (Mongo-recommended) of the performance of MongoDB.

Because NoSQL databases typically trade off some guarantees of relational databases, one might expect MongoDB to achieve better write performance/throughput, making it an inviting choice for ingesting time-series data, which can be at a rate of thousands of readings per second (or more).

While it’s true that plain PostgreSQL does tend to lose write throughput as the dataset size grows (See our PostgreSQL vs. TimescaleDB benchmark results for more), TimescaleDB’s unique chunking mechanism keeps write performance high (learn more about chunks).

As a result, TimescaleDB outperformed both MongoDB configurations by a significant margin, achieving 169 % of the insert performance of Mongo-naive and 260 % of the insert performance of Mongo-recommended.

💡
Note: We achieved these insert numbers using 32 concurrent clients inserting data into each setup. If you want to replicate this experiment, ensure your client machine has enough cores to execute this parallelism.

Using parallel writes is a general best practice for improving write performance, as INSERTs and COPYs are generally executed as a single transaction and thus run in a single-threaded fashion (as is the case with PostgreSQL and TimescaleDB). For more guidance and details on achieving high ingest rates for time-series data, see our 13 ways to improve TimescaleDB and PostgreSQL write performance.
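As a sketch of that pattern, here is one way to fan batched writes out across workers (illustrative Python only: `insert_batch` is a stand-in for whatever driver call your database uses, while the batch size and worker count mirror our benchmark settings):

```python
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 10_000   # rows per batched INSERT/COPY, as in our benchmark
NUM_WORKERS = 32      # one concurrent client per core on our test machine

def batches(rows, size=BATCH_SIZE):
    """Split rows into fixed-size batches, each written in one transaction."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

def insert_batch(batch):
    """Placeholder: issue one batched write (e.g., via psycopg2 or pymongo)."""
    return len(batch)

def parallel_insert(rows):
    """Fan batches out across workers so ingest isn't single-threaded."""
    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
        return sum(pool.map(insert_batch, batches(rows)))

inserted = parallel_insert([{"time": i, "usage_user": 0.0} for i in range(25_000)])
```

Each batch still runs as a single transaction; the parallelism comes from many such transactions being in flight at once.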

All three setups achieve write performance of over 1 million metrics per second. However, only TimescaleDB seems suitable for performance-critical, time-series use cases, as it achieved very high ingest rates on the high cardinality benchmark dataset, with an average insert rate of 2.7 million metrics per second. The sluggishness of the Mongo-recommended method’s ingest rate is likely due to the extra cost of occasionally creating new, larger documents (e.g. when a new hour or device is encountered).

Given that time-series data piles up quickly and the costs associated with storing large amounts of data, it’s also worth exploring the disk storage footprint of each database configuration:

Timescale uses 10x less disk space to store the same number of metrics than both MongoDB configurations.

TimescaleDB uses 9 % of the disk space of the Mongo-naive method and 8 % of the disk space of the Mongo-recommended method. This is thanks to TimescaleDB’s novel hybrid row/columnar storage approach, which uses Gorilla compression for floats; delta-of-delta and simple-8b with run-length encoding for timestamps and integer-like types; whole-row dictionary compression for a few repeating values, with LZ compression on top; and LZ-based array compression for all other types.

(Interested readers can learn more about how TimescaleDB’s native compression works, as well as this explanation of time-series compression algorithms and how they work.)
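To see why delta-of-delta encoding compresses regular timestamps so well, consider this toy Python version (an illustration of the idea only, not TimescaleDB's actual implementation):

```python
def delta_of_delta(timestamps):
    """Encode as: first value, first delta, then the deltas of the deltas."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [b - a for a, b in zip(deltas, deltas[1:])]
    return timestamps[0], deltas[0], dods

# Readings every 10 seconds (nanosecond timestamps), as in our dataset:
ts = [1451606420_000000000 + i * 10_000000000 for i in range(6)]
first, first_delta, dods = delta_of_delta(ts)
# Regular intervals collapse to a run of zeros, which run-length encoding
# (as in simple-8b) then stores in just a few bits per timestamp.
```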

Moreover, for the ~1 billion metric benchmark dataset, the Mongo-recommended method used more disk space than both the Mongo-naive method and TimescaleDB, on top of its slower insert performance relative to Mongo-naive.

That said, we recommend doing an honest analysis of your insert needs. If your insert performance is far below these benchmarks (e.g., if it is 2,000 rows/second), then insert performance will not be your bottleneck, and this comparison becomes moot.

Write Performance and Disk Usage Summary

  • TimescaleDB outperforms both methods of storing time-series data in MongoDB by between 69 % (vs. Mongo-naive) and 160 % (vs. Mongo-recommended).
  • Mongo-naive shows better write performance (154 % of the ingest rate) and uses less disk space (85 % as much disk) than Mongo-recommended.
  • TimescaleDB uses 10x less disk space than MongoDB, using 9 % of the storage footprint of Mongo-naive and 8 % of the storage footprint of Mongo-recommended.
  • If your insert performance is far below these benchmarks (e.g., if it is 2,000 rows/second), then insert performance will not be your bottleneck.

Query Performance

Before we compared MongoDB against TimescaleDB, we first evaluated the query performance between the two MongoDB methods.

By this point, Mongo-naive had demonstrated better write performance with simpler implementation and lower disk usage, but we suspected that Mongo-recommended would outperform Mongo-naive for query performance, justifying its recommendation by the MongoDB team and users.

And, if there were a clear winner between the two methods for simple queries, we could save ourselves some time by not implementing our full query set against both methods.

So, we first compared the two MongoDB methods using three “single rollup” queries (single groupby on time), one “double rollup” query (double groupby on time and device hostname), and one “aggregate” query (max reading over a time period). All queries were run with 1000 different parameter permutations (i.e., with random time ranges and hosts), from which we recorded the mean time for each database respectively.

Latencies in this chart are all shown as milliseconds, with an additional column showing the relative performance of MongoDB-recommended compared to MongoDB-naive.

Here are the results:

Mongo-recommended outperforms Mongo-naive by 5x-68x, depending on the query

Mongo-recommended outperforms Mongo-naive in all of the queries above, demonstrating 550-6,800 % of the performance of Mongo-naive. This performance difference makes it clear why it is the recommended method for storing time-series data in MongoDB, despite having slower ingest performance (1.5x slower) and higher disk usage (16 % more) than Mongo-naive.

Given this significant difference in query performance, compared to the modest difference in write performance and disk storage footprint, we decided that this approach was probably the best setup for storing time-series data in MongoDB. Thus, for the remainder of this post and our analysis, we use the “Mongo-recommended” setup whenever benchmarking MongoDB.

(Note to readers: If all you want to get out of this post is the best way to store time-series data in MongoDB, here’s your answer: use the “Mongo-recommended” method. If ingest performance is more important to you than query performance, use “Mongo-naive.” But, if you want even better performance than either Mongo method, you may want to continue reading.)

Part 2: TimescaleDB vs. MongoDB

Having settled on which MongoDB method is best, let’s evaluate MongoDB vs. TimescaleDB for querying time-series data.

The best way to benchmark read latency is to do it with the actual queries you plan to execute. For this case, we use a broad set of queries to mimic the most common query patterns. The results shown below are the average from 1000 queries with different parameter combinations (time range, device ID) for each query type.

Latencies in this chart are all shown as milliseconds, with an additional column showing the relative performance of TimescaleDB compared to MongoDB (highlighted in orange when TimescaleDB is faster, in green when MongoDB is faster).

Results of benchmarking query performance between MongoDB and TimescaleDB, with TimescaleDB showing 200 % to 5,400 % of the performance of MongoDB

Let’s unpack the results for each query type below:

Simple rollups

For simple rollups (i.e., groupbys), when aggregating one metric across a single host for 1 or 12 hours or multiple metrics across one or multiple hosts (either for 1 hour or 12 hours), TimescaleDB performs comparably to or outperforms MongoDB.

In particular, when aggregating one or more metrics on a single device for a single hour, the two databases show fairly equal performance. However, when aggregating one or more metrics across multiple devices for multiple hours, TimescaleDB shows between 208 % and 302 % of the performance of MongoDB.

Aggregates

For calculating a simple aggregate (i.e., finding the maximum value) for metrics from one or more devices, TimescaleDB outperforms MongoDB. In our benchmark, TimescaleDB demonstrates 396 % of the performance of MongoDB when aggregating 8 metrics across 4,000 devices and 195 % when aggregating one metric across 4,000 devices.

Double rollups

While single rollups and aggregates are somewhat comparable across the two databases, other more complex queries are not. For double rollups aggregating metrics by time and another dimension (e.g., GROUP BY time, deviceId), TimescaleDB shows large gains.

When aggregating one metric per device, per hour, for some 24-hour window, TimescaleDB showed 1,507 % (or 15x) the performance of MongoDB.

When aggregating 10 metrics, Timescale showed 1,327 % (or 13x) of the performance of MongoDB.

The largest performance difference came when aggregating 5 metrics per device, per hour, for some 24-hour window, where TimescaleDB achieved 2,149 % of the performance of MongoDB, or 21x.

Notice that these queries take on the order of tens of seconds (rather than milliseconds), so a 13-21x performance gain is very noticeable.

Complex queries

For complex queries that go beyond simple rollups or aggregates, the comparison is much more clear-cut: TimescaleDB vastly outperforms MongoDB here (in some cases, more than 50 times faster).

Lastpoint

The first complex query (‘lastpoint’) finds the latest reading for every device in the dataset. This query type is commonly used in IoT and DevOps for analysis and monitoring.

While our dataset has all devices reporting at consistent intervals, this query can be troublesome to implement in the general case because it could be that some devices have not reported in quite a long time, potentially causing a lot of documents (MongoDB) or rows (TimescaleDB) to be scanned.

We are able to do some clever query construction in both to get a list of distinct devices, which allows both setups to stop searching when every device has a point associated with it.

We utilize JOINs in both systems. However, while MongoDB does support JOINs, they are not as natural to work with or as full-featured as they are in relational databases like TimescaleDB. For the more complex lastpoint query, TimescaleDB shows 5,399 % (or 54x) of the performance of MongoDB.

Groupby-Orderby-Limit

The second complex query (‘groupby-orderby-limit’) does a single rollup on time to get the MAX reading of a CPU metric on a per-minute basis for the last 5 intervals for which there are readings before a specified end time. This is another type of query common in DevOps and IT monitoring workloads.

This is tricky because those last 5 intervals could be the 5 minutes before the end time, or if there is no data for some minute periods (i.e., “gaps”), they could be spread out, potentially needing a search from the beginning just to find all 5.

In fact, a full table scan was the query strategy needed for MongoDB, while TimescaleDB has intelligent planning to utilize its indexes. This resulted in TimescaleDB showing 2,108 % (or 21x) of the performance of MongoDB.

Read performance summary:

  • For simple queries and aggregates, TimescaleDB generally outperforms MongoDB, performing comparably to or up to 4x faster than MongoDB.
  • For double rollups, TimescaleDB vastly outperforms MongoDB, showing 13-21x performance improvement.
  • For complex queries, which are commonly used to analyze and monitor devices for DevOps and IoT use cases, TimescaleDB again vastly outperforms MongoDB, showing up to 53x better performance.
  • If query performance is your most important requirement, TimescaleDB is the clear choice for both simple and complex queries.

NoSQL vs. SQL: Query Language Comparison

Performance is not the only aspect to consider when deciding which database to use to store time-series data. Developer experience and implementation complexity are also important factors.

With that in mind, we compared the query language differences between TimescaleDB and MongoDB, using the complex query “groupby-orderby-limit” from our performance analysis.

Once again, we may be biased, but we found SQL, with 5 lines of code, much simpler than MongoDB’s document format, with 72 lines of code. This is a crucial criterion for sustainable software development since humans create, maintain, and use these systems at the end of the day. This is another point in favor of SQL in the ongoing NoSQL vs. SQL debate.

TimescaleDB query (SQL):

SELECT date_trunc('minute', time) AS minute, max(usage_user)
FROM cpu
WHERE time < '2016-01-01 19:47:52.646325 -7:00'
GROUP BY minute
ORDER BY minute DESC
LIMIT 5

And here’s that same query expressed in MongoDB.

MongoDB query:

[
  {
    $match: {
      measurement: "cpu",
      key_id: {
        $in: ["20160101_00", "20160101_01", "20160101_02", "20160101_03", "20160101_04", "20160101_05", "20160101_06", "20160101_07", "20160101_08", "20160101_09", "20160101_10", "20160101_11", "20160101_12", "20160101_13", "20160101_14", "20160101_15", "20160101_16", "20160101_17", "20160101_18", "20160101_19", "20160101_20", "20160101_21", "20160101_22", "20160101_23", "20160102_00", "20160102_01", "20160102_02", "20160102_03", "20160102_04", "20160102_05", "20160102_06", "20160102_07", "20160102_08", "20160102_09", "20160102_10", "20160102_11", "20160102_12", "20160102_13", "20160102_14", "20160102_15", "20160102_16", "20160102_17", "20160102_18", "20160102_19", "20160102_20", "20160102_21", "20160102_22", "20160102_23", "20160103_00", "20160103_01", "20160103_02", "20160103_03", "20160103_04", "20160103_05", "20160103_06", "20160103_07", "20160103_08", "20160103_09", "20160103_10", "20160103_11", "20160103_12", "20160103_13"]
      }
    }
  },
  {
    $project: {
      _id: 0,
      key_id: 1,
      tags: "$tags.hostname",
      events: 1
    }
  },
  {$unwind: "$events"},
  {
    $project: {
      key_id: 1,
      tags: 1,
      events: {
        $filter: {
          input: "$events",
          as: "event",
          cond: {
            $and: [
              {$gte: ["$$event.timestamp_ns", 1451606400000000000]},
              {$lt: ["$$event.timestamp_ns", 1451827606646325489]}
            ]
          }
        }
      }
    }
  },
  {$unwind: "$events"},
  {
    $project: {
      time_bucket: {
        $subtract: [
          "$events.timestamp_ns",
          {$mod: ["$events.timestamp_ns", 60000000000]}
        ]
      },
      field: "$events.usage_user"
    }
  },
  {
    $group: {
      _id: "$time_bucket",
      max_value: {$max: "$field"}
    }
  },
  {$sort: {_id: -1}},
  {$limit: 5}
]

Two things that jump out besides the verbose JSON syntax:

  • First, to efficiently stop the query, the client running the query will have to compute the subset of documents to look in, which creates the lengthy list in the first $match aggregator above.
  • Second, to unpack the 60x60 matrices in each document, the $unwind/$project/$unwind pattern is needed to efficiently expand those matrices while removing empty time periods. This pattern is actually needed for almost all of the queries we looked at here, which makes all the queries verbose and potentially daunting to debug.
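The minute-bucketing at the heart of the pipeline, the $subtract/$mod arithmetic followed by $group, $sort, and $limit, can be mirrored in a few lines of Python (our own illustration of the same logic, operating on already-unwound (timestamp, value) pairs):

```python
NS_PER_MINUTE = 60_000_000_000  # the divisor used in the $mod stage above

def groupby_orderby_limit(events, limit=5):
    """Max usage_user per minute bucket, newest buckets first, top `limit`."""
    buckets = {}
    for ts_ns, usage_user in events:
        bucket = ts_ns - ts_ns % NS_PER_MINUTE   # the $subtract/$mod arithmetic
        buckets[bucket] = max(buckets.get(bucket, float("-inf")), usage_user)
    # Equivalent of {$sort: {_id: -1}} then {$limit: 5}:
    return sorted(buckets.items(), reverse=True)[:limit]

# 20 minutes of readings at 10-second intervals, with synthetic values:
events = [(1451606400_000000000 + i * 10_000000000, float(i % 7)) for i in range(120)]
top5 = groupby_orderby_limit(events)
```

Seeing the logic this compactly also highlights how much of the 72-line pipeline is scaffolding (matching candidate documents, unwinding matrices) rather than the query itself.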

In this example, the SQL query is far shorter and simpler, making it easier to comprehend and debug. This is not the only advantage of using SQL (and TimescaleDB) to query time-series data: SQL has a rich tradition and history, including familiarity among millions of developers and a vibrant ecosystem of tutorials, training, and community leaders.

In short, as the third-most commonly used programming language among developers, choosing SQL means you’re never alone.

Most popular Programming, Scripting, and Markup Languages. Source: 2020 Stack Overflow Developer Survey

Conclusion

TimescaleDB delivers up to 2.6x the insert performance of MongoDB, up to 54x faster queries, and a better developer experience

Is it surprising that TimescaleDB, a database purpose-built for time-series data, outperforms MongoDB, a general-purpose document store, when it comes to time-series data?

Not necessarily—but there are enough blogs, talks, and other material out there about using MongoDB for time-series data that we felt we needed to do an evaluation.

We understand that for many users, MongoDB offers the benefit of being easy to learn and quick to set up. Yet, for time-series data, setting up MongoDB to deliver the performance your workloads require is not simple, as we’ve shown in our analysis, and requires careful thought about your schema design.

If you simply dump each reading into a new document, you’re in for longer and longer wait times as more data accumulates and you want to query it. Moreover, implementing MongoDB's recommended time-series approach requires writing client-side code and using fairly verbose queries.

We may be biased, but we'd contend that rather than doing all of the above and forcing MongoDB—a general-purpose document store—to behave like a time-series database, opting for a database built specifically for time-series data is a better use of time and resources.

And, as we've shown, when it comes to time-series workloads, TimescaleDB, a purpose-built time-series database, delivers significantly better results on every dimension. If you're weighing your options, based on our analysis, TimescaleDB is the clear choice.

P.S. If you’re looking for more comparisons of database performance for time-series data, read our studies comparing Amazon Timestream vs. TimescaleDB and InfluxDB vs. TimescaleDB.

Want to Learn More About TimescaleDB?

Create a free account to get started with a fully managed TimescaleDB instance (100 % free for 30 days).

Want to host TimescaleDB yourself? Visit our GitHub to learn more about options, get installation instructions, and more (and, as always, ⭐️ are appreciated!)

Join our Slack community to ask questions, get advice, and connect with other developers (our co-founders, engineers, and passionate community members are active on all channels).
