Simplified Prometheus monitoring for your entire organization with Promscale

Computer infrastructure is rapidly moving to cloud-native architectures and Kubernetes. Observability is key to successfully operating systems in these highly dynamic environments, and Prometheus has quickly become the de facto standard for collecting and analyzing metrics. Engineers love its simple pull-based model with target auto-discovery for collecting multi-dimensional metrics, its powerful query language (PromQL), and its vibrant open-source community, which has already built hundreds of integrations with other tools.

We love Prometheus for the above reasons and more – and it’s a key component of the observability stack we use to ensure the reliability of Timescale Forge, the modern, cloud-native relational database platform for time-series built atop TimescaleDB.

We are committed to making Prometheus the foundation of observability for large-scale systems everywhere.

To this end, last October, we announced Promscale, an open-source, scalable, and robust storage system for Prometheus metric data. Built on top of the rock-solid foundation of TimescaleDB and PostgreSQL, Promscale inherits all of the capabilities of TimescaleDB and PostgreSQL, including full SQL support, advanced functionality for time-series analysis, and a robust ecosystem of tools, integrations, visualizations, and much more. Promscale is also 100% PromQL-compliant, enabling developers to be immediately productive if they prefer to use Prometheus’ built-in query language.

Since launching, we’ve found that developers love Promscale for two primary reasons:

  1. Promscale stores your data in TimescaleDB, a scalable, operationally mature, battle-tested time-series database built on top of PostgreSQL. As a result, users get built-in access to advanced database capabilities, like replication, data compression, hot backups, continuous aggregates, and many more.
  2. Promscale lets you perform in-depth data analysis using SQL (in addition to PromQL). Users combine PromQL’s benefits with SQL, or simply use SQL, one of the most popular and well-known languages in the world.

As adoption of Prometheus in an organization grows, many teams set up individual Prometheus instances (let’s call them tenants) to monitor each Kubernetes cluster. Sometimes this happens organically, where one team is unaware of another team’s use of Prometheus, or it may happen by design, in order to obtain scale and robustness. No matter the reason, multi-tenant Prometheus introduces data silos, since Prometheus metrics across an organization now live in different data stores, and operational complexity, because all those data stores now need to be managed.

To simplify operations and get a holistic view of their systems, organizations start to look for a centralized data store (i.e., one place for all of the metrics from all of the different tenants) – or build one themselves. Given its scalability, robustness, and advanced querying capabilities, Promscale is a natural fit for scenarios where users want to compare and contrast resource utilization, uptime, and performance across the entire organization.

Today, we’re excited to announce multi-tenancy support in Promscale, available immediately 🔥. We’ve built Promscale as a centralized store for metrics, and now developers can enjoy the same operational maturity and query flexibility when storing metrics across their entire organization.

With today’s release, Promscale now includes:

  • Operational maturity: Centralizing data makes avoiding downtime even more critical. Building on top of PostgreSQL and TimescaleDB enables you to collect all of your important metrics with confidence.
  • Advanced data analysis: Use PromQL and SQL (or both) to build queries that give you insight into the performance of your entire system.
  • Faster and more robust cross-tenant queries: Effortlessly decorate your metrics with tenant-specific information, so that you can still easily query data by tenant, but also query data across tenants and across the organization.
  • Flexible control and permissions: Customize which users can access metrics for specific tenants to ensure only authorized users see a tenant’s data. In addition to being useful for increased security, permissions make it easier for users to find the data they need vs. sorting through all organizational metrics.

Read on for a primer on why we’ve built Promscale support for Prometheus multi-tenancy, the scenarios and challenges it solves for, the types of queries it frees you to make, as well as how to set up multi-tenancy for your team or organization.

To get started right away, skip to the “Get started with Promscale” section at the end of this post.

And, if these are the types of challenges you’d like to help solve, we are hiring (see all roles)!

Introducing multi-tenancy support in Promscale

To better understand the problem with multiple tenants described above, let’s look at a common scenario that occurs in medium-to-large organizations:

The platform team of SomeCompany provides Kubernetes infrastructure to a number of development teams inside the company. Each team’s Kubernetes cluster is monitored with its own Prometheus instance. The platform team has agreed to internal SLAs with each of those development teams, and the platform team is responsible for ensuring each Kubernetes cluster is healthy and performs as expected.

The platform team is also responsible for tracking and billing each team’s cost center for the fees associated with running their respective clusters. To do this, the platform team runs a service inside each Kubernetes cluster that generates billing information, stores the data as metrics in Prometheus, and then uses these billing metrics to measure and bill the teams on a per-cluster basis.

To simplify operations and perform analysis across all clusters, the platform team wants to store all Prometheus metrics in a single (central) store. SomeCompany’s development teams have requested access to the telemetry and billing information for the clusters they own, but they don’t want to (and shouldn’t!) see data from other teams for various reasons. For example, individual development teams need to make sure the queries they run are always scoped to their data, so they don’t mistakenly query data from other tenants as they troubleshoot problems - which could lead to delays or faulty fixes - or analyze customer behavior - which could lead to making decisions that negatively impact the user experience for their product.

This is a diagram of the architecture the platform team is envisioning:

Figure: Example architecture diagram for a multi-tenant Prometheus use case with a centralized data store.

How can SomeCompany’s platform team solve for all of those use cases?

The simplest option would be to use Prometheus and its default support for external labels. This would work to capture metrics collected in various Prometheus tenants, but it wouldn’t provide any security or access control guarantees without a custom solution bolted on top. (Even something as simple as ensuring that all incoming data includes a tenant’s label would require custom integration work!)  

Or, they could use Promscale with multi-tenancy enabled. 😎

Promscale’s multi-tenancy support includes the ability to ingest metrics from different Prometheus tenants, label them with tenant information, query by tenant label, and restrict the tenants that a particular Promscale instance can ingest and query. Thus, SomeCompany’s platform team would be able to query data and billing metrics for specific tenants and across all tenants, while individual development teams’ queries would be “restricted” to the metrics from their tenant.

How Promscale multi-tenancy support works and core features

Before we dive into specific steps, let’s look at a brief overview of how Promscale multi-tenant support works, basic architectural components, and what it allows developers to accomplish.

To begin, each Prometheus instance needs to be configured to indicate which tenant it belongs to when setting up remote_write. This will ensure that the correct tenant is attributed when sending data to a Promscale Connector.

Once configured and upon receiving metrics, the Promscale Connector ensures each metric is decorated with the respective __tenant__ label so that the data can be differentiated by the tenant name. This information then makes it easy to use PromQL (or SQL) to write queries for specific tenants or across multiple tenants.

To limit what data is available to a set of users, you set up an additional Promscale Connector and configure it to only allow query access to a specific tenant or group of tenants. Finally, you only allow those users to query data via that Promscale Connector. This is typically achieved by configuring a Prometheus data source for the Promscale Connector in Grafana, and then setting up the appropriate data source permissions.

Core features and capabilities of multi-tenancy in Promscale include:

  1. Easy multi-tenancy configuration, using headers or Prometheus external labels, giving users flexibility in how they configure their systems.
  2. Cross-tenant queries in PromQL and SQL, streamlining how users query and analyze data across the entire organization.
  3. Support for combining data with and without tenant information in the same store, enabling users to evolve from a single-tenant to multi-tenant design, as well as support mixed deployments.
  4. Restricting the set of valid tenants a Promscale instance can ingest or query, enforcing access control.

Configuring Prometheus to send tenant information to Promscale

For multi-tenancy to work, you have to configure Prometheus to send tenant information to Promscale, so that Promscale can identify which tenant the data originated from.

There are two ways to specify the tenant information, both done through the Prometheus configuration file:

  1. Pass a __tenant__ label with all metrics (recommended).
  2. Use the TENANT HTTP header.

See below for steps to configure Prometheus using either method.

Pass a __tenant__ label with all metrics (recommended)

In Prometheus, you can leverage external labels for this. Prometheus will automatically add the __tenant__ label to all metrics before they’re sent to Promscale.

global:
 scrape_interval:    5s
 evaluation_interval: 30s
 external_labels:
   __tenant__: tenant-A

Use the TENANT HTTP header

The Prometheus configuration file allows you to set any number of HTTP headers to be sent with every remote_write request to Promscale.

remote_write:
  - url: http://localhost:9201/write
    headers:
      TENANT: team-1

Once set up, the Promscale Connector will retrieve the value of the TENANT HTTP header. If that tenant is authorized in that Promscale Connector, Promscale will ingest all the metrics in the remote_write request and decorate them by appending a __tenant__ label set to the value of the TENANT header.
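If you manage many Prometheus instances, you may prefer to generate these configuration fragments rather than edit them by hand. The following Python sketch renders either of the two variants shown above. The function name render_prometheus_config and its parameters are our own illustrative choices, and the sketch assumes tenant names are plain strings that need no YAML quoting.

```python
def render_prometheus_config(tenant, remote_write_url, use_header=False):
    """Render a Prometheus config fragment that tags data with a tenant.

    Hypothetical helper: only the YAML keys and the __tenant__ / TENANT
    names come from the examples in this post.
    """
    lines = []
    if not use_header:
        # Recommended method: an external label added to every metric.
        lines += ["global:", "  external_labels:", f"    __tenant__: {tenant}"]
    lines += ["remote_write:", f"  - url: {remote_write_url}"]
    if use_header:
        # Alternative method: a TENANT header on every remote_write request.
        lines += ["    headers:", f"      TENANT: {tenant}"]
    return "\n".join(lines) + "\n"


label_config = render_prometheus_config("tenant-A", "http://localhost:9201/write")
header_config = render_prometheus_config(
    "team-1", "http://localhost:9201/write", use_header=True
)
```

Printing label_config or header_config gives a fragment you can merge into each tenant's Prometheus configuration file.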

Enabling multi-tenancy in Promscale

You can enable multi-tenancy in Promscale by setting the PROMSCALE_MULTI_TENANCY=true environment variable when starting the Promscale Connector, or by passing the -multi-tenancy parameter.

With this, Promscale will accept data from all tenants, both for write and read, and will add the corresponding __tenant__ label to incoming metrics.

If you want Promscale to allow ingest and query for data only from specific tenants, pass those tenant names separated by commas via the -multi-tenancy-valid-tenants parameter or the PROMSCALE_MULTI_TENANCY_VALID_TENANTS environment variable.

For example, if SomeCompany wants to allow Promscale to ingest and query data only for development teams 1 and 2, they’d set parameters like so:

-multi-tenancy-valid-tenants=team-1,team-2

With that setting, only data corresponding to team-1 or team-2 will be available, and Promscale will ignore and report all other data as unauthorized.
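One way to picture the overall setup: the platform team runs a connector that accepts every tenant, while each development team queries through its own connector restricted to its tenant. The Python sketch below builds the environment variables for each case. Only the two variable names come from this post; the helper connector_env is hypothetical, and how you pass the environment to the connector depends on your deployment.

```python
def connector_env(valid_tenants=None):
    """Environment variables for a multi-tenant Promscale Connector.

    Hypothetical helper. Leaving valid_tenants empty keeps the default
    behavior, in which all tenants are accepted.
    """
    env = {"PROMSCALE_MULTI_TENANCY": "true"}
    if valid_tenants:
        # Tenant names are passed as a comma-separated list.
        env["PROMSCALE_MULTI_TENANCY_VALID_TENANTS"] = ",".join(valid_tenants)
    return env


platform_env = connector_env()                    # sees every tenant
team_env = connector_env(["team-1", "team-2"])    # restricted connector
```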

Note: By default, the -multi-tenancy-valid-tenants parameter has the value allow-all, which allows all incoming tenants to be ingested and queried.

When multi-tenancy is enabled, Promscale drops all data from a Prometheus instance that isn’t configured to send tenant information. You can instruct Promscale to ingest that data as well by passing the -multi-tenancy-allow-non-tenants parameter or the PROMSCALE_MULTI_TENANCY_ALLOW_NON_TENANTS=true environment variable when launching the Promscale Connector.
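To make the interaction between these settings easier to follow, here is a simplified Python model of the rules described in this section. It is a sketch for reasoning about the behavior, not Promscale's actual implementation: the function ingest_sample is hypothetical, the tenant argument stands for the tenant name however it was supplied (label or header), and None for valid_tenants models the default allow-all value.

```python
from typing import Dict, Optional, Set


def ingest_sample(
    labels: Dict[str, str],
    tenant: Optional[str],
    valid_tenants: Optional[Set[str]] = None,
    allow_non_tenants: bool = False,
) -> Optional[Dict[str, str]]:
    """Return the labels that would be stored, or None if the data is rejected."""
    if tenant is None:
        # No tenant information: dropped unless non-tenant data is allowed.
        return dict(labels) if allow_non_tenants else None
    if valid_tenants is not None and tenant not in valid_tenants:
        # Tenant is not in the list of valid tenants: treated as unauthorized.
        return None
    decorated = dict(labels)
    decorated["__tenant__"] = tenant  # decorate with the tenant label
    return decorated


allowed = {"team-1", "team-2"}
accepted = ingest_sample({"job": "api"}, "team-1", allowed)
rejected = ingest_sample({"job": "api"}, "team-3", allowed)
untagged = ingest_sample({"job": "api"}, None, allowed, allow_non_tenants=True)
```

In this model, accepted carries the extra __tenant__ label, rejected is None, and untagged is stored without any tenant label.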

Querying multi-tenant data

Using PromQL

As explained in the previous section, multi-tenant data will include a special, defined __tenant__ label.

To filter by tenant name in PromQL, simply apply the tenant name(s) using the __tenant__ label matcher for your query.

For example, if SomeCompany wants to find the number of CPU-hours team 1 used and will be billed for, they’d run the following query:

cpu_hours_total{__tenant__="team-1"}

This query returns all the time-series data with the metric cpu_hours_total from team 1.

From there, SomeCompany can run a cross-tenant query against two teams, team 1 and team 2, to calculate total cpu hours used across both teams:

sum(cpu_hours_total{__tenant__=~"team-1|team-2"})

This type of query becomes particularly useful if you have multiple teams in your organization and you want aggregate statistics grouped by tenant.

For all queries, the -multi-tenancy-valid-tenants flag will be respected and data will only be returned from allowed tenants whether or not the appropriate matchers are in the query.
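If you issue these queries programmatically, a small helper can build the selector and the request URL. The Python sketch below assumes the Promscale Connector serves the Prometheus-compatible query endpoint at /api/v1/query on the same address used for remote_write earlier in this post, and that tenant names contain only letters, digits, underscores, and hyphens. The names tenant_selector and query_url are hypothetical.

```python
import re
from urllib.parse import urlencode

# Assumption: tenant names use only these characters, so they are safe
# to place inside a PromQL string or regex matcher without escaping.
_TENANT_NAME = re.compile(r"^[A-Za-z0-9_-]+$")


def tenant_selector(metric, tenants):
    """Build a PromQL selector that filters a metric by one or more tenants."""
    if not tenants:
        raise ValueError("at least one tenant is required")
    for name in tenants:
        if not _TENANT_NAME.match(name):
            raise ValueError(f"unsupported tenant name: {name!r}")
    if len(tenants) == 1:
        return f'{metric}{{__tenant__="{tenants[0]}"}}'
    pattern = "|".join(tenants)
    return f'{metric}{{__tenant__=~"{pattern}"}}'


def query_url(base_url, promql):
    """Build an instant-query URL (endpoint path is an assumption)."""
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": promql})


single = tenant_selector("cpu_hours_total", ["team-1"])
cross = "sum(" + tenant_selector("cpu_hours_total", ["team-1", "team-2"]) + ")"
url = query_url("http://localhost:9201", cross)
```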

Check our documentation for more PromQL query examples.

Using SQL

As we saw above, PromQL offers great ergonomics for querying observability data.

With Promscale, you can also use SQL to do more sophisticated analysis and to correlate your Prometheus metrics with other relational data stored in the underlying PostgreSQL database.

For example, suppose you wanted the SQL equivalent of cpu_hours_total{__tenant__="team-1"} (it returns the same data points, but formatted differently):

SELECT
   *
FROM
   prom_metric.cpu_hours_total
WHERE
   labels ? ('__tenant__' == 'team-1');

You can also perform queries in SQL that are not possible in PromQL.

In PromQL all queries have to aggregate data within each time-series before doing other aggregations. But, when comparing results between tenants, this can skew results by weighing tenants unevenly by the number of series present. For example, when comparing the p95 latencies of HTTP requests by tenant, the grouping of requests by series is irrelevant since you want to compare all the requests of one tenant with another. The SQL query for that is shown below (note that in our views, we expose identifiers for label values using the “_id” suffix, thus the __tenant__ label’s id is the somewhat strange __tenant___id).

SELECT
   val(__tenant___id) AS tenant_name, -- val() looks up the tenant name from the id
   percentile_cont(0.95) WITHIN GROUP (ORDER BY value) AS p95
FROM
   prom_metric.http_requests_total
GROUP BY __tenant___id;     -- grouping by the id is much more efficient than using the name
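To see why aggregating within each series first can skew a per-tenant comparison, consider a small worked example in Python. The latency values are made up for illustration, and percentile_cont here is our own linear-interpolation helper written in the style of the SQL function of the same name. It is not how PromQL computes quantiles; it only contrasts "aggregate per series, then combine" with "pool all of the tenant's requests".

```python
import math
from statistics import mean


def percentile_cont(values, fraction):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("need at least one value")
    position = fraction * (len(ordered) - 1)
    lower, upper = math.floor(position), math.ceil(position)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


# Hypothetical request latencies (ms) for one tenant, split across two series.
busy_series = [100] * 96   # 96 fast requests
quiet_series = [900] * 4   # 4 slow requests

# Per-series first: each series counts equally, however few requests it holds.
per_series_then_averaged = mean(
    [percentile_cont(busy_series, 0.95), percentile_cont(quiet_series, 0.95)]
)

# Pooled: every request counts equally, as in the SQL query above.
pooled = percentile_cont(busy_series + quiet_series, 0.95)
```

Here the per-series approach reports 500 ms because the quiet series gets half the weight, while the pooled p95 is 100 ms because only 4 of the 100 requests were slow.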

It’s also possible to do even more sophisticated analysis, joining organizations’ relational data with Prometheus metrics! Joining time-series data with relational data has all manner of practical applications and is one of TimescaleDB and Promscale’s superpowers.

For example, SomeCompany team 1 could see the total number of CPU hours used by their customers, broken down by company size, over the last 30 days. (This example assumes the development team’s cpu_hours_total metric includes a customer_name label with the name of the customer, and that there is a customers table in PostgreSQL that includes the customer name and company size.)

SELECT
   customers.company_size, sum(value)
FROM
   prom_metric.cpu_hours_total
JOIN customers ON customers.name = val(customer_name_id)
WHERE time > now() - INTERVAL '30 day'
GROUP BY
   customers.company_size;
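If you want to experiment with the shape of this join without a Promscale instance, the following Python sketch reproduces it with the standard library's sqlite3 module. Everything in it is a stand-in: the two tables and their rows are made up, the customer name is stored directly in the metric table instead of being looked up with val(), and the 30-day time filter is omitted so the result is deterministic.

```python
import sqlite3


def cpu_hours_by_company_size():
    """Join hypothetical metric rows with a hypothetical customers table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE cpu_hours_total (time TEXT, value REAL, customer_name TEXT);
        CREATE TABLE customers (name TEXT PRIMARY KEY, company_size TEXT);
        INSERT INTO customers VALUES ('acme', 'large'), ('initech', 'small');
        INSERT INTO cpu_hours_total VALUES
            ('2021-05-01', 10.0, 'acme'),
            ('2021-05-02', 20.0, 'acme'),
            ('2021-05-01', 5.0, 'initech');
        """
    )
    rows = conn.execute(
        """
        SELECT customers.company_size, sum(value)
        FROM cpu_hours_total
        JOIN customers ON customers.name = cpu_hours_total.customer_name
        GROUP BY customers.company_size
        ORDER BY customers.company_size
        """
    ).fetchall()
    conn.close()
    return rows


result = cpu_hours_by_company_size()
```

The result contains one row per company size with the summed metric values, which is the same shape the Promscale query above returns.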

Joining metric data with relational data can be used in a variety of use-cases: to analyze how infrastructure costs correlate with usage and/or profits; to compare performance with hardware specifications of the machines or with configuration parameters; to use inventory data to do predictive maintenance (on disk drives, for instance); or to find underutilized or overutilized resources.

Conclusion

In summary, we built Promscale -- an operationally mature long-term data store for Prometheus metrics -- to solve common challenges that we, and many other organizations, face.

With the addition of multi-tenancy, we’re making it easier to centralize Prometheus data across large, diverse organizations, while tapping into PostgreSQL’s operational maturity and enabling developers to use PromQL and SQL to perform even more advanced queries.

Centralizing Prometheus data unlocks a lot of potential for in-depth data analysis and organization-wide optimization as well as reducing operational complexity.

We’re excited to keep finding new ways to help developers better query and manage their data.

Get started with Promscale

If you’re new to Promscale and want to get started with our new multi-tenancy functionality today:

  • Install Promscale today via Helm Charts, Docker, and others. Follow the instructions in our GitHub repository. As a reminder, Promscale is open-source and completely free to use. (GitHub ⭐️ welcome and appreciated! 🙏)
  • See our docs on how to enable multi-tenancy support on your Promscale instance today.
  • Check out our Getting Started with Promscale tutorial for more on how Promscale works, installation instructions, sample PromQL and SQL queries, and more.
  • Watch our Promscale 101 YouTube playlist for step-by-step demos and best practices.

Whether you’re new to Promscale or an existing community member, we’d love to hear from you! Join TimescaleDB Slack, where you’ll find 7K+ developers and Timescale engineers active in all channels. (The dedicated #promscale channel has 2.5K+ members, so it’s a great place to connect with like-minded community members, ask questions, get advice, and more).
