Get step-by-step demos and learn how to define 3 different alert rules and send notifications through channels like Slack, PagerDuty and OpsGenie
We recently finished part two of our “Grafana 101” webinar series, a set of technical sessions designed to get you up and running with Grafana’s various capabilities - complete with step-by-step demos, tips, and resources to recreate what we cover.
In this one, we focus on “Getting Started with Alerts,” where I go through what alerting in Grafana entails, show you how to select and set up 3 common alerts for key metrics, and create triggers to send notifications through popular channels. And, to make sure you leave ready to set up your own alerting and monitoring systems, I share various best practices and things I’ve learned along the way.
- If you missed this one, or are keen to level up your Grafana skills, we’re hosting our 3rd session, “Guide to Grafana 101: Getting Started with Interactivity, Templating and Sharing” on June 17th, 2020 (RSVP here)
What you’ll learn
Alerting is a crucial part of any monitoring setup, but getting alerts set up is often tricky and time-consuming, especially if you’re dealing with multiple data sources.
Thanks to Grafana’s alerting functionality, you can configure your visualizations and alerts for the metrics you care about in the same place!
While Grafana may be best known for its visualization capabilities, it’s also a powerful alerting tool. Personally, I like using it to notify me about anomalies, because it saves me the overhead of adding another piece of software to my stack – and I know many community members feel the same.
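I break the session into four parts: alerting principles, how alerts work in Grafana, a hands-on demo with 3 alerts and 3 notification channels, and resources and Q+A.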
Alerts tell us when things go wrong and get humans to take action.
When you implement alerts in any scenario, there are two important universal best practices:
- Avoid over-alerting: If engineers get alerts too frequently, the alerts cease to be useful or serve their purpose (i.e., instead of responding quickly, people tune them out as noise).
- Select use-case specific alerts: Different scenarios require monitoring different metrics, so alerts for monitoring a SaaS platform (site uptime, latency, etc.) are different from alerts for monitoring infrastructure (cluster health, disk usage, CPU/memory use).
Alerts in Grafana
In this section, I cover how alerts work in Grafana and their two constituent parts: alert rules and notification channels. Note: Grafana’s alerting functionality only works for graph panels with time-series output. At first this may seem limiting, but, thankfully, it isn’t for two reasons:
- You can convert any panel type (e.g., gauge, single stat, etc.) into a graph panel and then set up your alerts accordingly.
- Your graphs must have the output format of ‘time-series’, which is a reasonable constraint: you want to monitor how a certain metric changes over time, so your data is inherently time-series data.
You’ll also learn about the anatomy of alert rules and conditions, see how the FOR parameter works, and understand the various states alerts can take, depending on whether their associated alert rules are TRUE or FALSE.
Let’s Code: 3 Alerts and 3 Notification channels
After seeing the basics, we jump into the fun part: creating and testing alerts!
Using the scenario of monitoring a production TimescaleDB database, we set up different types of alerts for common monitoring metrics and connect our alert monitoring to popular notification channels:
Alert Type: Alerts using FOR
- Metric: Sustained high memory usage
- Notification channel: sent via Slack, where we have a channel to notify our DevOps team about new alerts from our Grafana setup.
Alert Type: Alerts without FOR
- Metric: Disk usage
- Notification channel: sent via PagerDuty, where an incident is automatically created and relevant DevOps teams and support personnel are notified (according to our pre-configured PagerDuty escalation policies).
Alert Type: Alerts with NO DATA
- Metric: Database aliveness
- Notification channel: sent via OpsGenie, where an alert is created and sent to the DevOps team and other support personnel (according to the notification policies we’ve configured for our team in OpsGenie). We'll use Slack as an additional notification channel as well.
Alerts using FOR
This part of the demo shows how to define an alert for sustained high memory usage on the database, using the Grafana alerting parameter FOR. The FOR parameter specifies the amount of time for which an alert rule must be true before the ALERTING state is triggered and an alert is sent via a notification channel. FOR is common in many alerting scenarios, as you often want to wait for your alert rule to be true for a period of time in order to avoid false positives (and waking people up in the middle of the night without cause).
Once we’ve defined our alert rule in Grafana, I show you how to set up Slack as a notification channel, so alert messages reach you (or the right team members) in a timely manner. You’ll see how to customize the message body with pertinent info and send notifications to specific channels and mention team members.
I also take you through how FOR alerts work in Grafana, using a state transition diagram to give you a mental model for when, how, and why to use them.
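To make the state transitions concrete, here is a minimal Python sketch of how a rule with a FOR duration might move between states. It is a toy model, not Grafana's implementation: the `ForAlert` class, the `run` helper, the one-minute evaluation step, and the exact boundary at which PENDING becomes ALERTING are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ForAlert:
    """Toy model of an alert rule with a FOR duration (names are illustrative)."""
    for_seconds: int
    state: str = "OK"
    pending_since: Optional[int] = None

    def evaluate(self, now: int, rule_is_true: bool) -> str:
        if not rule_is_true:
            # Rule is FALSE: return to OK immediately, with no waiting period.
            self.state, self.pending_since = "OK", None
        elif self.for_seconds == 0 or self.state == "ALERTING":
            # No FOR configured, or already alerting: fire or keep firing.
            self.state = "ALERTING"
        else:
            # Rule is TRUE: remember when it first became true, then wait.
            if self.pending_since is None:
                self.pending_since = now
            held_for = now - self.pending_since
            # Assumption: the alert fires once the rule has held for >= FOR.
            self.state = "ALERTING" if held_for >= self.for_seconds else "PENDING"
        return self.state


def run(for_seconds: int, results: List[bool], step: int = 60) -> List[str]:
    """Evaluate the rule once per `step` seconds and collect the states."""
    alert = ForAlert(for_seconds)
    return [alert.evaluate(i * step, r) for i, r in enumerate(results)]


# FOR = 5 minutes: the rule must stay TRUE for five minutes before ALERTING.
sustained = run(300, [False, True, True, True, True, True, True, False])

# A short spike never reaches ALERTING, which is how FOR filters false positives.
spike = run(300, [True, True, False, True])
```

In this model, `sustained` stays in PENDING for five evaluations before switching to ALERTING and then drops straight back to OK, while `spike` never leaves PENDING/OK. Calling `run(0, ...)` models a rule without FOR, which fires on the first TRUE evaluation.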
Alerts without FOR
This second part of the demo shows how to use a simple threshold condition to define an alert for high disk volume; this alert rule doesn’t use the FOR parameter, so alerts are sent as soon as the rule is triggered. In this example, there’s no need to use the FOR parameter, since disk usage (the metric we’re alerting on) doesn’t fluctuate up and down with time and usually only increases. Therefore, we can send out an alert as soon as our alert condition is TRUE.
Once we’ve defined our alert rule, I show you how to connect Grafana to PagerDuty, so that alerts in Grafana create cases in PagerDuty and notify teams via phone, email, or text, based on your PagerDuty configurations (e.g., whatever rules and notification methods you’ve set up in PagerDuty).
As in the first example, I use a state transition diagram to help you visualize how this works and when you’d use it versus other types.
Alerts with NO DATA
The final part of the demo shows how to define an alert for “aliveness” on our database, so we know if our database is up or down. Like our high disk volume alert, we use a threshold condition, and to show you how to trigger NO DATA alerts, I turn my demo database off to simulate an outage/downtime.
This immediately triggers NO DATA alerts for other alert rules on metrics from the database, namely sustained high memory and disk usage, the two alert rules we set up earlier in the demo.
NO DATA alert rules are useful to distinguish between an alert rule being true and there being no data with which to evaluate the alert rule. You would use ALERTING as the state for the former condition and NO DATA as the state for the latter. Usually, NO DATA alert rules indicate that there is a problem with the data source (e.g., the data source is down or has lost connection to Grafana).
From there, I show you how to connect Grafana to OpsGenie, so that alerts in Grafana create cases and alerts in OpsGenie, which can then notify various teams, using methods like email, text, and phone call. I also show how to send notifications via our existing Slack notification channel.
And, of course, I share a diagram for this one too :).
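The difference between ALERTING and NO DATA can be sketched in a few lines of Python. This is an illustrative model only: `evaluate_rule`, the use of an average, and the "above threshold" comparison are assumptions, not Grafana's internals or the exact condition from the demo.

```python
from statistics import mean
from typing import Sequence


def evaluate_rule(values: Sequence[float], threshold: float) -> str:
    """Classify one evaluation of a simple 'average above threshold' rule."""
    if not values:
        # Nothing came back from the data source: the rule cannot be judged.
        return "NO DATA"
    # Data is present, so the rule is either TRUE (ALERTING) or FALSE (OK).
    return "ALERTING" if mean(values) > threshold else "OK"


high_memory = evaluate_rule([91.0, 93.5], threshold=90.0)   # rule is TRUE
healthy = evaluate_rule([40.0, 42.0], threshold=90.0)       # rule is FALSE
database_down = evaluate_rule([], threshold=90.0)           # no points at all
```

The key design point is that an empty result is reported as its own state rather than being treated as "below threshold", so an outage of the data source doesn't silently look like a healthy system.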
Resources and Q+A
Want to recreate the sample monitoring setup shown in the demo? Or perhaps you want to modify it to use your data sources and notification channels to set up your own monitoring system? No worries, we have you covered!
We link to several resources and tutorials to get you on your way to monitoring mastery:
- Reason about Grafana Alert States with this handy reference chart
- Replicate our demo by following our Grafana alerting tutorial
- Follow along with the session recording
- Use our Prometheus Adapter and Helm Charts to install Grafana, TimescaleDB and Prometheus in your Kubernetes cluster, with one line of code.
- Get started with Timescale Cloud - our managed and hosted time-series database. You’ll get $300 in credits to help you get up and running
- Join our Slack to ask questions and get Grafana help from Timescale engineers - including Grafana contributors - and our helpful developer community.
Here’s a selection of questions we received (and answered) during the session:
Q: Does alerting work with templating or parameterized queries?
A: Template variables in Grafana do not work when used in alerts. As a workaround, you could set up a custom graph that only displays the output of a specific value that you’re templating against (e.g., a specific server name) and then define alerts for that graph.
If you use Prometheus, Prometheus Alertmanager might be worth exploring as well.
Q: If we set FOR to 5min, we wait for 5mins to go from PENDING to ALERTING. Do we wait for the same period of time to transition back from ALERTING to OK?
A: No, the FOR parameter defines the amount of time the alert rule must be in the TRUE state before an alert is sent. The FOR parameter applies to transitions from PENDING to ALERTING only. You don’t need to wait for the FOR period of time to go from ALERTING to OK. As soon as the alert rule evaluates to FALSE (i.e., the issue is resolved), the alert state will change from ALERTING to OK and be re-evaluated the next time your alert job runs.
Q: When specifying the condition to use in your alert rule, what does “now” mean? Why do you select to alert on Query A from 5 minutes ago till now? Isn't it always now?
A: Grafana requires you to have a period of time to evaluate if alerts need to be sent. This is the period over which you want to do calculations, which is why many conditions use aggregate functions such as avg() or max().
So, when we say Query(A, 5m, now), it means that we evaluate the alert rule on Query A from 5m ago to now. You can also set alert rules that say Query(A, 6m, now-1min) to account for a lag in data being inserted in your data source or network latency.
This is important, because if an alert condition uses Query(A, 5m, now) and there is no data available for that time window, NO DATA alerts will fire (i.e., there isn’t any data with which to evaluate the rule).
You’ll want to select a time interval to evaluate your aggregate function that’s greater (longer) than the time period in which you add new data to the graph that’s triggering your alerts.
For example, in my demo, we scrape our database every 10 seconds. If we set an alert rule like Query(A, 5s, now), we’d end up getting NO DATA alert states constantly, because our data is not that fine grained.
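Here is a rough Python sketch of why the window needs to be longer than the scrape interval. The `query_window` helper, the timestamps, and the choice of inclusive/exclusive window edges are assumptions for illustration; only the 10-second scrape interval and the three query windows come from the discussion above.

```python
from typing import List

SCRAPE_INTERVAL = 10  # seconds between samples, as in the demo


def query_window(timestamps: List[int], now: int, from_ago: int, to_ago: int = 0) -> List[int]:
    """Return the samples that fall between (now - from_ago) and (now - to_ago)."""
    start, end = now - from_ago, now - to_ago
    return [t for t in timestamps if start < t <= end]


# One hour of samples at t = 0, 10, 20, ..., 3600 seconds.
timestamps = list(range(0, 3601, SCRAPE_INTERVAL))
now = 3607  # the rule is evaluated 7 seconds after the latest scrape

five_min = query_window(timestamps, now, from_ago=300)             # like Query(A, 5m, now)
five_sec = query_window(timestamps, now, from_ago=5)               # like Query(A, 5s, now)
lagged = query_window(timestamps, now, from_ago=360, to_ago=60)    # like Query(A, 6m, now-1min)
```

In this sketch the 5-minute window holds plenty of samples to aggregate, while the 5-second window falls between two scrapes and comes back empty, which is the situation that would produce a NO DATA state. The lagged window has the same length as the 5-minute one, just shifted back a minute to tolerate late-arriving data.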
Ready for more Grafana goodness?
Thanks to those who joined me live! To those who couldn’t make it, the rest of Team Timescale and I are here to help at any time. Reach out via our Slack and we’ll happily assist!
Want to level up your Grafana skills even more? Sign up for session 3 in our Grafana 101 technical series on Wednesday, June 17: “Guide to Grafana 101: Getting Started with Interactivity, Templating and Sharing”
- 🔍RSVP to learn about making interactive, re-usable dashboards that you can version control and collaborate on with teammates!
To learn about future sessions and get updates about new content, releases, and other technical content, subscribe to our Biweekly Newsletter.
Excited to see you at the next session!