What are traces, and how SQL (yes, SQL) and OpenTelemetry can help us get more value out of traces to build better software

Promscale, the observability backend powered by SQL, now includes support for collecting traces via OpenTelemetry in beta.

Developing software is hard. Debugging complex software systems is often harder. When our software systems are not healthy, we need ways to quickly identify what is happening, and then dig deeper to understand why it is happening (and then fix the underlying issues). But in modern architectures based on microservices, Kubernetes and cloud, identifying problems (let alone predicting them) has become more and more difficult.

Enter observability, built on the collection of telemetry from modern software systems. This telemetry typically comes in the form of three signals: metrics, logs, and traces.

Metrics and logs are well known and have been widely adopted for many years through tools like Nagios, Prometheus, or the ELK stack.

Traces, on the other hand, are relatively new and have seen much lower adoption. Why? Because, for most engineers, tracing is still a relatively new concept. Because getting started takes a lot of manual instrumentation work. And because, once we have done that work, getting value out of trace data is hard (for example, most tracing tools today just provide the ability to look up a trace by id or apply very simple filtering like Jaeger or Grafana Tempo).

However, traces are key to understanding the behavior of and troubleshooting modern architectures.

Here we demystify traces and explain what they are, and why they are useful, using a concrete example. Then we describe OpenTelemetry and explain how it vastly simplifies the manual instrumentation work required to generate traces.

Finally, we announce the beta release of trace support in Promscale, the observability backend powered by SQL, via OpenTelemetry, and describe how Promscale and SQL enable us to get much more value out of trace data.

Read on for more. If you’d like to get started with Promscale right away:

• Install the latest version of Promscale, following the instructions in our GitHub repository (appreciate any GitHub stars!). As a reminder, Promscale is open-source and completely free to use.
• Join the TimescaleDB Slack community, where you’ll find 7K+ developers and Timescale engineers active in all channels. (The dedicated #promscale channel has 2.9K+ members, so it’s a great place to connect with like-minded community members, ask questions, get advice, and more).

If you would like to connect with us, catch our talks at PromCon North America. We are also at KubeCon+CloudNativeCon North America 🎉 Come to say hi to the Timescale booth (#S76) or join the Timescale virtual booth to chat with Promscale engineers, see Promscale in action during Office Hours session (Oct 14, 10:30 am PDT), and get cool swag!

What is a trace?

A “trace” represents how an individual request flows through the various microservices in a distributed system. A “span” is the fundamental building block of a trace, representing a single operation. In other words, a trace is a collection of underlying spans.

Let’s look at a specific example to illustrate the concept.

Imagine that we have a news site that is made of four micro-services:

• The frontend service, which serves the website to our customer’s browser.
• The news service, which provides the list of articles from our news database that is populated by the editorial system (and has the ability to search articles and get individual articles together with their comments).
• The comment service, which lets you save new comments in the comment database and retrieve all comments for an article.
• The advertising service, which returns a list of ads to display from a third-party provider.

When a user clicks on the link to view an article on her browser this is what happens:

1) The frontend service receives the request.

2) The frontend service calls the news service passing the identifier for the article.

3) The news service calls the news database to retrieve the article.

4) The news service calls the comment service to retrieve the comments for the article.

5) The comment service calls the comment database and runs a query to retrieve all comments for the article.

6) The news service gets all comments for the article and sends the article and the comments back to the frontend service.

7) The frontend service calls the advertising service passing the contents of the article.

8) The advertising service makes a REST API request to the third-party ads provider with the contents of the article to retrieve the list of optimized ads to display.

9) The frontend service builds the HTML page including the news, the comments and the ads send the response back to the user’s browser.

When adding trace instrumentation to each of those services, you generate spans for each operation in the execution of the request outlined above (and more if you want more detailed tracking). This would result in a hierarchy of spans like shown in the diagram below:

The entry point for the request is the frontend service. The first span the frontend service emits covers the entire execution of the request. That span is called the root span. All other spans are descendants of that root span. The length of each span indicates their duration.

What do traces show?

As you can see from the diagram above there are two key pieces of information we can extract from a trace:

1. A connected representation of all the steps to process individual requests which very easily allow us to zero-in on the service and operation that is causing issues in your system when troubleshooting a production problem.
2. The dependencies between different components in the system which we could use to build a map of how those components connect to each other. In a large system with tens or hundreds of components it is important to explore and understand the topology of the system to identify problems and improvements.

Faster troubleshooting with traces

Let’s see through a practical example of how traces help identify problems in your applications faster.

Continuing with our example, imagine that less than 1% of the REST API requests to the ads provider are suffering from slow response time (30 seconds). But those requests always come from the same set of users who are complaining to your support team and on Twitter.

We look into the problem but aggregate percentile metrics (see our blog post on percentile metrics to learn how to use them) are not indicating any problem because overall performance (99th percentile of response time, p99 for short) is great since this only impacts a small number of requests.

Then we start looking into our logs but that’s a daunting task. We have to look at logs for each individual service and there are concurrent requests that make it very hard to read and connect those logs to identify the problem.

What we need is to reconstruct requests around the time users reported the issue, find the requests that were slow, and then determine where that slowness happened.

With logs, we have to do that manually which for a high traffic site it would take hours to do.

With traces, we can do that automatically and quickly identify where the problem is.

One simple way to do it is to search for the slowest traces, that is, top ten root spans with the highest duration during the time the users complained and look at what’s consuming most of the execution time. In our example, by looking at the visual trace representation we would very quickly see that what’s common in the slowest traces is that the request to the ads provider REST API is taking too long. An even better way (if your tracing system allows for that) would be to run a query to retrieve the slowest spans in the execution path of the slowest traces which would return the spans tracking the REST API calls.

Note that looking at the p99 response time of those REST API calls would not have revealed any problem either because the problem occurred in less than 1% of those requests.

We’ve quickly narrowed down where the problem is and can start looking for a solution as well as inform the ads provider of the issue so they can investigate it.

One quick fix could be to put in place a one-second timeout in the API request so that a problem with the ads provider doesn’t impact your users. Another more sophisticated solution could be to make ads rendering an asynchronous call so it doesn’t impact rendering the article.

Traces help us proactively make software better

We can also use traces to proactively make our software better. For example, we could search for the slowest spans that represent database requests to identify queries to optimize. We could search for requests (i.e., traces) that involve a high number of services, or many database calls, or too many external service calls, and look for ways to optimize them and/or simplify the architecture of the system.

Recap

We have seen how traces are useful to troubleshoot problems in microservices environments faster and discover improvements that we could implement to help us build better software that delights our customers.

So what’s needed to get the benefits that traces provide? A tool to instrument services to generate traces, and a system to store, query and visualize them.

Continue reading to learn how OpenTelemetry makes instrumentation easier, and how Promscale and SQL enable us to get more value out of traces, faster.

Traces in OpenTelemetry

OpenTelemetry is a vendor-agnostic emerging standard to instrument and collect traces (and also metrics and logs!) that can then be sent to and analyzed in any OpenTelemetry compatible backend. It has recently been accepted as a CNCF incubating project and it has a lot of momentum: it’s the second most active CNCF project, only after Kubernetes, with contributions from all major observability vendors, cloud providers, and many end users (including Timescale).

OpenTelemetry includes a number of core components:

• API specification: defines how to produce telemetry data in the form of traces, metrics, and logs.
• Semantic conventions: defines a set of recommendations to standardize the information to include in the telemetry (for example attributes like the status code of a span representing an http request) and ensure better compatibility between systems.
• OpenTelemetry protocol (OTLP): defines a standard encoding, transport, and delivery mechanism of telemetry data between the different components of an observability stack: telemetry sources, collectors, backends, etc.
• SDKs: language-specific implementations of the OpenTelemetry API with additional capabilities like processing and exporting for traces, metrics and logs.
• Instrumentation libraries: language-specific libraries that provide instrumentation for other libraries. All instrumentation libraries support manual instrumentation and several offer automatic instrumentation through byte-code injection.
• Collector: this component provides the ability to receive telemetry from a wide variety of sources and formats, process it and export it to a number of different backends. It eliminates the need to manage multiple agents and collectors.

To collect traces from our code with OpenTelemetry we will use the SDKs and instrumentation libraries for the language your services are written in. Instrumentation libraries make OpenTelemetry easy to adopt because they can auto-instrument (yes, auto-instrument!) our code for services written in languages that allow for injecting instrumentation at runtime like Java and Node.js.

For example, the OpenTelemetry Java instrumentation library automatically collects traces from a large number of libraries and frameworks. For languages where that is not possible (like Go) we also get libraries that simplify the instrumentation but require more changes to the code.

Even more, developers of libraries and components are already adding OpenTelemetry instrumentation directly in their code. Two examples of that are Kubernetes and GraphQL Apollo Server.

Once our code is instrumented we have to configure the SDK to export the data to an observability backend. While we can send the data directly from our application to a backend, it is more common to send the data to the OpenTelemetry Collector and then have it send the data to one or multiple backends. This way you can simplify the management and configuration of where you want to send the data and have the possibility to do additional processing (downsampling, dropping, transforming data) before it is sent to another system for analysis.

Anatomy of an OpenTelemetry trace

The OpenTelemetry tracing specification defines the data model for a trace. Technically, a trace is a directed acyclic graph of Spans with parent-child relationships. Traces must include an operation name, start and finish timestamps, a parent Span identifier and the SpanContext. The SpanContext contains the TraceId, the SpanID, TraceFlags and the Tracestate. Optionally, Spans can also have a list of Events, Links to other related Spans and a set of Attributes (key-value pairs). Events are typically used to add to a Span one-time events like errors (error message, stacktrace and error code) or log lines that were recorded during the span execution. Links are less common but allow OpenTelemetry to support special scenarios like relating spans that are included in a batch operation with the batch operation span which would be initiated by multiple parents (i.e., all the individual spans that add elements to the batch).

Attributes can contain any key-value pair. Trace semantic conventions define some mandatory and optional attributes to help with interoperability. For example, a span representing the client or server of an HTTP request must have an http.method attribute and may have http.url and http.host attributes.

Recap

OpenTelemetry is a new vendor-agnostic standard that makes trace instrumentation and collection much easier through automation, SDKs and a protocol to provide interoperability with observability tools.

Analyzing traces with SQL

Above, we talked about several ways to get value out of traces. For example, when debugging our news application, searching for the top 10 root spans with the highest duration during the time that users complained. Or, when trying to proactively improve our news application, to search for the slowest spans that represent database requests to identify queries to optimize. Or similarly, to search for requests (i.e. traces) that involve a high number of services, or many database calls, or too many external service calls, and look for ways to optimize them and/or simplify the architecture of the system.

Yet there is no standard way to analyze trace data to ask these questions. In the past, a small number of organizations (e.g., Google, Twitter, Uber, etc.) have built their own systems to do deeper analysis on their trace data. There have also been a number of open-source efforts (e.g., Jaeger, Zipkin, Grafana Tempo) which provide a UI that is very helpful to find and visualize individual traces but don’t provide the flexibility of a language to run any query and aggregate traces as needed to derive new insights.

Turns out that we already have a universal query language for data analysis, one that most developers (and many non-developers) already know: SQL.

In fact, with SQL, we can interrogate our trace data to answer any question we need to answer, in a way not possible with existing open-source tracing tools like Jaeger and Grafana Tempo.

For example:

• List the operations with the highest error rates.
• List the slowest API methods.
• List the slowest database queries across all services.
• Identify customers suffering from the worst database query performance and how it compares to performance across all customers.
• Identify the upstream service causing the load on a service that is seeing elevated load.

Even more, with SQL you could correlate traces and metrics at query time. For example, you could show response time per service correlated with CPU and memory consumption per service if you have container metrics in the same database.

But of course, before we can use SQL to query traces (and metrics), we need to be able to store those traces in a scalable system that supports SQL, yet is designed for observability.

Enter Promscale.

Promscale was first announced one year ago as an analytical platform for Prometheus metrics. Promscale is built on the solid foundation of TimescaleDB and PostgreSQL, and therefore has full SQL support (along with many other neat features like 100% PromQL compliance). Today, companies like Digital Ocean, Electronic Arts and Catalytic rely on Promscale to scale Prometheus to do long-term retention and analysis on their metrics.

Our vision for Promscale is to enable engineers to store all observability data (metrics, logs, traces, metadata, and other future data types) in a single mature and scalable store and analyze it through a unified and complete SQL interface that provides developers with:

• Broad support for observability standards (e.g., OpenTelemetry, Prometheus/PromQL, StatsD, Jaeger, Zipkin, etc.) to simplify integration in any environment.
• Operational simplicity with just one storage system that is easy to deploy, manage, scale, and learn about.
• A familiar experience with PostgreSQL as the foundation and unified access to all data via full SQL support so they don’t need to learn other query languages.
• Unparalleled insights through the power of TimescaleDB’s advanced time-series analytical functions and Postgres’ SQL query capabilities (joins, sub-queries, etc.) to analyze and correlate observability and business data.
• 100s of out-of-the-box integrations through the PostgreSQL ecosystem: visualization tools, IDEs, ORMs, management tools, performance tuning, etc.

Today we are announcing the beta release of trace support in Promscale, our second major step in fulfilling our vision.

Promscale exposes an ingest endpoint that is OTLP-compliant which makes integration with OpenTelemetry instrumentation straightforward. Other tracing formats like Jaeger, Zipkin or OpenCensus can also be sent to Promscale through the OpenTelemetry Collector. Traces stored in Promscale can be queried with full SQL and visualized using Jaeger or Grafana.

Designing an optimized schema for traces

Promscale stores OpenTelemetry traces using the following schema:

The schema is heavily influenced by the OpenTelemetry protocol buffer definitions.

The heart of the model is the span table. The span table is a hypertable, using the power of TimescaleDB to manage the ingestion and querying of spans over time. Similarly, both links and events are hypertables. Link tables use the start time of the source span, whereas each event has its own associated time.

The span table is self-referencing. All the spans in a given trace form a tree. The span table uses the adjacency model to capture these trees - each span contains the id of the span’s parent. The root span of each trace has a parent_span_id of 0. The OpenTelemetry protocol buffers already utilize an adjacency model to represent the tree structure, so keeping this same model at the database layer makes ingestion easier. Other models such as the path enumeration model can be derived from an adjacency model, and we do this in several convenience functions.

Spans, links, events, and resources can all be decorated with zero or more attributes. Attributes are key-value pairs in which the key is a text name, and the value can be a primitive type (string, boolean, double, or 64 bit integer), or a homogeneous array.

We have chosen to use “tag” instead of “attribute” in our model. “Attribute” gets to be a real pain to type over and over and over, and by our estimation “tag” is more widely used and understood in the industry.

Tag values map very closely to json, and therefore we have chosen to store tag values in the jsonb data type. PostgreSQL has extensive support for storing and manipulating jsonb values, including indexing and jsonpath querying. By using jsonb, we can piggy-back on PostgreSQL’s features to provide rich ways of filtering spans that in many cases will be indexed operations.

Many tags will be repeated again and again across many spans. For this reason, we are not storing tags directly in the span table, but normalizing them out into tag_key and tag tables. This eliminates data duplication and greatly reduces the storage required for the span table.

Traditionally, a many-to-many relationship between spans and tags would be represented with an additional mapping table. We decided to eschew this approach. An additional mapping table would have considerably more records than the span table itself. It would need to contain both the trace_id and span_id which are 128 bits and 64 bits respectively, a timestamptz (it would need to be a hypertable in its own right), and the 64 bit ids of the tag and tag_key. This would require significant additional storage.

Instead of this mapping table, we created a domain over jsonb called a “tag_map”. A tag_map is a json object where keys are the ids of the tag_key table, and the associated values are ids of the tag table. Thus, the tag_map is a set of key-value pairs of tag_keys to tag values. We have thus essentially collapsed the mapping table into the span table.

The tag table may in some setups become quite large. We have utilized PostgreSQL’s declarative partitioning to hash partition the tag table on the key text over 64 partitions. This effectively “load balances” the tag values over many smaller tables rather than storing them all in one big table. By hash partitioning on key text, all the values for a given key will be colocated in the same partition, and thereby belong to the same partition-level indexes, improving performance. Partitioning the tags should also provide some resiliency, in that operations on one key should not impact all keys. The number 64 was chosen more or less randomly. We intend to do some testing to find an optimal number of partitions.

We are sharing how our schema works to help other developers in the community looking at solving a similar problem. Knowing the design of the underlying schema can also be helpful in situations where you may need to improve the performance of some query. However, you don’t really need (and we don’t expect you!) to understand how our schema works because Promscale comes with out-of-the-box views, functions and operators that make querying the data easier and faster that we cover in the next section.

Querying Traces using SQL

Promscale provides unified access to all your traces and metrics through a single, robust and well-known query language: SQL. Thanks to the power of SQL, the out-of-the-box views, functions and operators built-in in Promscale and the time-series analytical functions provided by TimescaleDB you can interrogate your data about pretty much any question you need to answer.

Everything you need to query traces is in the ps_trace database schema:

• 3 views: span, event and link. Those views automatically join the different tables in the schema to provide a consolidated view of a span, an event or a link as if all attributes were stored in the same table.
• A number of operators so you can easily apply conditions to resource, span, link and event attributes (we call those attributes tags in our data model).
• Functions to easily navigate through and retrieve tag values.

To understand how to query traces we will use some of the examples we listed at the beginning of this section. The data we use in the examples comes from a Kubernetes cluster running Honeycomb’s fork of Google’s microservices demo that uses OpenTelemetry instrumentation.

In the examples below we show the raw results from the SQL queries. Those results can be easily displayed in Grafana dashboards by connecting to the underlying TimescaleDB/PostgreSQL database using Grafana’s PostgreSQL datasource:

Query 1:  List the top operations with the highest error rate in the last hour

Response time, throughput and error rate are the key metrics used to assess the health of a service. In particular, high error rate and response time are key indicators of the experience we are delivering to our users and need to be tracked closely. In this example, we look at how we can use OpenTelemetry traces and SQL to identify the main sources of errors in our services.

Every span has an attribute that indicates the name of the service (service_name) and an attribute that indicates the name of the operation (name). Technically, the service name in OpenTelemetry is not a span attribute but a resource attribute. For the purposes of querying traces in Promscale we can think of service name as a span attribute as well.

Every span has another attribute that indicates whether the span resulted in an error or not: status_code. If there is an error, the value of status_code is STATUS_CODE_ERROR.

Our goal is to write a query that will return one row per individual operation with the total number of executions (number of spans), total number of executions that led to an error (spans where the status code is an error) and the percentage of spans with an error for the top 10 operations with more errors:

SELECT
service_name,
span_name as operation,
COUNT(*) FILTER (WHERE status_code = 'STATUS_CODE_ERROR') as spans_with_error,
COUNT(*) as total_spans,
TO_CHAR(100*(CAST(COUNT(*) FILTER (WHERE status_code = 'STATUS_CODE_ERROR') AS float) / count(*)), '999D999%') as error_rate
FROM span
WHERE
start_time > NOW() - INTERVAL '1 hour'
GROUP BY service_name, operation
ORDER BY error_rate DESC
LIMIT 10;


As you can see, the query uses the standard SQL syntax we are all familiar with: SELECT, FROM, WHERE, GROUP BY, ORDER BY.

In the SELECT clause we project the service name and the operation (which corresponds to the span name). These are also the two attributes we use to aggregate the results in the GROUP BY clause.

The other three columns we project are the number of spans with an error, the total number of spans and the error rate within each error group. To do it we use some nifty SQL capabilities available in PostgreSQL:

• COUNT (*) which returns the total number of rows, spans in this case, in a group.
• COUNT (*) FILTER which returns the total number of spans in a group matching a certain criteria. In our case we want spans with an error which are indicated by the value STATUS_CODE_ERROR in the status_code attribute.
• CAST to convert the number of spans with error to a floating point number so when it’s divided by the total count of spans it returns a floating number and not an integer. If we don’t do this then the number will be converted to the closest integer which will be 0 since that division is always less than 0.
• TO_CHAR to convert the error rate to an easy to read and understand percentage number.

Below is an example of results from this query:

 service_name |             operation             | spans_with_error | total_spans | error_rate
--------------+-----------------------------------+------------------+-------------+------------
frontend     | /cart/checkout                    |               12 |         345 |    3.478%
cart         | grpc.health.v1.Health/Check       |                0 |         707 |     .000%
cart         | hipstershop.CartService/EmptyCart |                0 |         345 |     .000%
cart         | hipstershop.CartService/GetCart   |                0 |        7533 |     .000%
checkout     | SQL SELECT                        |                0 |         361 |     .000%
checkout     | grpc.health.v1.Health/Check       |                0 |         718 |     .000%
checkout     | getDiscounts                      |                0 |         345 |     .000%
checkout     | hipstershop.CartService/EmptyCart |                0 |         345 |     .000%


Which immediately indicates that we need to take a closer look at the code behind the /cart/checkout operation since it has a very high error rate most likely leading to many lost sales!

Another thing that drives our attention in these results are the second and third rows. What’s surprising about them is that the error rate on the client side of the request (frontend - hipstershop.AdService/GetAds) is much higher than the error rate on the server side of the same request (adservice - hipstershop.AdService/GetAds). So the adservice is successfully completing the request but something is happening when transferring the response back to the frontend service.

Before we move on to the next example, let’s look at a different way to write the query that leverages additional SQL capabilities. To calculate the error rate in the query above we are using twice the functions required to count spans with error and total spans. This can lead to inconsistencies if we update the query to change the way we calculate the error rate but don’t apply those changes everywhere. To avoid that we can use SQL subqueries:

SELECT
service_name,
operation,
spans_with_error,
total_spans,
TO_CHAR(100*(CAST(spans_with_error AS float)) / total_spans, '999D999%') as error_rate
FROM (
SELECT
service_name,
span_name as operation,
COUNT(*) FILTER (WHERE status_code = 'STATUS_CODE_ERROR') as spans_with_error,
COUNT(*) as total_spans
FROM span
WHERE
start_time > NOW() - INTERVAL '1 hour'
GROUP BY service_name, operation
) AS error_summary
ORDER BY error_rate DESC
LIMIT 10;


This query first builds a dataset (error_summary) with service name, operation, spans with error and total spans and then it uses the values calculated in that dataset to compute the error rate avoiding the duplication in the initial query. This change doesn’t impact the performance of the query.

This is a straightforward example of what you can do with subqueries. SQL subqueries provide a lot of flexibility for analyzing your traces to derive new insights.

An additional consideration is that the query searches across all spans and depending on the environment and the amount of instrumentation you could see some duplicative results because of parent-child relationships between spans. In those cases it could be better to start by looking at traces (complete request) that had an error. You can do that by only searching across root spans by adding an additional condition to the where clause:

WHERE start_time > NOW() - INTERVAL '1 hour' AND parent_span_id = 0


Query 2: List the top slowest operations in the last hour

At the beginning of the previous example we identified response time and error rate as two key indicators of the experience we deliver to our users. Let’s see now how we can quickly determine bottlenecks in our services.

All spans contain a duration attribute that indicates how long it took to execute that span. To analyze the duration we will look at several statistics and in particular percentiles. To learn more about percentiles and why you should use them instead of averages take a look at our blog post on this subject.

In this example, our goal is to write a query that will return one row per individual operation with several percentiles (99.9th, 99th, 95th) and the average of the duration in milliseconds. For this one we will search across root spans so we see which user requests have the worse response time:

SELECT
service_name,
span_name as operation,
ROUND(approx_percentile(0.99, percentile_agg(duration_ms))::numeric, 3) as duration_p99,
ROUND(approx_percentile(0.99, percentile_agg(duration_ms))::numeric, 3) as duration_p999,
ROUND(approx_percentile(0.95, percentile_agg(duration_ms))::numeric, 3) as duration_p95,
ROUND(avg(duration_ms)::numeric, 3) as duration_avg
FROM span
WHERE
start_time > NOW() - INTERVAL '1 hour' AND
parent_span_id = 0
GROUP BY service_name, operation
ORDER BY duration_p99 DESC
LIMIT 10;


This query uses TimescaleDB’s approx_percentile hyperfunction to calculate the different percentiles. You could write the same query using PostgreSQL’s native precentile_cont function. However, approx_percentile is faster than percentile_cont while incurring a small error (less than 3%). With our test data, approx_percentile performs 35% faster than percentile_cont with minimal error (less than 3%). If you need more precision, you can decrease the error at the expense of lower performance by replacing percentile_agg with uddsketch.

We use the ROUND function to limit the number of decimals shown in each column. We need to use ::numeric to convert the return value of approx_percentile which is double precision to a type supported by the ROUND function, numeric in this case.

 service_name |              operation              | duration_p99 | duration_p999 | duration_p95 | duration_avg
--------------+-------------------------------------+--------------+---------------+--------------+--------------
frontend     | /cart/checkout                      |    20658.359 |     20658.359 |     1238.359 |      600.609
cart         | grpc.health.v1.Health/Check         |     1843.628 |      1843.628 |      911.864 |      132.689
frontend     | /                                   |     1159.651 |      1159.651 |      673.087 |      286.100
frontend     | /product/{id}                       |      752.567 |       752.567 |      307.197 |      118.877
frontend     | /cart                               |      582.594 |       582.594 |      270.289 |       92.472
checkout     | hipstershop.CurrencyService/Convert |      193.229 |       193.229 |      101.888 |       30.177
frontend     | /setCurrency                        |        7.504 |         7.504 |        0.908 |        0.498
payment      | grpc.grpc.health.v1.Health/Check    |        1.615 |         1.615 |        0.172 |        0.386
currency     | grpc.grpc.health.v1.Health/Check    |        1.355 |         1.355 |        0.140 |        0.420


If we look at the results we can for example quickly see that there seems to be an issue with the /cart/checkout endpoint. Note that if we just looked at the average we would think performance is good (600 ms). Even if we look at the 95% percentile performance still looks acceptable at 1.2 seconds. However, when we look at the 99% percentile we can see the performance is extremely poor (20 seconds). So somewhere in between 1% and 5% of the requests delivered a very poor user experience. This is definitely something worth investigating further.

As we can see, the results of the query are showing the performance of automated health checks. That may be useful in some scenarios but it’s not something our users experience and we may want to filter them out. To do it we would just add an additional condition to the where clause:

start_time > NOW() - INTERVAL '1 hour' AND
parent_span_id = 0 AND
span_name NOT LIKE '%health%'


Query 3: Identify what services generate more load on other services by operation in the last 30 minutes

In microservice architectures, there are many internal calls between services. One of those microservices could be going through a lot of load. That may be a service that is used by several other services for performing different operations and we would not immediately know what’s causing that.

Using the query below we can quickly list all dependencies across services and get an understanding of not only who is calling who and how often, but also what are the specific operations involved in those calls and how long the execution of those calls is taking in aggregate (in seconds). This serves as an additional indicator of load since some types of requests could be much more expensive than others.

SELECT
client_span.service_name AS client_service,
server_span.service_name AS server_service,
server_span.span_name AS server_operation,
count(*) AS number_of_requests,
ROUND(sum(server_span.duration_ms)::numeric) AS total_exec_time
FROM
span AS server_span
JOIN span AS client_span
ON server_span.parent_span_id = client_span.span_id
WHERE
client_span.start_time > NOW() - INTERVAL '30 minutes' AND
client_span.service_name != server_span.service_name
GROUP BY
client_span.service_name,
server_span.service_name,
server_span.span_name
ORDER BY
server_service,
server_operation,
number_of_requests DESC;


In this query we are leveraging another powerful capability of SQL: joins. We are joining the span table with itself to identify only the spans that represent a call between two services. This is what the condition client_span.service_name != server_span.service_name accomplishes.

The result would look something like the following:

 client_service | server_service |                    server_operation                     | number_of_requests | total_exec_time
----------------+----------------+---------------------------------------------------------+--------------------+-----------------
frontend       | cart           | hipstershop.CartService/AddItem                         |                509 |               5
checkout       | cart           | hipstershop.CartService/EmptyCart                       |                174 |               2
frontend       | cart           | hipstershop.CartService/GetCart                         |               3697 |              25
checkout       | cart           | hipstershop.CartService/GetCart                         |                174 |               2
frontend       | checkout       | hipstershop.CheckoutService/PlaceOrder                  |                174 |              57
frontend       | currency       | grpc.hipstershop.CurrencyService/Convert                |               8635 |              14
checkout       | currency       | grpc.hipstershop.CurrencyService/Convert                |                408 |               1
frontend       | currency       | grpc.hipstershop.CurrencyService/GetSupportedCurrencies |               3876 |               3
checkout       | email          | /hipstershop.EmailService/SendOrderConfirmation         |                174 |               0
checkout       | payment        | grpc.hipstershop.PaymentService/Charge                  |                174 |               0
frontend       | productcatalog | hipstershop.ProductCatalogService/GetProduct            |              20436 |               1
checkout       | productcatalog | hipstershop.ProductCatalogService/GetProduct            |                234 |               0
frontend       | productcatalog | hipstershop.ProductCatalogService/ListProducts          |                501 |               0
frontend       | recommendation | /hipstershop.RecommendationService/ListRecommendations  |               3374 |              15
frontend       | shipping       | hipstershop.ShippingService/GetQuote                    |               1027 |               0
checkout       | shipping       | hipstershop.ShippingService/GetQuote                    |                174 |               0
checkout       | shipping       | hipstershop.ShippingService/ShipOrder                   |                174 |               0


These results show the client service making the request in the first column and the service and operation receiving the request in the second and third columns. We see for example that the GetProduct method of the ProductCatalogService has been called from the frontend service and the checkout service and that the former made many more calls in the last 30 minutes, which is expected and not an issue in this case. But if we saw a much higher percentage of calls to the GetProduct method originating from the checkout service this would be an indicator of something unexpected going on.

Conclusion

Traces are extremely useful to troubleshoot and understand modern distributed systems. They help us answer questions that are impossible or very hard to answer with just metrics and logs. Adoption of traces has been traditionally slow because trace instrumentation has required a lot of manual effort and existing observability tools have not allowed us to query the data in flexible ways to get all the value traces can provide.

This is not true anymore thanks to OpenTelemetry and Promscale.

OpenTelemetry is quickly becoming the instrumentation standard and it offers libraries that automate (or at the very least simplify) trace instrumentation dramatically reducing the amount of effort required to instrument our services. Additionally, the instrumentation is vendor agnostic and the traces it generates can be sent to any compatible observability backend so we can change backends or use multiple ones.

Promscale is the observability backend with full SQL support for querying traces and metrics. With Promscale you can interrogate your data to answer any question you need to answer.

The combined power of OpenTelemetry and Promscale help you get more value out of traces and build better software.

Get started with Promscale

To start getting more value out of your traces and metrics with Promscale:

Whether you’re new to Promscale or an existing community member, we’d love to hear from you! Join TimescaleDB Slack, where you’ll find 7K+ developers and Timescale engineers active in all channels. (The dedicated #promscale channel has 2.5K+ members, so it’s a great place to connect with like-minded community members, ask questions, get advice, and more).

This post was written by