OpenTelemetry Metrics

OpenTelemetry Metrics available in gRPC

Overview

gRPC supports an OpenTelemetry plugin that provides metrics to help you:

  • Troubleshoot your system
  • Iterate on improving system performance
  • Set up continuous monitoring and alerting

Background

OpenTelemetry is an observability framework for creating and managing telemetry data. gRPC previously provided observability support through OpenCensus, which has been sunset in favor of OpenTelemetry.

Instruments

The gRPC OpenTelemetry plugin accepts a MeterProvider and depends on the OpenTelemetry API to create a Meter that identifies the gRPC library being used (for example, grpc-c++ at version 1.57.1). The instruments listed below are created using this meter. To customize the views exported by OpenTelemetry, use the OpenTelemetry SDK.

More and more gRPC components are being instrumented for observability. Currently, the following components are instrumented:

  • Per-call (stable, on by default): Observe RPCs themselves (for example, latency).
    • Client Per-Call: Observe a client call.
    • Client Per-Attempt: Observe attempts for a client call, since a call can have multiple attempts due to retry or hedging.
    • Server: Observe a call received at the server.
  • LB Policy: Observe various load-balancing policies.
    • Weighted Round Robin (experimental)
    • Pick-First (experimental)
  • XdsClient (experimental)

NOTE: Some instruments are off by default and must be explicitly enabled through the gRPC OpenTelemetry plugin API. Experimental metrics are always off by default. (Reference: C++ API)
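The relationship between the client per-call and per-attempt instruments can be sketched as follows. This is only an illustration: the `Recorder` class, the `simulate_call` function, and the method and target strings are hypothetical stand-ins, not part of gRPC or OpenTelemetry. The call duration is approximated as the sum of the attempt durations, which ignores retry backoff and overlapping hedged attempts.

```python
from dataclasses import dataclass, field


@dataclass
class Recorder:
    """Hypothetical in-memory sink; not a gRPC or OpenTelemetry API."""
    points: list = field(default_factory=list)

    def record(self, instrument, value, attributes):
        self.points.append((instrument, value, dict(attributes)))


def simulate_call(recorder, method, target, attempts):
    """Record one client call made of one or more attempts.

    attempts is a list of (duration_seconds, status) tuples, in order.
    """
    base = {"grpc.method": method, "grpc.target": target}
    total = 0.0
    status = None
    for duration, status in attempts:
        # Every attempt is counted when it starts ...
        recorder.record("grpc.client.attempt.started", 1, base)
        # ... and gets its own duration sample when it finishes.
        recorder.record("grpc.client.attempt.duration", duration,
                        {**base, "grpc.status": status})
        total += duration
    # The call as a whole produces exactly one duration sample,
    # labeled here with the status of the final attempt.
    recorder.record("grpc.client.call.duration", total,
                    {**base, "grpc.status": status})


recorder = Recorder()
simulate_call(recorder, "example.Greeter/SayHello", "dns:///localhost:50051",
              [(0.20, "UNAVAILABLE"), (0.05, "OK")])
for point in recorder.points:
    print(point)
```

In this sketch, a call that is retried once yields two samples on each per-attempt instrument but only one sample on grpc.client.call.duration.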

Per-Call Metrics

Client Per-Call Instruments

| Name | Type | Unit | Labels (required) | Description |
|---|---|---|---|---|
| grpc.client.call.duration | Histogram | s | grpc.method, grpc.target, grpc.status | This metric aims to measure the end-to-end time the gRPC library takes to complete an RPC from the application’s perspective. |

Refer to A66: OpenTelemetry Metrics for details.

Client Per-Attempt Instruments

| Name | Type | Unit | Labels (disposition) | Description |
|---|---|---|---|---|
| grpc.client.attempt.started | Counter | {attempt} | grpc.method (required), grpc.target (required) | The total number of RPC attempts started, including those that have not completed. |
| grpc.client.attempt.duration | Histogram | s | grpc.method (required), grpc.target (required), grpc.status (required), grpc.lb.locality (optional) | End-to-end time taken to complete an RPC attempt including the time it takes to pick a subchannel. |
| grpc.client.attempt.sent_total_compressed_message_size | Histogram | By | grpc.method (required), grpc.target (required), grpc.status (required), grpc.lb.locality (optional) | Total bytes (compressed but not encrypted) sent across all request messages (metadata excluded) per RPC attempt; does not include grpc or transport framing bytes. |
| grpc.client.attempt.rcvd_total_compressed_message_size | Histogram | By | grpc.method (required), grpc.target (required), grpc.status (required), grpc.lb.locality (optional) | Total bytes (compressed but not encrypted) received across all response messages (metadata excluded) per RPC attempt; does not include grpc or transport framing bytes. |

Refer to A66: OpenTelemetry Metrics for details.

Server Instruments

| Name | Type | Unit | Labels (required) | Description |
|---|---|---|---|---|
| grpc.server.call.started | Counter | {call} | grpc.method | The total number of RPCs started, including those that have not completed. |
| grpc.server.call.sent_total_compressed_message_size | Histogram | By | grpc.method, grpc.status | Total bytes (compressed but not encrypted) sent across all response messages (metadata excluded) per RPC; does not include grpc or transport framing bytes. |
| grpc.server.call.rcvd_total_compressed_message_size | Histogram | By | grpc.method, grpc.status | Total bytes (compressed but not encrypted) received across all request messages (metadata excluded) per RPC; does not include grpc or transport framing bytes. |
| grpc.server.call.duration | Histogram | s | grpc.method, grpc.status | This metric aims to measure the end-to-end time an RPC takes from the server transport’s (HTTP/2 or inproc) perspective. |

Refer to A66: OpenTelemetry Metrics for details.

LB Policy Instruments

Weighted Round Robin LB Policy Instruments

| Name | Type | Unit | Labels (disposition) | Description |
|---|---|---|---|---|
| grpc.lb.wrr.rr_fallback | Counter | {update} | grpc.target (required), grpc.lb.locality (optional) | EXPERIMENTAL: Number of scheduler updates in which there were not enough endpoints with valid weight, which caused the WRR policy to fall back to RR behavior. |
| grpc.lb.wrr.endpoint_weight_not_yet_usable | Counter | {endpoint} | grpc.target (required), grpc.lb.locality (optional) | EXPERIMENTAL: Number of endpoints from each scheduler update that don’t yet have usable weight information (i.e., either the load report has not yet been received, or it is within the blackout period). |
| grpc.lb.wrr.endpoint_weight_stale | Counter | {endpoint} | grpc.target (required), grpc.lb.locality (optional) | EXPERIMENTAL: Number of endpoints from each scheduler update whose latest weight is older than the expiration period. |
| grpc.lb.wrr.endpoint_weights | Histogram | {weight} | grpc.target (required), grpc.lb.locality (optional) | EXPERIMENTAL: Weight of an endpoint recorded every scheduler update. |

Refer to A78: gRPC OTel Metrics for WRR, Pick First, and XdsClient for details.
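As a rough illustration of how the WRR counters relate to one another, the sketch below classifies endpoints during a single scheduler update. Everything here is an assumption made for illustration: the `classify_endpoints` function, the `(non_empty_since, last_updated)` timestamp pair used to model an endpoint, and the `min_usable` threshold for "not enough endpoints" are hypothetical and do not come from the gRPC implementation.

```python
def classify_endpoints(endpoints, now, blackout_period, expiration_period,
                       min_usable=2):
    """Return hypothetical counter increments for one scheduler update.

    endpoints is a list of (non_empty_since, last_updated) timestamps in
    seconds; both are None until a load report has been received.
    """
    not_yet_usable = stale = usable = 0
    for non_empty_since, last_updated in endpoints:
        if non_empty_since is None or last_updated is None:
            not_yet_usable += 1      # no load report received yet
        elif now - last_updated >= expiration_period:
            stale += 1               # latest weight is older than the expiration period
        elif now - non_empty_since < blackout_period:
            not_yet_usable += 1      # still inside the blackout period
        else:
            usable += 1
    return {
        "grpc.lb.wrr.endpoint_weight_not_yet_usable": not_yet_usable,
        "grpc.lb.wrr.endpoint_weight_stale": stale,
        # min_usable is an assumed threshold, not a documented value.
        "grpc.lb.wrr.rr_fallback": 1 if usable < min_usable else 0,
    }


increments = classify_endpoints(
    endpoints=[(None, None), (0.0, 100.0), (95.0, 100.0), (0.0, 10.0)],
    now=101.0, blackout_period=10.0, expiration_period=60.0)
print(increments)
```

With these made-up inputs, two endpoints are not yet usable, one is stale, and only one has a usable weight, so the sketch also counts one fallback to round robin.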

Pick First LB Policy Instruments

| Name | Type | Unit | Labels (required) | Description |
|---|---|---|---|---|
| grpc.lb.pick_first.disconnections | Counter | {disconnection} | grpc.target | EXPERIMENTAL: Number of times the selected subchannel becomes disconnected. |
| grpc.lb.pick_first.connection_attempts_succeeded | Counter | {attempt} | grpc.target | EXPERIMENTAL: Number of successful connection attempts. |
| grpc.lb.pick_first.connection_attempts_failed | Counter | {attempt} | grpc.target | EXPERIMENTAL: Number of failed connection attempts. |

Refer to A78: gRPC OTel Metrics for WRR, Pick First, and XdsClient for details.

XdsClient Instruments

| Name | Type | Unit | Labels (required) | Description |
|---|---|---|---|---|
| grpc.xds_client.connected | Gauge | {bool} | grpc.target, grpc.xds.server | EXPERIMENTAL: Whether or not the xDS client currently has a working ADS stream to the xDS server. |
| grpc.xds_client.server_failure | Counter | {failure} | grpc.target, grpc.xds.server | EXPERIMENTAL: A counter of xDS servers going from healthy to unhealthy. |
| grpc.xds_client.resource_updates_valid | Counter | {resource} | grpc.target, grpc.xds.server, grpc.xds.resource_type | EXPERIMENTAL: A counter of resources received that were considered valid, even if unchanged. |
| grpc.xds_client.resource_updates_invalid | Counter | {resource} | grpc.target, grpc.xds.server, grpc.xds.resource_type | EXPERIMENTAL: A counter of resources received that were considered invalid. |
| grpc.xds_client.resources | Gauge | {resource} | grpc.target, grpc.xds.authority, grpc.xds.cache_state, grpc.xds.resource_type | EXPERIMENTAL: Number of xDS resources. |

Refer to A78: gRPC OTel Metrics for WRR, Pick First, and XdsClient for details.

Labels/Attributes

Along with each recorded measurement for an instrument, gRPC may provide additional information as attributes (also called labels). For example, each measurement of grpc.client.attempt.started carries the labels grpc.method and grpc.target, which identify the method and the target of the RPC attempt being observed.

NOTE: Some attributes are marked as optional on the instruments. These must be explicitly enabled through the gRPC OpenTelemetry plugin API. (Reference: C++ API)

| Name | Description |
|---|---|
| grpc.method | Full gRPC method name, including package, service and method, e.g. “google.bigtable.v2.Bigtable/CheckAndMutateRow”. |
| grpc.status | gRPC server status code received, e.g. “OK”, “CANCELLED”, “DEADLINE_EXCEEDED”. |
| grpc.target | Canonicalized target URI used when creating gRPC Channel, e.g. “dns:///pubsub.googleapis.com:443”, “xds:///helloworld-gke:8000”. |
| grpc.lb.locality | The locality to which the traffic is being sent. |
| grpc.xds.server | For clients, indicates the target of the gRPC channel in which the XdsClient is used. For servers, will be the string “#server”. |
| grpc.xds.authority | The xDS authority. The value will be “#old” for old-style non-xdstp resource names. |
| grpc.xds.cache_state | Indicates the cache state of an xDS resource (“requested”, “does_not_exist”, “acked”, “nacked”, “nacked_but_cached”). |
| grpc.xds.resource_type | xDS resource type, such as “envoy.config.listener.v3.Listener”. |
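When post-processing exported metrics, it can be handy to break a grpc.method value into its parts, for example to group by service. The helper below is a minimal sketch, not part of any gRPC API; `split_method` is a hypothetical name, and it assumes the "package.Service/Method" shape shown in the table above.

```python
def split_method(full_method):
    """Split a grpc.method value into (package, service, method)."""
    # Tolerate a leading slash, then separate the method from the service path.
    service_path, _, method = full_method.lstrip("/").rpartition("/")
    # The last dot separates the package from the service name.
    package, _, service = service_path.rpartition(".")
    return package, service, method


print(split_method("google.bigtable.v2.Bigtable/CheckAndMutateRow"))
# ('google.bigtable.v2', 'Bigtable', 'CheckAndMutateRow')
```

A service declared without a package yields an empty string for the package component.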

FAQ

Q. How do I get throughput or QPS (queries per second)?

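Use a count aggregation on the latency histogram metrics: grpc.client.attempt.duration or grpc.client.call.duration (for clients), or grpc.server.call.duration (for servers). Each completed RPC or attempt contributes one sample, so the count over a time window divided by the window length gives the rate.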
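The count aggregation can be sketched as below. This assumes the histogram data points have already been exported and reduced to `(attributes, count)` pairs for a fixed window; that shape, the `qps_by_method` name, and the sample numbers are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def qps_by_method(points, window_seconds):
    """Sum histogram sample counts per grpc.method and convert to a rate.

    points is a list of (attributes, count) pairs, where count is the number
    of samples the histogram recorded for those attributes in the window.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    totals = defaultdict(int)
    for attributes, count in points:
        totals[attributes["grpc.method"]] += count
    return {method: count / window_seconds for method, count in totals.items()}


# Hypothetical data points for grpc.server.call.duration over a 60 s window.
points = [
    ({"grpc.method": "example.Greeter/SayHello", "grpc.status": "OK"}, 1140),
    ({"grpc.method": "example.Greeter/SayHello", "grpc.status": "UNAVAILABLE"}, 60),
    ({"grpc.method": "example.Greeter/SayBye", "grpc.status": "OK"}, 300),
]
print(qps_by_method(points, 60))
# {'example.Greeter/SayHello': 20.0, 'example.Greeter/SayBye': 5.0}
```

In practice a metrics backend usually performs this aggregation for you; the sketch only shows what the count aggregation computes.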

Q. How do I get error rate for RPCs?

Error counts can be calculated by filtering for grpc.status != OK on the latency histogram metrics: grpc.client.attempt.duration or grpc.client.call.duration (for clients), or grpc.server.call.duration (for servers). Dividing the error count by the unfiltered count gives the error rate.
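A minimal sketch of that filter, again assuming the histogram data points are available as hypothetical `(attributes, count)` pairs (the `error_rate` name and the sample numbers are made up for illustration):

```python
def error_rate(points):
    """Fraction of histogram samples whose grpc.status is not "OK"."""
    total = sum(count for _, count in points)
    errors = sum(count for attributes, count in points
                 if attributes["grpc.status"] != "OK")
    # Report 0.0 when nothing was recorded, to avoid dividing by zero.
    return errors / total if total else 0.0


# Hypothetical data points for grpc.client.attempt.duration.
points = [
    ({"grpc.method": "example.Greeter/SayHello", "grpc.status": "OK"}, 950),
    ({"grpc.method": "example.Greeter/SayHello", "grpc.status": "DEADLINE_EXCEEDED"}, 30),
    ({"grpc.method": "example.Greeter/SayHello", "grpc.status": "UNAVAILABLE"}, 20),
]
print(error_rate(points))
# 0.05
```

Whether statuses such as CANCELLED should count as errors depends on the application, so the filter may need to be adjusted.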

Language examples

| Language | Example |
|---|---|
| C++ | C++ Example |
| Go | Go Example |
| Java | Java Example |
| Python | Python Example |

Additional Resources