How to Monitor Microservices Without Drowning in Dashboards

You already have metrics. What you probably lack is a correlation contract — one collector, three signals, sampling at the tail.

Serhii Malyshev · Software architect and tech lead · 6 min read · May 29, 2026

#Programming #DistributedSystems #OpenTelemetry #Observability #DevOps

Figure: trace on the left, RED burn on the right — correlation, not wallpaper.

Most teams running microservices know what Prometheus is. Fewer can open one slow checkout trace and land on the exact log line that proves which downstream call stalled. That gap is not a tooling gap — it's an architecture gap. You instrumented services individually, bought a dashboard product, and called it observability.

Monitoring a microservices architecture in production means shipping metrics, traces, and logs through a single export path, making them correlatable, and sampling before storage eats your budget. Everything else — Grafana vs Datadog, Tempo vs Jaeger — is a backend choice you can change without re-touching application code if you get the collector layer right (OpenTelemetry primer; Collector docs).

Buying another dashboard license doesn't close that gap. Correlation does.

This is the practical guide: week-one rollout order, where OpenTelemetry sits, what RED actually maps to on alerts, and the failure modes that survive postmortems.

Week one — minimum viable observability stack

Don't boil the ocean. The sequence that actually sticks:

Pick one critical path — checkout, auth, or payments. Not "all forty services."
Instrument with OpenTelemetry — auto-instrumentation for HTTP/gRPC on that service first; manual spans only on business steps, e.g. charge_card (OpenTelemetry primer).
Run an OpenTelemetry Collector — apps export OTLP to a local agent (DaemonSet on Kubernetes); agent forwards to a gateway collector for batching and sampling (Collector docs).
Backends — Prometheus (or Mimir) for metrics, Tempo (or Jaeger) for traces, Loki (or equivalent) for logs. Grafana ties them together with trace-to-log links (Tempo datasource).
Define three alerts — error rate, p99 latency, saturation on that path. Not CPU on every pod.

Skip multi-cloud federation until one path is debuggable end-to-end. The CNCF migration stories all start with one service experiment — same pattern (CNCF OTel case study).

If week one ends with twelve new panels and zero trace-to-log links, you built wallpaper, not observability.

The Collector Layer — Why Apps Never Talk to Grafana Directly

If your services push OTLP straight to a vendor endpoint, you've welded instrumentation to a billing contract. The collector pattern exists so instrumentation is stable and backends are swap experiments (Collector docs; CNCF OTel case study).

Minimum pipeline shape:

Receivers — OTLP from apps (gRPC/HTTP).
Processors — memory_limiter first (prevents OOM cascades), then batch, enrichment (k8s attributes), then tail_sampling on the gateway tier (tail sampling processor).
Exporters — Prometheus remote write or scrape endpoint for metrics; OTLP to Tempo for traces; Loki for logs.

Worked path — checkout order-service:

A user hits checkout. order-service emits spans for create_order and call_payment. Spans go to a node-local collector agent. The agent batches and forwards to a gateway collector. The gateway's tail sampler keeps 100% of traces with error=true or duration above 2s, and ~5% of successful traces. Metrics still aggregate everything — you don't sample away error counters.

That's how you debug the one failed payment without storing ten million happy paths.

[Community-sourced] Kubernetes OTLP guides repeat the same anti-pattern: apps exporting directly to vendors, then re-instrumenting on every migration. Don't.

The collector is the migration unit — not "we added OTel to one repo."

Three signals — what each one is for

Metrics tell you the system is on fire in aggregate — request rate, error ratio, histogram latency. Prometheus pull scrapes work well for long-running microservices; short-lived jobs need the Pushgateway or an OTLP metrics export before exit (Prometheus overview).

Traces tell you which hop in the mesh burned the budget. A trace is a tree of spans; the root span is the user request, children are downstream calls (OpenTelemetry primer). Without traces, "checkout is slow" becomes a Slack argument.
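To make the tree-of-spans idea concrete, here is a minimal sketch that walks one trace and finds the hop that burned the most time. The Span class, the span names, and the timings are hypothetical, and the self-time calculation assumes child spans run one after another without overlapping, which real traces do not guarantee.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def self_time_ms(span: Span, spans: list[Span]) -> int:
    """Time spent in the span itself, excluding its direct children.

    Simplification: assumes children run sequentially and do not overlap.
    """
    children = [s for s in spans if s.parent_id == span.span_id]
    return span.duration_ms - sum(c.duration_ms for c in children)


def slowest_hop(spans: list[Span]) -> Span:
    """Return the span with the largest self time in one trace."""
    return max(spans, key=lambda s: self_time_ms(s, spans))


# Hypothetical checkout trace: one root request with downstream calls.
trace = [
    Span("a", None, "POST /checkout", 0, 2300),
    Span("b", "a", "create_order", 10, 110),
    Span("c", "a", "call_payment", 120, 2250),
    Span("d", "c", "charge_card", 130, 2240),
]

culprit = slowest_hop(trace)
print(culprit.name, self_time_ms(culprit, trace))  # charge_card 2110
```

A tracing backend does this for you in its waterfall view; the point of the sketch is only that the answer falls out of parent/child links, not out of per-service dashboards.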

Logs are still necessary — but only if they carry trace_id and span_id in structured JSON so Grafana can pivot from a slow span to the pod log line at that millisecond (Tempo datasource). Logs without trace correlation are just noise with timestamps.
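As a rough illustration of what "structured JSON with trace_id and span_id" can look like, the sketch below uses only the Python standard library. The two context variables are hypothetical stand-ins: in a real service the IDs would come from the active OpenTelemetry span context rather than being set by hand, and the field names should match whatever your log backend is configured to parse.

```python
import contextvars
import io
import json
import logging

# Hypothetical context holders; a real OpenTelemetry SDK exposes the active
# span context itself, so these stand in for that lookup.
current_trace_id = contextvars.ContextVar("trace_id", default="")
current_span_id = contextvars.ContextVar("span_id", default="")


class TraceJsonFormatter(logging.Formatter):
    """Render each record as one JSON object carrying trace_id and span_id."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "trace_id": current_trace_id.get(),
            "span_id": current_span_id.get(),
        })


stream = io.StringIO()  # stands in for stdout so the output can be inspected
handler = logging.StreamHandler(stream)
handler.setFormatter(TraceJsonFormatter())
log = logging.getLogger("order-service")
log.setLevel(logging.INFO)
log.addHandler(handler)
log.propagate = False

# Example IDs in W3C Trace Context shape (32 and 16 hex characters).
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
current_span_id.set("00f067aa0ba902b7")
log.warning("payment provider timed out after %d ms", 2000)

entry = json.loads(stream.getvalue())
print(entry["trace_id"], entry["message"])
```

With the IDs on every line, the pivot from a slow span to its log line becomes a label lookup instead of a grep by timestamp.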

Three signals, one request ID — or you're still doing whack-a-mole by service name.

The RED Slot Map — What to Alert On

RED (Rate, Errors, Duration) is the service-level checklist for user-facing microservices. It aligns with the four golden signals mindset — latency, traffic, errors, saturation — without requiring four separate playbooks on call (Google SRE monitoring).

Worked example — payments-api:

Slot 1 — Rate — request count over time. Alert on absence during business hours (scrape failure) and on traffic anomalies only when you have seasonality baselines.
Slot 2 — Errors — ratio of 5xx to total requests over 5m. Page when > 2% for ten minutes, not on the first blip.
Slot 3 — Duration — histogram p99 over 5m. Page when p99 > 800ms even if the error rate looks healthy — duration alone catches slow success paths that still lose money.
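The three slots can be sketched as one function over a window of requests. This is an illustration under stated assumptions, not how Prometheus computes anything: the Request class and the sample data are hypothetical, p99 is taken by nearest rank on raw samples where a real backend would estimate it from histogram buckets, and the "for ten minutes" hold is left to the alerting layer. The 2% and 800ms thresholds are the ones from the slots above.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int         # HTTP status code
    duration_ms: float  # server-side latency


def red_summary(requests: list[Request], window_s: float) -> dict:
    """Compute Rate, Errors, and Duration for one service over one window."""
    total = len(requests)
    if total == 0:
        # No traffic at all: Slot 1 treats this as a possible scrape failure.
        return {"rate_rps": 0.0, "error_ratio": 0.0, "p99_ms": 0.0}
    errors = sum(1 for r in requests if 500 <= r.status <= 599)
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank p99 on raw samples, using integer ceiling of 0.99 * total.
    rank = (99 * total + 99) // 100
    return {
        "rate_rps": total / window_s,
        "error_ratio": errors / total,
        "p99_ms": durations[rank - 1],
    }


# Hypothetical 5-minute window: 1000 requests, 30 of them 5xx, 20 slow.
window = (
    [Request(200, 120.0)] * 950
    + [Request(200, 950.0)] * 20
    + [Request(503, 300.0)] * 30
)
summary = red_summary(window, window_s=300)
errors_breached = summary["error_ratio"] > 0.02  # Slot 2 threshold
duration_breached = summary["p99_ms"] > 800      # Slot 3 threshold
print(summary, errors_breached, duration_breached)
```

The two flags are evaluated separately on purpose, so the paging rule that combines them stays a policy decision rather than something buried in the math.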

RED is per service, not per pod. Pod CPU alerts are saturation signals for platform teams; RED is how product engineering knows the API contract broke.

Wire alerts through Alertmanager so pages go to owners of the service, not the cluster (Alertmanager overview).

Sampling — keep the failures, drop the noise

100% trace retention in a high-QPS mesh is how observability budgets die. Tail sampling at the gateway collector is the production pattern: decide keep/drop after the full trace arrives, keep errors and slow traces, probabilistic-sample the rest (tail sampling processor).

Head sampling at the SDK is fine for cost guardrails early, but it can discard entire failed traces before you see the error span. Use both layers thoughtfully — sample for volume at the SDK, tail-sample for fidelity at the gateway.

Policies worth defaulting:

Keep all traces with status=ERROR.
Keep traces where root duration exceeds your p99 SLO.
Sample 1–5% of success paths.
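The three default policies reduce to a short decision function. The sketch below is a plain-Python model of the logic, not the collector's tail_sampling processor or its configuration format. The CompletedTrace class and the synthetic traffic are hypothetical, the 2000ms threshold is borrowed from the checkout example and stands in for your own p99 SLO, and 5% is the top of the suggested success-path range.

```python
import random
from dataclasses import dataclass


@dataclass
class CompletedTrace:
    trace_id: str
    has_error: bool          # any span ended with an error status
    root_duration_ms: float  # duration of the root span


def keep_trace(
    trace: CompletedTrace,
    slow_ms: float = 2000.0,
    success_rate: float = 0.05,
    rng=random.random,
) -> bool:
    """Tail-sampling decision, made only after the whole trace has arrived."""
    if trace.has_error:
        return True              # policy 1: keep every failed trace
    if trace.root_duration_ms > slow_ms:
        return True              # policy 2: keep every slow trace
    return rng() < success_rate  # policy 3: sample the happy paths


# Synthetic traffic: 1% of traces fail, 2% are slow, the rest are healthy.
seeded = random.Random(7)  # seeded so repeated runs give the same result
traces = [
    CompletedTrace(
        trace_id=f"t{i}",
        has_error=(i % 100 == 0),
        root_duration_ms=2500.0 if i % 50 == 1 else 150.0,
    )
    for i in range(10_000)
]
kept = [t for t in traces if keep_trace(t, rng=seeded.random)]
print(f"kept {len(kept)} of {len(traces)} traces")
```

Every failed and slow trace survives while most healthy ones are dropped, which is the property head sampling cannot give you because it decides before the outcome is known.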

Monitor the collector itself — scrape its metrics, alert on dropped spans and queue depth. The pipeline you trust during outages must not be silently discarding data.

Cardinality and cost landmines

Never put unbounded user_id or full URL paths on metric labels — Prometheus cardinality explodes (Prometheus overview).
Do use semantic conventions for HTTP routes: /orders/{id}, not /orders/1842 (semantic conventions).
Don't enable every auto-instrumentation library — some generate thousands of low-value spans per request.
Do validate instrumentation in staging with trace volume estimates before production rollout.
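One way to picture the route rule is a small normalizer that collapses dynamic path segments before the value reaches a metric label. This is a fallback sketch under assumptions: it treats only numeric IDs and UUIDs as dynamic, and the function name is hypothetical. Where your web framework exposes the matched route template, prefer that value (OpenTelemetry's HTTP conventions carry it as http.route) over guessing from the raw path.

```python
import re

# Assumption: numeric IDs and UUIDs are the only dynamic segments here.
_UUID = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")


def route_template(path: str) -> str:
    """Collapse dynamic path segments so the label set stays bounded."""
    path = path.split("?", 1)[0]  # query strings never belong on a label
    segments = []
    for segment in path.split("/"):
        if segment.isdigit() or _UUID.match(segment):
            segments.append("{id}")
        else:
            segments.append(segment)
    return "/".join(segments)


print(route_template("/orders/1842"))            # /orders/{id}
print(route_template("/orders/1842/items?x=1"))  # /orders/{id}/items
```

Ten million order IDs then map to one label value instead of ten million time series.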

Prometheus is built for outage diagnosis when other systems fail — which is exactly when you need it — but it is the wrong tool for penny-perfect per-request billing ledgers (Prometheus overview). Know which problem you're solving.

What "done" looks like

You can answer, for one request ID: which services did it touch, which span failed, what did the structured log say, and whether error rate or latency SLOs burned — without SSHing into a box.

Instrument once with OpenTelemetry. Route through a collector. Map RED to alerts. Sample at the tail. Correlation is the product; dashboards are just the UI.

Metrics tell you that it's broken. Traces tell you where. Logs tell you why — if you bothered to wire the IDs.
