How to Monitor Microservices Without Drowning in Dashboards
You already have metrics. What you probably lack is a correlation contract — one collector, three signals, sampling at the tail.
6 min read · May 29, 2026
#Programming #DistributedSystems #OpenTelemetry #Observability #DevOps

Most teams running microservices know what Prometheus is. Fewer can open one slow checkout trace and land on the exact log line that proves which downstream call stalled. That gap is not a tooling gap — it's an architecture gap. You instrumented services individually, bought a dashboard product, and called it observability.
Monitoring a microservices architecture in production means shipping metrics, traces, and logs through a single export path, making them correlatable, and sampling before storage eats your budget. Everything else (Grafana vs Datadog, Tempo vs Jaeger) is a backend choice you can change without touching application code again, provided you get the collector layer right (OpenTelemetry primer; Collector docs).
Buying another dashboard license doesn't close that gap. Correlation does.
This is the practical guide: week-one rollout order, where OpenTelemetry sits, what RED actually maps to on alerts, and the failure modes that survive postmortems.
Week one — minimum viable observability stack
Don't boil the ocean. The sequence that actually sticks:
- Pick one critical path: checkout, auth, or payments. Not "all forty services."
- Instrument with OpenTelemetry: auto-instrumentation for HTTP/gRPC on that service first; manual spans only on business steps such as charge_card (OpenTelemetry primer).
- Run an OpenTelemetry Collector: apps export OTLP to a local agent (a DaemonSet on Kubernetes); the agent forwards to a gateway collector for batching and sampling (Collector docs).
- Backends: Prometheus (or Mimir) for metrics, Tempo (or Jaeger) for traces, Loki (or equivalent) for logs. Grafana ties them together with trace-to-log links (Tempo datasource).
- Define three alerts: error rate, p99 latency, and saturation on that path. Not CPU on every pod.
Skip multi-cloud federation until one path is debuggable end-to-end. The CNCF migration stories all start with a single-service experiment, and the same pattern applies here (CNCF OTel case study).
If week one ends with twelve new panels and zero trace-to-log links, you built wallpaper, not observability.
The Collector Layer — Why Apps Never Talk to Grafana Directly
If your services push OTLP straight to a vendor endpoint, you've welded instrumentation to a billing contract. The collector pattern exists so that instrumentation stays stable and swapping a backend is an experiment, not a rewrite (Collector docs; CNCF OTel case study).
Minimum pipeline shape:
- Receivers: OTLP from apps (gRPC/HTTP).
- Processors: memory_limiter first (prevents OOM cascades), then batch, enrichment (k8s attributes), then tail_sampling on the gateway tier (tail sampling processor).
- Exporters: Prometheus remote write or a scrape endpoint for metrics; OTLP to Tempo for traces; Loki for logs.
Worked path — checkout order-service:
A user hits checkout. order-service emits spans for create_order and call_payment. Spans go to a node-local collector agent. The agent batches and forwards to a gateway collector. The gateway's tail sampler keeps 100% of traces with error=true or duration above 2s, and ~5% of successful traces. Metrics still aggregate everything — you don't sample away error counters.
That's how you debug the one failed payment without storing ten million happy paths.
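The keep/drop decision in that worked path can be sketched in a few lines. This is an illustration of the policy only, not the collector's implementation: the Span class, the threshold constants, and the hash-based bucket are hypothetical names chosen for the example, and the slowest span is assumed to stand in for the root duration.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Span:
    # Hypothetical, simplified span record for illustration only.
    trace_id: str
    name: str
    duration_ms: float
    error: bool = False


SLOW_THRESHOLD_MS = 2000    # "duration above 2s" from the worked path
SUCCESS_KEEP_RATIO = 0.05   # "~5% of successful traces"


def trace_bucket(trace_id: str) -> float:
    """Map a trace ID to a stable value in [0, 1) so repeated runs agree."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def keep_trace(spans: list) -> bool:
    """Decide keep/drop once every span of one trace has arrived."""
    if not spans:
        return False
    if any(s.error for s in spans):
        return True  # always keep failures
    # Assumption: the longest span approximates the root duration.
    if max(s.duration_ms for s in spans) > SLOW_THRESHOLD_MS:
        return True  # always keep slow traces
    # Everything else: keep a stable ~5% slice of healthy traffic.
    return trace_bucket(spans[0].trace_id) < SUCCESS_KEEP_RATIO


failed_payment = [
    Span("t1", "create_order", 120),
    Span("t1", "call_payment", 90, error=True),
]
print(keep_trace(failed_payment))
```

The point of deciding after the whole trace is buffered is visible in the first branch: the error flag on call_payment is only known once that span has finished.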
[Community-sourced] Kubernetes OTLP guides repeat the same anti-pattern: apps exporting directly to vendors, then re-instrumenting on every migration. Don't.
The collector is the migration unit — not "we added OTel to one repo."
Three signals — what each one is for
Metrics tell you the system is on fire in aggregate: request rate, error ratio, latency histograms. Prometheus's pull-based scraping works well for long-running services; short-lived jobs need the Pushgateway or an OTLP metrics export before they exit (Prometheus overview).
Traces tell you which hop in the mesh burned the budget. A trace is a tree of spans; the root span is the user request, and its children are downstream calls (OpenTelemetry primer). Without traces, "checkout is slow" becomes a Slack argument.
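To make "which hop burned the budget" concrete, here is a minimal sketch that treats a trace as a parent/child tree and ranks spans by self time (own duration minus time spent in direct children). The Span fields and the sample trace are hypothetical, and the arithmetic assumes child spans run one after another, not in parallel.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    # Hypothetical, simplified span record for illustration only.
    span_id: str
    parent_id: Optional[str]
    name: str
    duration_ms: float


def self_times(spans):
    """Return {span_id: self time}, assuming children do not overlap."""
    child_total = defaultdict(float)
    for s in spans:
        if s.parent_id is not None:
            child_total[s.parent_id] += s.duration_ms
    return {s.span_id: max(s.duration_ms - child_total[s.span_id], 0.0) for s in spans}


def slowest_hop(spans):
    """Return the span that spent the most time in its own work."""
    times = self_times(spans)
    return max(spans, key=lambda s: times[s.span_id])


# Made-up checkout trace: the root is the user request, children are downstream calls.
trace = [
    Span("a", None, "POST /checkout", 1500),
    Span("b", "a", "create_order", 1400),
    Span("c", "b", "call_payment", 1200),
    Span("d", "b", "insert_order_row", 100),
]

print(slowest_hop(trace).name)
```

In this made-up trace the root and create_order each account for only 100 ms of their own work, so the ranking points at call_payment even though the root span is the longest overall.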
Logs are still necessary, but only if they carry trace_id and span_id in structured JSON so Grafana can pivot from a slow span to the pod log line at that millisecond (Tempo datasource). Logs without trace correlation are just noise with timestamps.
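A minimal sketch of what such a log line could look like, using only the standard library. The two context variables are hypothetical stand-ins: in a real service the IDs would come from the tracing SDK's active span context, and the exact field names your log backend expects may differ.

```python
import contextvars
import json
import logging

# Hypothetical holders for the active trace context (assumption for this sketch).
current_trace_id = contextvars.ContextVar("trace_id", default="")
current_span_id = contextvars.ContextVar("span_id", default="")


class JsonTraceFormatter(logging.Formatter):
    """Render each record as one JSON object that carries the trace IDs."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "trace_id": current_trace_id.get(),
            "span_id": current_span_id.get(),
        }
        return json.dumps(payload)


def build_logger(name: str) -> logging.Logger:
    """Create a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonTraceFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger


# Usage: set the IDs when a request starts, then log as usual.
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
current_span_id.set("00f067aa0ba902b7")
log = build_logger("order-service")
log.error("payment timed out after %sms", 2000)
```

Because the IDs are attached in the formatter, application code keeps calling the logger normally and every line stays joinable to its trace.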
Three signals, one request ID — or you're still doing whack-a-mole by service name.
The RED Slot Map — What to Alert On
RED (Rate, Errors, Duration) is the service-level checklist for user-facing microservices. It aligns with the four golden signals (latency, traffic, errors, saturation) without requiring four separate on-call playbooks (Google SRE monitoring).
Worked example — payments-api:
- Slot 1, Rate: request count over time. Alert on absence during business hours (scrape failure) and on traffic anomalies only once you have seasonality baselines.
- Slot 2, Errors: ratio of 5xx to total requests over 5m. Page when it stays above 2% for ten minutes, not on the first blip.
- Slot 3, Duration: histogram p99 over 5m. Page when p99 > 800ms, even if the error rate is not elevated: duration alone catches slow success paths that still lose money.
RED is per service, not per pod. Pod CPU alerts are saturation signals for platform teams; RED is how product engineering knows the API contract broke.
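The thresholds in the payments-api example can be written down as a small evaluation function. This is a sketch of the logic only, not an alerting rule for any particular system: the Sample shape and constant names are hypothetical, each sample is assumed to summarize a 5m window evaluated once per minute, and applying the same ten-minute hold to the duration check is an assumption of this example. The two conditions are reported separately so they can be combined however the on-call policy requires.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    # One per-minute evaluation of a 5m window (hypothetical shape).
    total: int        # requests in the window
    errors_5xx: int   # 5xx responses in the window
    p99_ms: float     # p99 latency derived from the histogram


ERROR_RATIO_LIMIT = 0.02   # "> 2%"
P99_LIMIT_MS = 800         # "p99 > 800ms"
HOLD_MINUTES = 10          # "for ten minutes, not on the first blip"


def error_ratio(sample: Sample) -> float:
    """Share of 5xx responses; zero traffic counts as zero errors here."""
    return sample.errors_5xx / sample.total if sample.total else 0.0


def sustained(flags, minutes: int = HOLD_MINUTES) -> bool:
    """True only if the condition held for the last `minutes` evaluations."""
    return len(flags) >= minutes and all(flags[-minutes:])


def red_alerts(samples) -> dict:
    """Evaluate the Errors and Duration slots independently."""
    return {
        "errors": sustained([error_ratio(s) > ERROR_RATIO_LIMIT for s in samples]),
        "duration": sustained([s.p99_ms > P99_LIMIT_MS for s in samples]),
    }


# Nine healthy minutes followed by one bad one: a blip, so nothing fires.
blip = [Sample(1000, 5, 300)] * 9 + [Sample(1000, 50, 300)]
print(red_alerts(blip))
```

Keeping the hold period in one helper makes the "not on the first blip" rule explicit and easy to tune per service.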
Wire alerts through Alertmanager so pages go to the owners of the service, not of the cluster (Alertmanager overview).
Sampling — keep the failures, drop the noise
100% trace retention in a high-QPS mesh is how observability budgets die. Tail sampling at the gateway collector is the production pattern: decide keep/drop after the full trace arrives, keep errors and slow traces, and sample the rest probabilistically (tail sampling processor).
Head sampling at the SDK is fine for cost guardrails early, but it can discard entire failed traces before you see the error span. Use both layers thoughtfully — sample for volume at the SDK, tail-sample for fidelity at the gateway.
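Why head sampling loses failures is easier to see in code. In this sketch the decision is made from the trace ID alone, before the outcome of the request exists, so failed traces are dropped at the same rate as healthy ones. The function names and the SHA-256 bucket are hypothetical stand-ins for illustration, not how any particular SDK computes its ratio.

```python
import hashlib


def head_keep(trace_id: str, ratio: float) -> bool:
    """Decide at the first span, from the trace ID alone; the outcome is unknown."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < ratio


def surviving_errors(traces, ratio: float) -> list:
    """traces: (trace_id, failed) pairs, where `failed` is only known after the fact."""
    return [tid for tid, failed in traces if failed and head_keep(tid, ratio)]


# Synthetic traffic: 10,000 traces, every tenth one fails (1,000 failures).
traces = [(f"t-{i}", i % 10 == 0) for i in range(10000)]

kept_at_20_percent = surviving_errors(traces, 0.20)
kept_at_100_percent = surviving_errors(traces, 1.0)
print(len(kept_at_20_percent), "of", len(kept_at_100_percent), "failed traces survive")
```

With a 20% head ratio, roughly four out of five failed traces never reach the gateway, which is why the error-keeping policy has to live in the tail sampler.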
Policies worth defaulting:
- Keep all traces with status=ERROR.
- Keep traces where root duration exceeds your p99 SLO.
- Sample 1–5% of success paths.
Monitor the collector itself — scrape its metrics, alert on dropped spans and queue depth. The pipeline you trust during outages must not be silently discarding data.
Cardinality and cost landmines
- Never put unbounded user_id or full URL paths on metric labels: Prometheus cardinality explodes (Prometheus overview).
- Do use semantic conventions for HTTP routes, for example /orders/{id} rather than /orders/1842 (semantic conventions).
- Don't enable every auto-instrumentation library: some generate thousands of low-value spans per request.
- Do validate instrumentation in staging with trace volume estimates before production rollout.
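The route-label rule above can be illustrated with a small sketch that collapses raw paths into templates and counts how many distinct label values result. The regex-based helper is a hypothetical fallback written for this example; where the web framework exposes its own route template, that is usually the better source for the label.

```python
import re

# Segments that look like identifiers (assumption: numeric IDs and UUIDs only).
_NUMERIC = re.compile(r"^\d+$")
_UUID = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", re.I
)


def route_template(path: str) -> str:
    """Replace ID-like path segments with {id} and drop the query string."""
    segments = path.split("?")[0].split("/")
    return "/".join(
        "{id}" if _NUMERIC.match(s) or _UUID.match(s) else s for s in segments
    )


def label_cardinality(paths) -> int:
    """Number of distinct label values after templating."""
    return len({route_template(p) for p in paths})


raw_paths = [f"/orders/{i}" for i in range(5000)]
print(len(set(raw_paths)), "raw label values ->", label_cardinality(raw_paths), "templated")
print(route_template("/orders/1842/items?page=2"))
```

Each distinct label value becomes its own time series, so the difference between the two counts is the difference between one series per route and one series per order.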
Prometheus is built for outage diagnosis when other systems fail, which is exactly when you need it, but it is the wrong tool for penny-perfect per-request billing ledgers (Prometheus overview). Know which problem you're solving.
What "done" looks like
You can answer, for one request ID: which services did it touch, which span failed, what did the structured log say, and whether error rate or latency SLOs burned — without SSHing into a box.
Instrument once with OpenTelemetry. Route through a collector. Map RED to alerts. Sample at the tail. Correlation is the product; dashboards are just the UI.
Metrics tell you that it's broken. Traces tell you where. Logs tell you why — if you bothered to wire the IDs.