Observability Architecture in 2026: OpenTelemetry, Continuous Profiling & Exemplars

Table of Contents
- Monitoring vs Observability: A Precise Definition
- The Crisis of Siloed Signals
- The Four Pillars of Observability in 2026
- OpenTelemetry: The Universal Standard
- The OTel Collector: Your Observability Router
- Implementing OTel in Code: Auto vs Manual Instrumentation
- Exemplars: Linking Metrics to Traces
- Continuous Profiling: The Fourth Pillar
- SLO-Driven Alerting and Error Budgets
- Tail-Based Sampling: Controlling Cost Without Losing Signal
- The Observability Stack in 2026
- Frequently Asked Questions
- Key Takeaway
Monitoring vs Observability: A Precise Definition
| | Monitoring | Observability |
|---|---|---|
| Approach | Pre-defined dashboards for known failure modes | Answer arbitrary questions about system behaviour |
| Question type | "Is the disk full?" (known unknowns) | "Why is this user getting errors only on iOS?" (unknown unknowns) |
| Data | Metrics snapshot at intervals | Rich context: traces, logs, profiles, attributes |
| Alert style | Threshold breach → alert | SLO breach + error budget burn rate |
| Investigation | Jump between dashboards | Start from symptom → drill to cause |
Monitoring tells you that something is wrong. Observability tells you why.
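The distinction can be made concrete with a toy sketch (the events and attributes below are invented for illustration): monitoring evaluates a question someone defined in advance, while observability lets you slice rich, attribute-tagged events along dimensions nobody built a dashboard for.

```python
# Toy wide events (invented data): each request carries many attributes.
events = [
    {"service": "checkout", "status": 500, "os": "ios",     "version": "3.2"},
    {"service": "checkout", "status": 200, "os": "android", "version": "3.2"},
    {"service": "checkout", "status": 500, "os": "ios",     "version": "3.2"},
    {"service": "checkout", "status": 200, "os": "ios",     "version": "3.1"},
]

# Monitoring: evaluate a pre-defined question ("what is the error rate?").
error_rate = sum(e["status"] >= 500 for e in events) / len(events)
print(error_rate)  # → 0.5

# Observability: ask a question nobody anticipated ("why only on iOS?").
ios_errors = [e for e in events if e["os"] == "ios" and e["status"] >= 500]
print({e["version"] for e in ios_errors})  # → {'3.2'} (errors cluster on one release)
```

The key enabler is that every event keeps its full context, so new questions never require new instrumentation.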
The Crisis of Siloed Signals
The original "three pillars" model recommended separate tools for each signal:
```
Metrics (Prometheus/Datadog) ──────────────── Silo A
Logs (Elasticsearch/Splunk) ───────────────── Silo B
Traces (Jaeger/Zipkin) ────────────────────── Silo C
```

During an incident at 3am:
- Alert fires: p99 latency > 5 seconds
- Open Datadog → find the spike on checkout-service
- Switch to Splunk → search logs for checkout-service errors (10 minutes)
- Open Jaeger → find slow traces for checkout-service (another 5 minutes)
- Still don't know which line of code is slow (no profiles)

Mean Time to Diagnose: 45+ minutes
The problem is correlation — each tool has its own time axis, its own request IDs, its own service naming. Stitching the picture together manually is the bottleneck.
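The correlation bottleneck can be seen in miniature (the records below are invented): with only timestamps you get fuzzy, window-dependent matches, whereas a shared trace ID turns the same question into an exact join.

```python
# Invented records from three separate "silos":
metric = {"ts": 100.0, "p99_ms": 8200, "service": "checkout-service"}
logs = [
    {"ts": 100.2, "msg": "card declined", "trace_id": "abc123"},
    {"ts": 101.9, "msg": "ok",            "trace_id": "def456"},
]
traces = [
    {"trace_id": "abc123", "duration_ms": 8200},
    {"trace_id": "def456", "duration_ms": 40},
]

def correlate_by_time(metric, logs, window_s=1.0):
    # The manual 3am approach: grab every log "near" the metric spike.
    return [l for l in logs if abs(l["ts"] - metric["ts"]) <= window_s]

def correlate_by_trace(log, traces):
    # The observability approach: an exact join on the shared trace ID.
    return [t for t in traces if t["trace_id"] == log["trace_id"]]

nearby = correlate_by_time(metric, logs)
print(len(nearby))  # → 1 (fuzzy and window-dependent)
exact = correlate_by_trace(logs[0], traces)
print(exact[0]["duration_ms"])  # → 8200 (the exact slow trace)
```

Propagating one trace ID through every signal is exactly the correlation key that OpenTelemetry standardises.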
The Four Pillars of Observability in 2026
Traces answer: "What path did this request take and how long did each hop take?"
Metrics answer: "How is the system performing in aggregate over time?"
Logs answer: "What discrete events happened and in what context?"
Profiles answer: "Which line of code is consuming the most CPU/memory?"
Without all four — and the links between them — you cannot fully answer "Why is my system slow?"
OpenTelemetry: The Universal Standard
OpenTelemetry (OTel) is a CNCF project that provides a single, vendor-neutral API and SDK for generating traces, metrics, and logs from your code. Before OTel, each observability vendor had their own SDK — switching vendors meant rewriting all instrumentation.
```
Before OTel:                         With OTel:
Jaeger SDK    → Jaeger only          OTel SDK → OTel Collector → Any backend
Datadog SDK   → Datadog only         One API  → Vendor-neutral → Honeycomb/Grafana/Datadog/Jaeger
New Relic SDK → NR only              Switch backends by changing Collector config
```

The OTel Collector: Your Observability Router
The OTel Collector is a pipeline that receives telemetry, processes it, and exports to one or more backends:
```yaml
# otel-collector-config.yaml
receivers:
  otlp:                  # Receive from apps using OTLP protocol
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:                 # Batch for efficiency
    timeout: 1s
    send_batch_size: 1024
  tail_sampling:         # Sample only interesting traces (costly requests, errors)
    decision_wait: 10s
    policies:
      - name: errors-policy
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-traces-policy
        type: latency
        latency: {threshold_ms: 1000}
      - name: probabilistic-low
        type: probabilistic
        probabilistic: {sampling_percentage: 1}  # 1% of normal traces

exporters:
  otlp/tempo:
    endpoint: http://tempo:4317          # Traces → Grafana Tempo
  prometheusremotewrite:
    endpoint: http://mimir/api/v1/push   # Metrics → Grafana Mimir
  loki:
    endpoint: http://loki:3100           # Logs → Grafana Loki

service:
  pipelines:
    traces:  {receivers: [otlp], processors: [batch, tail_sampling], exporters: [otlp/tempo]}
    metrics: {receivers: [otlp], processors: [batch], exporters: [prometheusremotewrite]}
    logs:    {receivers: [otlp], processors: [batch], exporters: [loki]}
```

Implementing OTel in Code: Auto vs Manual Instrumentation
```python
# Auto-instrumentation (zero code changes — inject as agent):
#   pip install opentelemetry-instrumentation-fastapi opentelemetry-instrumentation-sqlalchemy
#   OTEL_SERVICE_NAME=checkout-service opentelemetry-instrument python app.py
# → Automatically instruments HTTP, SQLAlchemy, Redis, httpx calls

# Manual instrumentation for custom business logic:
from opentelemetry import trace
from opentelemetry.trace import StatusCode

tracer = trace.get_tracer("checkout.service")

async def process_payment(order_id: str, amount: float):
    with tracer.start_as_current_span("process_payment") as span:
        # Add business context as span attributes:
        span.set_attribute("order.id", order_id)
        span.set_attribute("payment.amount", amount)
        span.set_attribute("payment.currency", "GBP")
        try:
            result = await stripe_client.charge(amount)
            span.set_attribute("payment.stripe_id", result.id)
            return result
        except stripe.CardError as e:
            # Mark span as error with details:
            span.set_status(StatusCode.ERROR, str(e))
            span.record_exception(e)
            raise PaymentFailedException(str(e)) from e
```

Exemplars: Linking Metrics to Traces
Exemplars are the breakthrough that connects aggregate metrics to individual traces:
Without exemplars:
- Graph: p99 latency = 8.2 seconds (spike at 14:32)
- Action: manually search Jaeger for traces around 14:32 → 10 minutes

With exemplars:
- Graph: click the spike at 14:32 → tooltip shows "TraceID: abc123 (duration: 8.2s)" → click the TraceID → Jaeger opens that exact trace instantly
- Action: 10 seconds to root cause

```python
# Prometheus exemplar with trace ID linkage:
from prometheus_client import Histogram
from opentelemetry import trace

REQUEST_DURATION = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint', 'status']
)

def record_request(method, endpoint, status, duration):
    # Attach current trace ID as exemplar:
    current_span = trace.get_current_span()
    trace_id = format(current_span.get_span_context().trace_id, '032x')
    REQUEST_DURATION.labels(method, endpoint, status).observe(
        duration,
        exemplar={'traceID': trace_id}  # Links this data point to the trace
    )
```

Continuous Profiling: The Fourth Pillar
Continuous Profiling answers: "Which line of code is consuming resources?" Traditional profilers must be attached manually, on demand, and add enough overhead that teams rarely leave them running in production. eBPF-based profilers run continuously in production with < 0.1% CPU overhead:
| Tool | Technology | Overhead | Languages |
|---|---|---|---|
| Parca | eBPF (Linux kernel) | < 0.1% CPU | Any (compiled + JVM + Python) |
| Pyroscope | Pull-based + push SDKs | < 1% | Go, Java, Python, Ruby, .NET |
| Grafana Beyla | eBPF auto-instrumentation | ~1% | Go, Python, Java, Node, Ruby |
| Go pprof | Language built-in | Higher (on-demand) | Go only |
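For contrast with the always-on eBPF profilers in the table, this is roughly what on-demand profiling looks like with Python's stdlib cProfile (the Python analogue of the Go pprof row); the workload function is an invented example.

```python
import cProfile
import io
import pstats

def busy(n: int) -> int:
    # A deliberately hot function so it dominates the profile.
    return sum(i * i for i in range(n))

profiler = cProfile.Profile()
profiler.enable()   # everything between enable/disable is measured
busy(200_000)
profiler.disable()

out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
report = out.getvalue()
print("busy" in report)  # → True (the hot function tops the report)
```

The enable/disable dance and the measurement overhead are exactly what continuous profilers remove: they sample stacks from the kernel all the time, for every process, with no code changes.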
SLO-Driven Alerting and Error Budgets
Modern observability alerts on SLO burn rate, not raw metrics:
```yaml
# Prometheus SLO alerting (rules like these are typically generated by pyrra or sloth):
# SLO: 99.9% of requests succeed in < 500ms over 30 days
# Error budget: 0.1% of 30-day requests ≈ 43.2 minutes of allowed downtime

# Alert 1: Fast burn (critical — page immediately)
# Fires at a 14.4x burn rate, i.e. the current error rate would consume
# 2% of the 30-day budget within 1 hour (0.02 × 720 h / 1 h = 14.4):
- alert: HighErrorRateFastBurn
  expr: |
    (rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])) > 0.001 * 14.4
  labels:
    severity: critical

# Alert 2: Slow burn (warning — needs same-day attention)
# Fires at a 6x burn rate, i.e. the current error rate would consume
# 5% of the 30-day budget within 6 hours (0.05 × 720 h / 6 h = 6):
- alert: HighErrorRateSlowBurn
  expr: |
    (rate(http_requests_total{status=~"5.."}[1h]) / rate(http_requests_total[1h])) > 0.001 * 6
  labels:
    severity: warning
```

Frequently Asked Questions
What's the difference between Prometheus/Grafana and OpenTelemetry? Prometheus is a time-series database and query language (PromQL). OpenTelemetry is an instrumentation standard and collection pipeline. They are complementary: the OTel Collector can scrape Prometheus metrics, and Grafana can display OTel-sourced metrics from Prometheus or Mimir. In 2026, the standard stack is OTel SDKs for instrumentation → OTel Collector for processing → Grafana stack (Tempo, Mimir, Loki) for storage and visualisation.
How do I reduce observability costs without losing critical signals? Tail-based sampling is the most effective technique: collect ALL trace context at the edge, then make a sampling decision only after the full trace is assembled (so you can keep all error traces and slow traces while discarding 99% of successful fast traces). Also: use log levels properly (DEBUG only in staging, WARN/ERROR in production), aggregate metrics instead of emitting a histogram for every request, and delete old data aggressively (90-day retention covers most incidents).
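The tail-based sampling decision described above can be sketched as a function over a completed trace. This is a simplification of what the Collector's tail_sampling processor does, mirroring the three policies from the earlier config; the thresholds and field names here are illustrative.

```python
import random

def keep_trace(has_error: bool, duration_ms: float,
               rng: random.Random, keep_fraction: float = 0.01) -> bool:
    """Decide AFTER the full trace is assembled, mirroring the three
    policies in the collector config: errors, latency, 1% probabilistic."""
    if has_error:
        return True                      # keep every error trace
    if duration_ms > 1000:
        return True                      # keep every slow trace
    return rng.random() < keep_fraction  # keep ~1% of fast, successful traces

rng = random.Random(42)  # seeded so the sketch is reproducible
print(keep_trace(True, 12, rng))     # → True (errors are always kept)
print(keep_trace(False, 4800, rng))  # → True (slow traces are always kept)
kept = sum(keep_trace(False, 50, rng) for _ in range(100_000))
print(kept)  # roughly 1,000 of 100,000 fast successes survive
```

Because the decision waits for the whole trace, no error or outlier is ever lost to sampling; only the cheap, boring 99% is discarded.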
Key Takeaway
Observability architecture is not a tooling decision — it's a cultural and architectural commitment. The investment in OTel instrumentation, exemplar linking, and continuous profiling pays off the first time you diagnose a production incident in 3 minutes instead of 3 hours. In 2026, the observability stack is increasingly commoditised (OpenTelemetry + Grafana or any vendor that accepts OTLP), and the differentiation comes from instrumentation quality — what attributes you add to spans, how rich your log context is, and whether you've linked your metrics to traces with exemplars.
Part of the Software Architecture Hub — comprehensive guides from architectural foundations to advanced distributed systems patterns.
