Observability Architecture in 2026: OpenTelemetry, Continuous Profiling & Exemplars

Table of Contents
- Monitoring vs Observability: A Precise Definition
- The Crisis of Siloed Signals
- The Four Pillars of Observability in 2026
- OpenTelemetry: The Universal Standard
- The OTel Collector: Your Observability Router
- Implementing OTel in Code: Auto vs Manual Instrumentation
- Exemplars: Linking Metrics to Traces
- Continuous Profiling: The Fourth Pillar
- SLO-Driven Alerting and Error Budgets
- Tail-Based Sampling: Controlling Cost Without Losing Signal
- The Observability Stack in 2026
- Frequently Asked Questions
- Key Takeaway
Monitoring vs Observability: A Precise Definition
| | Monitoring | Observability |
|---|---|---|
| Approach | Pre-defined dashboards for known failure modes | Answer arbitrary questions about system behaviour |
| Question type | "Is the disk full?" (known unknowns) | "Why is this user getting errors only on iOS?" (unknown unknowns) |
| Data | Metrics snapshot at intervals | Rich context: traces, logs, profiles, attributes |
| Alert style | Threshold breach → alert | SLO breach + error budget burn rate |
| Investigation | Jump between dashboards | Start from symptom → drill to cause |
Monitoring tells you that something is wrong. Observability tells you why.
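The distinction can be made concrete with a toy sketch (the events and attributes below are invented for illustration): monitoring evaluates a question someone defined in advance, while observability lets you slice rich, attribute-tagged events along dimensions nobody built a dashboard for.

```python
# Toy wide events (invented data): each request carries many attributes.
events = [
    {"service": "checkout", "status": 500, "os": "ios",     "version": "3.2"},
    {"service": "checkout", "status": 200, "os": "android", "version": "3.2"},
    {"service": "checkout", "status": 500, "os": "ios",     "version": "3.2"},
    {"service": "checkout", "status": 200, "os": "ios",     "version": "3.1"},
]

# Monitoring: evaluate a pre-defined question ("what is the error rate?").
error_rate = sum(e["status"] >= 500 for e in events) / len(events)
print(error_rate)  # → 0.5

# Observability: ask a question nobody anticipated ("why only on iOS?").
ios_errors = [e for e in events if e["os"] == "ios" and e["status"] >= 500]
print({e["version"] for e in ios_errors})  # → {'3.2'} (errors cluster on one release)
```

The key enabler is that every event keeps its full context, so new questions never require new instrumentation.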
The Crisis of Siloed Signals
The original "three pillars" model recommended separate tools for each signal:
```
Metrics (Prometheus/Datadog) ──────────────── Silo A
Logs (Elasticsearch/Splunk) ───────────────── Silo B
Traces (Jaeger/Zipkin) ────────────────────── Silo C
```

During an incident at 3am:
- Alert fires: p99 latency > 5 seconds
- Open Datadog → find the spike on checkout-service
- Switch to Splunk → search logs for checkout-service errors (10 minutes)
- Open Jaeger → find slow traces for checkout-service (another 5 minutes)
- Still don't know which line of code is slow (no profiles)

Mean Time to Diagnose: 45+ minutes
The problem is correlation — each tool has its own time axis, its own request IDs, its own service naming. Stitching the picture together manually is the bottleneck.
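The correlation bottleneck can be seen in miniature (the records below are invented): with only timestamps you get fuzzy, window-dependent matches, whereas a shared trace ID turns the same question into an exact join.

```python
# Invented records from three separate "silos":
metric = {"ts": 100.0, "p99_ms": 8200, "service": "checkout-service"}
logs = [
    {"ts": 100.2, "msg": "card declined", "trace_id": "abc123"},
    {"ts": 101.9, "msg": "ok",            "trace_id": "def456"},
]
traces = [
    {"trace_id": "abc123", "duration_ms": 8200},
    {"trace_id": "def456", "duration_ms": 40},
]

def correlate_by_time(metric, logs, window_s=1.0):
    # The manual 3am approach: grab every log "near" the metric spike.
    return [l for l in logs if abs(l["ts"] - metric["ts"]) <= window_s]

def correlate_by_trace(log, traces):
    # The observability approach: an exact join on the shared trace ID.
    return [t for t in traces if t["trace_id"] == log["trace_id"]]

nearby = correlate_by_time(metric, logs)
print(len(nearby))  # → 1 (fuzzy and window-dependent)
exact = correlate_by_trace(logs[0], traces)
print(exact[0]["duration_ms"])  # → 8200 (the exact slow trace)
```

Propagating one trace ID through every signal is exactly the correlation key that OpenTelemetry standardises.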
The Four Pillars of Observability in 2026
Traces answer: "What path did this request take and how long did each hop take?"
Metrics answer: "How is the system performing in aggregate over time?"
Logs answer: "What discrete events happened and in what context?"
Profiles answer: "Which line of code is consuming the most CPU/memory?"
Without all four — and the links between them — you cannot fully answer "Why is my system slow?"
OpenTelemetry: The Universal Standard
OpenTelemetry (OTel) is a CNCF project that provides a single, vendor-neutral API and SDK for generating traces, metrics, and logs from your code. Before OTel, each observability vendor had their own SDK — switching vendors meant rewriting all instrumentation.
```
Before OTel:                         With OTel:
Jaeger SDK    → Jaeger only          OTel SDK → OTel Collector → Any backend
Datadog SDK   → Datadog only         One API  → Vendor-neutral → Honeycomb/Grafana/Datadog/Jaeger
New Relic SDK → NR only              Switch backends by changing Collector config
```

The OTel Collector: Your Observability Router
The OTel Collector is a pipeline that receives telemetry, processes it, and exports to one or more backends:
```yaml
# otel-collector-config.yaml
receivers:
  otlp:                  # Receive from apps using OTLP protocol
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:                 # Batch for efficiency
    timeout: 1s
    send_batch_size: 1024
  tail_sampling:         # Sample only interesting traces (costly requests, errors)
    decision_wait: 10s
    policies:
      - name: errors-policy
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-traces-policy
        type: latency
        latency: {threshold_ms: 1000}
      - name: probabilistic-low
        type: probabilistic
        probabilistic: {sampling_percentage: 1}  # 1% of normal traces

exporters:
  otlp/tempo:
    endpoint: http://tempo:4317          # Traces → Grafana Tempo
  prometheusremotewrite:
    endpoint: http://mimir/api/v1/push   # Metrics → Grafana Mimir
  loki:
    endpoint: http://loki:3100           # Logs → Grafana Loki

service:
  pipelines:
    traces:  {receivers: [otlp], processors: [batch, tail_sampling], exporters: [otlp/tempo]}
    metrics: {receivers: [otlp], processors: [batch], exporters: [prometheusremotewrite]}
    logs:    {receivers: [otlp], processors: [batch], exporters: [loki]}
```

Implementing OTel in Code: Auto vs Manual Instrumentation
```python
# Auto-instrumentation (zero code changes — inject as agent):
#   pip install opentelemetry-instrumentation-fastapi opentelemetry-instrumentation-sqlalchemy
#   OTEL_SERVICE_NAME=checkout-service opentelemetry-instrument python app.py
# → Automatically instruments HTTP, SQLAlchemy, Redis, httpx calls

# Manual instrumentation for custom business logic:
from opentelemetry import trace
from opentelemetry.trace import StatusCode

tracer = trace.get_tracer("checkout.service")

async def process_payment(order_id: str, amount: float):
    with tracer.start_as_current_span("process_payment") as span:
        # Add business context as span attributes:
        span.set_attribute("order.id", order_id)
        span.set_attribute("payment.amount", amount)
        span.set_attribute("payment.currency", "GBP")
        try:
            result = await stripe_client.charge(amount)
            span.set_attribute("payment.stripe_id", result.id)
            return result
        except stripe.CardError as e:
            # Mark span as error with details:
            span.set_status(StatusCode.ERROR, str(e))
            span.record_exception(e)
            raise PaymentFailedException(str(e)) from e
```

Exemplars: Linking Metrics to Traces
Exemplars are the breakthrough that connects aggregate metrics to individual traces:
Without exemplars:
- Graph: p99 latency = 8.2 seconds (spike at 14:32)
- Action: manually search Jaeger for traces around 14:32 → 10 minutes

With exemplars:
- Graph: click the spike at 14:32 → tooltip shows "TraceID: abc123 (duration: 8.2s)" → click the TraceID → Jaeger opens that exact trace instantly
- Action: 10 seconds to root cause

```python
# Prometheus exemplar with trace ID linkage:
from prometheus_client import Histogram
from opentelemetry import trace

REQUEST_DURATION = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint', 'status']
)

def record_request(method, endpoint, status, duration):
    # Attach current trace ID as exemplar:
    current_span = trace.get_current_span()
    trace_id = format(current_span.get_span_context().trace_id, '032x')
    REQUEST_DURATION.labels(method, endpoint, status).observe(
        duration,
        exemplar={'traceID': trace_id}  # Links this data point to the trace
    )
```

Continuous Profiling: The Fourth Pillar
Continuous Profiling answers: "Which line of code is consuming resources?" Traditional profilers must be attached manually, on demand, and add enough overhead that teams rarely leave them running in production. eBPF-based profilers run continuously in production with < 0.1% CPU overhead:
| Tool | Technology | Overhead | Languages |
|---|---|---|---|
| Parca | eBPF (Linux kernel) | < 0.1% CPU | Any (compiled + JVM + Python) |
| Pyroscope | Pull-based + push SDKs | < 1% | Go, Java, Python, Ruby, .NET |
| Grafana Beyla | eBPF auto-instrumentation | ~1% | Go, Python, Java, Node, Ruby |
| Go pprof | Language built-in | Higher (on-demand) | Go only |
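For contrast with the always-on eBPF profilers in the table, this is roughly what on-demand profiling looks like with Python's stdlib cProfile (the Python analogue of the Go pprof row); the workload function is an invented example.

```python
import cProfile
import io
import pstats

def busy(n: int) -> int:
    # A deliberately hot function so it dominates the profile.
    return sum(i * i for i in range(n))

profiler = cProfile.Profile()
profiler.enable()   # everything between enable/disable is measured
busy(200_000)
profiler.disable()

out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
report = out.getvalue()
print("busy" in report)  # → True (the hot function tops the report)
```

The enable/disable dance and the measurement overhead are exactly what continuous profilers remove: they sample stacks from the kernel all the time, for every process, with no code changes.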
SLO-Driven Alerting and Error Budgets
Modern observability alerts on SLO burn rate, not raw metrics:
```yaml
# Prometheus SLO alerting (rules like these are typically generated by pyrra or sloth):
# SLO: 99.9% of requests succeed in < 500ms over 30 days
# Error budget: 0.1% of 30-day requests ≈ 43.2 minutes of allowed downtime

# Alert 1: Fast burn (critical — page immediately)
# Fires at a 14.4x burn rate, i.e. the current error rate would consume
# 2% of the 30-day budget within 1 hour (0.02 × 720 h / 1 h = 14.4):
- alert: HighErrorRateFastBurn
  expr: |
    (rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])) > 0.001 * 14.4
  labels:
    severity: critical

# Alert 2: Slow burn (warning — needs same-day attention)
# Fires at a 6x burn rate, i.e. the current error rate would consume
# 5% of the 30-day budget within 6 hours (0.05 × 720 h / 6 h = 6):
- alert: HighErrorRateSlowBurn
  expr: |
    (rate(http_requests_total{status=~"5.."}[1h]) / rate(http_requests_total[1h])) > 0.001 * 6
  labels:
    severity: warning
```

Frequently Asked Questions
What's the difference between Prometheus/Grafana and OpenTelemetry? Prometheus is a time-series database and query language (PromQL). OpenTelemetry is an instrumentation standard and collection pipeline. They are complementary: the OTel Collector can scrape Prometheus metrics, and Grafana can display OTel-sourced metrics from Prometheus or Mimir. In 2026, the standard stack is OTel SDKs for instrumentation → OTel Collector for processing → Grafana stack (Tempo, Mimir, Loki) for storage and visualisation.
How do I reduce observability costs without losing critical signals? Tail-based sampling is the most effective technique: collect ALL trace context at the edge, then make a sampling decision only after the full trace is assembled (so you can keep all error traces and slow traces while discarding 99% of successful fast traces). Also: use log levels properly (DEBUG only in staging, WARN/ERROR in production), aggregate metrics instead of emitting a histogram for every request, and delete old data aggressively (90-day retention covers most incidents).
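The tail-based sampling decision described above can be sketched as a function over a completed trace. This is a simplification of what the Collector's tail_sampling processor does, mirroring the three policies from the earlier config; the thresholds and field names here are illustrative.

```python
import random

def keep_trace(has_error: bool, duration_ms: float,
               rng: random.Random, keep_fraction: float = 0.01) -> bool:
    """Decide AFTER the full trace is assembled, mirroring the three
    policies in the collector config: errors, latency, 1% probabilistic."""
    if has_error:
        return True                      # keep every error trace
    if duration_ms > 1000:
        return True                      # keep every slow trace
    return rng.random() < keep_fraction  # keep ~1% of fast, successful traces

rng = random.Random(42)  # seeded so the sketch is reproducible
print(keep_trace(True, 12, rng))     # → True (errors are always kept)
print(keep_trace(False, 4800, rng))  # → True (slow traces are always kept)
kept = sum(keep_trace(False, 50, rng) for _ in range(100_000))
print(kept)  # roughly 1,000 of 100,000 fast successes survive
```

Because the decision waits for the whole trace, no error or outlier is ever lost to sampling; only the cheap, boring 99% is discarded.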
Key Takeaway
Observability architecture is not a tooling decision — it's a cultural and architectural commitment. The investment in OTel instrumentation, exemplar linking, and continuous profiling pays off the first time you diagnose a production incident in 3 minutes instead of 3 hours. In 2026, the observability stack is increasingly commoditised (OpenTelemetry + Grafana or any vendor that accepts OTLP), and the differentiation comes from instrumentation quality — what attributes you add to spans, how rich your log context is, and whether you've linked your metrics to traces with exemplars.
Part of the Software Architecture Hub — comprehensive guides from architectural foundations to advanced distributed systems patterns.
