Observability: Logging, Metrics, and Tracing

Monitoring tells you something is wrong. Observability tells you why. A system is observable when you can answer questions about its internal state from its external outputs—without needing to add new instrumentation every time a new failure mode appears. The three pillars are logs (what happened), metrics (how often / how long), and traces (the path of a single request through distributed services).


The Three Pillars

text
Logs:     "UserService returned 500 for userId=abc at 14:23:01.445"
          → What happened, in detail

Metrics:  http_requests_total{status="500"} = 847 (in last 5 min)
          → Aggregated trends, suitable for alerting

Traces:   Request abc-123 → API Gateway (2ms) → UserService (45ms)
                         → Database (38ms of that 45ms)
          → Latency breakdown across services for a single request

Each pillar complements the others:

  • Metrics alert you that P99 latency spiked
  • Traces show you which service is slow
  • Logs show you the exact error in that service

Structured Logging

Plain text logs are human-readable but machine-unqueryable. Structured logs emit JSON, enabling powerful aggregation and filtering.

typescript
// Bad: unstructured log
console.log(`Error processing order ${orderId} for user ${userId}: ${error.message}`);
// Output: "Error processing order ord-123 for user usr-456: timeout"
// Hard to query by orderId in Datadog/CloudWatch/Loki

// Good: structured log
import pino from 'pino';

const logger = pino({
  level: process.env.LOG_LEVEL ?? 'info',
  formatters: {
    level: (label) => ({ level: label }),
  },
  redact: ['req.headers.authorization', 'user.password'], // never log secrets
});

logger.error(
  {
    orderId: 'ord-123',
    userId: 'usr-456',
    error: { message: error.message, stack: error.stack, code: error.code },
    durationMs: 3021,
    service: 'order-service',
    traceId: ctx.traceId,   // correlate with distributed traces
  },
  'Order processing failed'
);

Output:

json
{
  "level": "error",
  "time": 1714300981445,
  "orderId": "ord-123",
  "userId": "usr-456",
  "error": { "message": "timeout", "code": "ETIMEDOUT" },
  "durationMs": 3021,
  "service": "order-service",
  "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
  "msg": "Order processing failed"
}

Log Levels and When to Use Each

Level   Use Case                               Example
trace   Verbose debugging, disabled in prod    "Entering function calculateTax"
debug   Development diagnostics                "Cache miss for key user:123"
info    Significant business events            "Order created: ord-123, $99.99"
warn    Degraded state, not yet broken         "Retry 2/3 for payment service"
error   Operation failed, user impacted        "Order processing failed: timeout"
fatal   Service cannot continue                "Cannot connect to database, shutting down"
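
To make the hierarchy concrete, here is a minimal sketch of one call per level using pino directly; the field names and messages are illustrative, not tied to any particular service:

typescript
// Illustrative only: one call per level. At level 'info', trace and debug are suppressed.
import pino from 'pino';

const logger = pino({ level: process.env.LOG_LEVEL ?? 'info' });

logger.trace({ fn: 'calculateTax' }, 'Entering function');
logger.debug({ cacheKey: 'user:123' }, 'Cache miss');
logger.info({ orderId: 'ord-123', amount: 99.99 }, 'Order created');
logger.warn({ attempt: 2, maxAttempts: 3 }, 'Retrying payment service');
logger.error({ orderId: 'ord-123', reason: 'timeout' }, 'Order processing failed');
logger.fatal({ reason: 'db_unreachable' }, 'Cannot connect to database, shutting down');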

Request Context Propagation

Every log line should include the trace ID and request ID so you can correlate all logs for a single request:

typescript
import { AsyncLocalStorage } from 'async_hooks';

const requestContext = new AsyncLocalStorage<{ traceId: string; requestId: string }>();

// Express middleware
app.use((req, res, next) => {
  const traceId = req.headers['x-trace-id'] as string ?? generateId();
  const requestId = generateId();
  
  requestContext.run({ traceId, requestId }, () => {
    // Logs emitted through the context-aware wrapper below pick up this traceId
    next();
  });
});

// Logger wrapper that injects context
function getLogger(service: string) {
  return {
    info: (data: object, msg: string) => {
      const ctx = requestContext.getStore();
      logger.info({ ...data, ...ctx, service }, msg);
    },
    error: (data: object, msg: string) => {
      const ctx = requestContext.getStore();
      logger.error({ ...data, ...ctx, service }, msg);
    },
  };
}
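
A hypothetical route handler using that wrapper might look like the sketch below; createOrder and the response shapes are placeholders, and the Express middleware above is assumed to be installed:

typescript
// Hypothetical usage: traceId/requestId come from AsyncLocalStorage via the wrapper,
// so the handler only passes fields specific to this event. createOrder is a placeholder.
const log = getLogger('order-service');

app.post('/orders', async (req, res) => {
  log.info({ userId: req.body.userId }, 'Order received');
  try {
    const order = await createOrder(req.body);
    log.info({ orderId: order.id }, 'Order created');
    res.status(201).json(order);
  } catch (err) {
    log.error({ err }, 'Order creation failed');
    res.status(500).json({ error: 'internal_error' });
  }
});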

Metrics with Prometheus

Prometheus scrapes a /metrics HTTP endpoint at regular intervals and stores time-series data. Four metric types:

text
Counter:   monotonically increasing (requests, errors, bytes sent)
Gauge:     current value that can go up or down (active connections, queue depth)
Histogram: distribution of values (request duration, response size)
Summary:   similar to histogram but computes quantiles client-side

typescript
import { Counter, Gauge, Histogram, Registry } from 'prom-client';

const registry = new Registry();

// HTTP request counter
const httpRequests = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status_code'],
  registers: [registry],
});

// Active connections gauge
const activeConnections = new Gauge({
  name: 'active_connections',
  help: 'Number of active connections',
  registers: [registry],
});

// Request duration histogram (for percentile queries)
const requestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request latency',
  labelNames: ['method', 'route'],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
  registers: [registry],
});

// Express middleware
app.use((req, res, next) => {
  const end = requestDuration.startTimer({ method: req.method, route: req.path });
  activeConnections.inc();
  
  res.on('finish', () => {
    end(); // records duration
    activeConnections.dec();
    httpRequests.inc({
      method: req.method,
      route: req.route?.path ?? req.path,
      status_code: res.statusCode.toString(),
    });
  });
  
  next();
});

// Metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', registry.contentType);
  res.send(await registry.metrics());
});

Prometheus Queries (PromQL)

promql
# Request rate over the last 5 minutes
rate(http_requests_total[5m])

# Error rate (share of requests returning 5xx)
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# P99 request latency
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# P99 latency grouped by route
histogram_quantile(0.99,
  sum by (route, le) (rate(http_request_duration_seconds_bucket[5m]))
)

# Share of requests completing within 500ms (a simplified Apdex-style check)
(
  sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
  /
  sum(rate(http_request_duration_seconds_count[5m]))
)

Alerting Rules

yaml
# prometheus-alerts.yml
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum by (route) (rate(http_requests_total{status_code=~"5.."}[5m]))
          / sum by (route) (rate(http_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate > 5% on {{ $labels.route }}"
          description: "Current error rate: {{ $value | humanizePercentage }}"

      - alert: HighP99Latency
        expr: |
          histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency > 2s"

Distributed Tracing with OpenTelemetry

When a request touches multiple services, you need traces to see the full picture. OpenTelemetry is the open standard—vendor-neutral instrumentation that exports to Jaeger, Zipkin, DataDog, Honeycomb, and others.

text
Trace: A single request's journey
  └── Span: API Gateway (total: 52ms)
        ├── Span: UserService.getUser (8ms)
        │     └── Span: DB query (6ms)
        └── Span: OrderService.createOrder (40ms)
              ├── Span: DB insert (12ms)
              └── Span: Kafka publish (5ms)

Setting Up OpenTelemetry in Node.js

typescript
// otel.ts — load this BEFORE all other imports
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'order-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: process.env.npm_package_version,
    environment: process.env.NODE_ENV,
  }),
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      // Automatically traces: express, http, pg, redis, kafkajs, grpc, ...
      '@opentelemetry/instrumentation-fs': { enabled: false }, // too noisy
    }),
  ],
});

sdk.start();
process.on('SIGTERM', () => sdk.shutdown());

Manual Spans for Business Logic

typescript
import { trace, context, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('order-service', '1.0.0');

async function processOrder(orderId: string): Promise<void> {
  // Create a span for this business operation
  return tracer.startActiveSpan('processOrder', async (span) => {
    span.setAttribute('order.id', orderId);
    
    try {
      const order = await fetchOrder(orderId);
      span.setAttribute('order.amount', order.amount);
      span.setAttribute('order.userId', order.userId);
      
      // Child spans are automatically created by auto-instrumentation
      // for the DB queries inside fetchOrder
      await chargePayment(order);
      await sendConfirmation(order);
      
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (error) {
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: (error as Error).message,
      });
      span.recordException(error as Error);
      throw error;
    } finally {
      span.end();
    }
  });
}

The RED Method for Services

For each service, track three metrics:

text
R = Rate      → requests per second
E = Errors    → error rate
D = Duration  → distribution of latency (especially P99)

Dashboard template:
┌────────────────────────────────────────┐
│ Service: order-service                 │
│ Rate:    342 req/s                     │
│ Errors:  0.3% (5xx / total)            │
│ P50:     45ms  P95: 180ms  P99: 890ms  │
└────────────────────────────────────────┘
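
If you want those three numbers outside of Grafana, Prometheus exposes them through its HTTP query API. The sketch below assumes a reachable Prometheus server (PROM_URL is a placeholder) and the metric names defined in the middleware earlier:

typescript
// Sketch: pull RED numbers from Prometheus' instant-query API (Node 18+ global fetch).
// PROM_URL is a placeholder; metric names match the middleware defined earlier.
const PROM_URL = process.env.PROM_URL ?? 'http://prometheus:9090';

async function promQuery(query: string): Promise<number> {
  const res = await fetch(`${PROM_URL}/api/v1/query?query=${encodeURIComponent(query)}`);
  const body = (await res.json()) as any;
  // Instant queries return a vector; each element's value is [timestamp, "stringValue"].
  return parseFloat(body.data.result[0]?.value[1] ?? 'NaN');
}

async function redSnapshot() {
  const ratePerSec = await promQuery('sum(rate(http_requests_total[5m]))');
  const errorRate = await promQuery(
    'sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'
  );
  const p99Seconds = await promQuery(
    'histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))'
  );
  console.log({ ratePerSec, errorRate, p99Seconds });
}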

Grafana Dashboard as Code

json
{
  "title": "Service Overview",
  "panels": [
    {
      "title": "Request Rate",
      "type": "stat",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total[5m]))",
          "legendFormat": "req/s"
        }
      ]
    },
    {
      "title": "Error Rate",
      "type": "timeseries",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total{status_code=~'5..'}[5m])) / sum(rate(http_requests_total[5m]))",
          "legendFormat": "error rate"
        }
      ],
      "thresholds": {
        "steps": [
          { "value": 0.01, "color": "yellow" },
          { "value": 0.05, "color": "red" }
        ]
      }
    },
    {
      "title": "Request Duration (P99)",
      "type": "timeseries",
      "targets": [
        {
          "expr": "histogram_quantile(0.99, sum by(le) (rate(http_request_duration_seconds_bucket[5m])))",
          "legendFormat": "p99"
        }
      ]
    }
  ]
}

Frequently Asked Questions

Q: How much does structured logging slow down my application?

Structured logging with a high-performance library like pino (Node.js) or zerolog (Go) adds negligible overhead—typically 1–5 microseconds per log call. The bottleneck is usually I/O (writing to stdout, file, or network). Use asynchronous logging (write to a buffer, flush in background) in high-throughput paths. For the hottest code paths, guard debug-level logs: if (logger.isLevelEnabled('debug')) { logger.debug({...heavyComputation()}, 'debug info'); }.

Q: What is the difference between Prometheus histograms and summaries?

Both measure value distributions, but histograms compute quantiles server-side (in Prometheus) at query time using bucket boundaries, while summaries compute quantiles client-side in the application. Histograms are preferred because: (1) you can aggregate across instances in PromQL (histogram_quantile works across replicas), (2) you can compute any percentile at query time without reconfiguring the application, (3) they are more accurate for aggregated fleet metrics. Use summaries only when you need very accurate quantiles for a single instance and have stable, predictable distributions.
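
To make the contrast concrete, here is a sketch of a prom-client Summary: the quantiles are declared up front and estimated in-process, unlike the Histogram shown earlier, which only exports bucket counters. The metric name and window settings are illustrative:

typescript
import { Summary } from 'prom-client';

// Client-side quantiles: each instance estimates these percentiles itself over a
// sliding window, so they cannot be re-aggregated across replicas in PromQL.
const requestDurationSummary = new Summary({
  name: 'http_request_duration_summary_seconds',   // illustrative name
  help: 'HTTP request latency with client-side quantiles',
  percentiles: [0.5, 0.9, 0.99],
  maxAgeSeconds: 600,  // quantiles are computed over the last 10 minutes
  ageBuckets: 5,
});

const end = requestDurationSummary.startTimer();
// ... handle the request ...
end();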

Q: How do I correlate logs, metrics, and traces in practice?

The key is including the trace ID in every log line. When an alert fires on a metrics spike, you can look at traces from that time window to find slow spans, then look at logs filtered by the trace IDs of those slow traces to find the specific error messages. OpenTelemetry's auto-instrumentation automatically propagates trace context via traceparent HTTP headers between services. In Grafana, the "Explore" view lets you jump from a metric to logs and traces in the same time window with one click when you have Loki (logs) and Tempo (traces) configured alongside Prometheus.
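
With pino, one low-effort way to get the trace ID into every log line is a mixin that reads the active OpenTelemetry span; this is a sketch assuming the SDK setup shown earlier has been loaded:

typescript
import { trace } from '@opentelemetry/api';
import pino from 'pino';

// Sketch: a pino mixin runs on every log call and merges its return value into
// the log object, so the active trace and span IDs appear in every line.
const logger = pino({
  mixin() {
    const span = trace.getActiveSpan();
    if (!span) return {};
    const { traceId, spanId } = span.spanContext();
    return { traceId, spanId };
  },
});

logger.info({ orderId: 'ord-123' }, 'Order processing started');
// => includes "traceId" and "spanId" whenever called inside an active span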

Q: What should I alert on vs. just record?

Alert on symptoms (user-visible impact), not causes. The SRE "golden signals" are the right alert targets: error rate, latency (P99), saturation (queue depth, CPU above 90%), and traffic (anomalous drop). Avoid alerting on cause-level metrics like CPU spikes or memory usage in isolation—these often resolve without user impact. Record everything; alert sparingly. A well-tuned system has fewer than 5 active alert rules. More than that and on-call engineers start ignoring alerts—which defeats the purpose.