The 50TB problem

Somewhere in your infrastructure, there's a logging pipeline ingesting gigabytes per hour. Every request is logged. Every error is logged. Every database query, every cache hit, every HTTP response. It all flows into Elasticsearch or Splunk or CloudWatch, where it's indexed, compressed, and stored at considerable expense.

And when something breaks at 2 AM, your on-call engineer opens the log viewer and types a search query. They know the error message because a user reported it. They find the log line. They see the error. They do not see why it happened, what caused it, or which other systems were affected.

This is log-driven debugging. It works when you already know what you're looking for. It fails completely when you don't.

The three pillars and the one everyone skips

You've heard about the three pillars of observability: logs, metrics, and traces. Most teams implement them in that order and stop after the first one.

Logs tell you what happened. A log line is a point-in-time record of a discrete event. "User 4523 failed to authenticate at 14:32:07." Useful, but isolated. You can only find what you already knew to search for.

Metrics tell you what's happening now. Request rate, error rate, latency percentiles, CPU utilization. They're aggregated, cheap to store, and perfect for dashboards and alerts. But they tell you that something is wrong, not what's wrong.

Traces tell you how a request flowed through the system. A trace follows a single request from ingress through every service, database call, and queue interaction it touched. Traces are the missing link that connects "the error rate spiked" (metrics) to "request X failed in service Y because of a timeout calling service Z" (the actual root cause).

Most teams have logs. Many teams have metrics. Very few teams have traces. And traces are the one that actually makes a system observable.

Grep-driven debugging is not observability

Here's the debugging workflow in a log-only world:

  1. Alert fires: "Error rate above 5% on the checkout service."
  2. Engineer opens logs, searches for errors in the checkout service.
  3. Finds "PaymentProcessingException: timeout after 5000ms."
  4. Searches for the payment service logs around the same time.
  5. Finds "DatabaseConnectionPool: all connections busy."
  6. Searches for database logs.
  7. Finds a slow query that was holding connections.
  8. Realizes the slow query was triggered by a batch job in a completely different service.

Each step required the engineer to form a hypothesis and search for confirming evidence. The investigation took 45 minutes and relied entirely on the engineer's knowledge of the system architecture. A new team member without that mental model would still be on step 3.

With distributed tracing, the same investigation looks like this:

  1. Alert fires with a sample trace ID attached.
  2. Engineer opens the trace view.
  3. Sees the full request path: API gateway, checkout service (200ms), payment service (5000ms timeout), database query (4800ms).
  4. Clicks the database span. Sees the query. Sees it's a full table scan.
  5. Root cause identified in 2 minutes.

The trace doesn't just show you the error. It shows you the causal chain.

Structured events, not log lines

The single biggest practical improvement you can make to your observability stack is switching from unstructured log lines to structured events.

An unstructured log line:

2024-03-15 14:32:07 ERROR PaymentService - Failed to process payment for order 89234: timeout after 5000ms

A structured event:

{
  "timestamp": "2024-03-15T14:32:07.234Z",
  "level": "error",
  "service": "payment-service",
  "trace_id": "abc123def456",
  "span_id": "789ghi",
  "order_id": "89234",
  "operation": "process_payment",
  "duration_ms": 5000,
  "error": "timeout",
  "payment_provider": "stripe",
  "retry_count": 2
}

The structured event carries context. You can query by order_id, filter by payment_provider, correlate by trace_id, and aggregate by operation. You can ask questions you didn't anticipate when you wrote the instrumentation. "What's the p99 latency for Stripe payments in the EU region?" is a query on structured events. It's impossible with grep.
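As a sketch of how events like the one above get emitted, here's a stdlib-only JSON formatter for Python's logging module. The service name and field names are illustrative; in practice you'd use a library like structlog or your platform's SDK.

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as one structured JSON event."""

    def format(self, record):
        event = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "service": "payment-service",  # illustrative; inject per service
            "message": record.getMessage(),
        }
        # Merge structured fields passed via logging's `extra` mechanism.
        event.update(getattr(record, "fields", {}))
        return json.dumps(event)

logger = logging.getLogger("payment-service")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("payment timeout", extra={"fields": {
    "trace_id": "abc123def456",
    "order_id": "89234",
    "operation": "process_payment",
    "duration_ms": 5000,
    "error": "timeout",
}})
```

Every field in the `fields` dict becomes queryable downstream; the free-text message is demoted to just another field.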

The cardinality trap

Structured events sound great until you get the bill.

Every unique field value in your events is a cardinality dimension. status_code has maybe 20 unique values. Low cardinality. Cheap to index, great for dashboards. user_id has millions of unique values. High cardinality. Essential for debugging specific user issues. Extremely expensive to index.

request_id is unique for every single event. Effectively unbounded cardinality. Indexing it means indexing everything, which defeats the purpose of indexing.

This is where observability gets expensive. Not from volume (storing bytes is cheap) but from cardinality (indexing unique values is not). A team I worked with was spending $50,000/month on their observability platform, and 80% of that cost was from indexing three high-cardinality fields on every event.

The solution is tiered storage. Low-cardinality fields (service, status_code, region) go into your metrics system. They're aggregated and cheap. High-cardinality fields (user_id, order_id) go into your trace storage, where you only query them when investigating specific issues. The raw events go into cold storage (S3, GCS) where you can run batch queries if needed but don't pay for real-time indexing.
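The tiered split can be sketched as a routing function. The field lists here are hypothetical; the point is that each event fans out to three destinations with very different indexing costs.

```python
# Hypothetical routing rules: low-cardinality fields become metric labels,
# high-cardinality fields go to trace storage, and the full event lands in
# cold storage unindexed.
METRIC_FIELDS = {"service", "status_code", "region", "operation"}
TRACE_FIELDS = {"trace_id", "span_id", "user_id", "order_id"}

def route_event(event: dict) -> dict:
    """Split one structured event across the three storage tiers."""
    metrics = {k: v for k, v in event.items() if k in METRIC_FIELDS}
    traces = {k: v for k, v in event.items() if k in TRACE_FIELDS}
    # Everything, including high-cardinality fields, goes to cold storage
    # (e.g. S3) for batch queries -- cheap because nothing is indexed.
    return {"metrics": metrics, "traces": traces, "cold": event}

routed = route_event({
    "service": "payment-service",
    "status_code": 504,
    "trace_id": "abc123def456",
    "order_id": "89234",
    "duration_ms": 5000,
})
```

Note that order_id never reaches the metrics tier: that's exactly the field that was costing $40,000/month to index in real time.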

A practical instrumentation strategy

If you're starting from zero, or worse, starting from 50TB of unstructured logs, here's the order of operations.

Step 1: Traces. Instrument your services with OpenTelemetry. Start at the boundaries: HTTP handlers, database clients, message queue consumers. Every incoming request gets a trace_id. Every outgoing call creates a child span. This alone gives you the causal chain that logs never will.

OpenTelemetry is vendor-neutral. You can export to Jaeger, Tempo, Zipkin, or any commercial platform. Start with Grafana Tempo if you want free and open source. The instrumentation stays the same regardless of the backend.
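Conceptually, what the instrumentation builds is a tree of spans sharing one trace_id. Here's a toy stdlib sketch of that model; it is not the OpenTelemetry API, just an illustration of what the library handles for you, including context propagation across nested calls.

```python
import contextvars
import time
import uuid

# The span currently "in scope" -- what tracing libraries track so that a
# new span can find its parent automatically.
current_span = contextvars.ContextVar("current_span", default=None)
finished_spans = []

class Span:
    def __init__(self, name):
        parent = current_span.get()
        self.name = name
        # Child spans inherit the trace_id; root spans mint a new one.
        self.trace_id = parent.trace_id if parent else uuid.uuid4().hex
        self.span_id = uuid.uuid4().hex[:16]
        self.parent_id = parent.span_id if parent else None

    def __enter__(self):
        self.start = time.monotonic()
        self._token = current_span.set(self)
        return self

    def __exit__(self, *exc):
        self.duration_ms = (time.monotonic() - self.start) * 1000
        current_span.reset(self._token)
        finished_spans.append(self)  # a real SDK would export these

# Nested spans share a trace_id: this is the causal chain.
with Span("checkout"):
    with Span("payment"):
        with Span("db_query"):
            pass
```

In a real system the trace_id also travels across the network in request headers (W3C `traceparent`), which is how the chain continues into the next service.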

Step 2: Metrics for SLOs. Define your service level objectives. "99.9% of checkout requests complete in under 2 seconds." Then instrument exactly what you need to measure that: latency histograms, error rates, and saturation (how full are your connection pools, queues, thread pools).

Use the RED method for request-driven services (Rate, Errors, Duration) and the USE method for resources (Utilization, Saturation, Errors). Resist the urge to measure everything. Measure what your SLOs require and what your on-call engineers need for triage.
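A minimal RED recorder for one service might look like this. The bucket boundaries are illustrative; in practice you'd use a metrics library such as prometheus_client, which exposes the same counter-plus-histogram shape.

```python
import bisect

# Histogram bucket upper bounds in milliseconds (illustrative).
BUCKETS_MS = [5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000]

class RedMetrics:
    """Rate, Errors, Duration for a single request-driven service."""

    def __init__(self):
        self.requests = 0                                  # Rate
        self.errors = 0                                    # Errors
        self.bucket_counts = [0] * (len(BUCKETS_MS) + 1)   # Duration
        # The extra bucket at the end catches everything over 5000ms.

    def observe(self, duration_ms, is_error=False):
        self.requests += 1
        self.errors += int(is_error)
        # Count the observation in the first bucket that can hold it.
        self.bucket_counts[bisect.bisect_left(BUCKETS_MS, duration_ms)] += 1

    def error_rate(self):
        return self.errors / self.requests if self.requests else 0.0

m = RedMetrics()
for d in (12, 80, 450, 6000):
    m.observe(d, is_error=(d >= 5000))
```

A scraper reads these counters periodically; percentiles like the 2-second SLO threshold are then derived from the bucket counts, not from storing every duration.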

Step 3: Logs for the unexpected. Once you have traces and metrics, logs become surgical. You don't log every request (traces cover that). You don't log aggregate stats (metrics cover that). You log the anomalies: retries, fallbacks, circuit breaker trips, unexpected input, business rule violations. These are the events that traces and metrics hint at but don't explain.
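Retries are a good example of a surgical log. A sketch, assuming a hypothetical `call_with_retry` helper: the happy path emits nothing (traces cover it), but each retry emits one structured event.

```python
import json
import time

def call_with_retry(fn, attempts=3, base_delay=0.05):
    """Invoke fn, retrying on failure with exponential backoff and
    emitting one structured event per retry -- the anomaly worth logging."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:
            # Log the anomaly, not the happy path: traces already record
            # every successful call.
            print(json.dumps({
                "level": "warn",
                "event": "retry",
                "attempt": attempt,
                "error": str(exc),
            }))
            if attempt == attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))
```

Circuit breaker trips, fallbacks, and rejected input follow the same pattern: one event per anomaly, carrying enough fields to correlate back to a trace.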

The tools, briefly

Instrumentation: OpenTelemetry. It's the standard. It supports every major language. Use it.

Trace storage: Grafana Tempo (column-oriented, cheap), Jaeger (battle-tested), or a commercial platform if you want managed operations.

Metrics: Prometheus for collection, Grafana for visualization. VictoriaMetrics if Prometheus's memory usage bothers you.

Log storage: ClickHouse for structured event queries at scale. Loki if you want Grafana-native and are okay with limited querying. Elasticsearch if you're already invested and have the ops team for it.

Dashboards and alerts: Grafana ties everything together. Traces, metrics, and logs in one UI with cross-linking via trace_id.

The cultural shift

The hardest part of observability isn't the tooling. It's changing how your team thinks about production systems.

Log-driven teams ask "what error message did the user see?" and search for it. Observable teams ask "what happened to request X?" and follow its trace through the system. The first approach works when you have one service. The second approach scales to fifty.

Invest in traces first, metrics second, logs third. Structure everything. Index selectively. And the next time someone says "just add a log line," ask them what question they're trying to answer. There's probably a better way to instrument it.