Methodology · 9 min read

Observability: seeing system behavior

You can't improve what you can't see. Observability is the foundation of any serious performance work.

Performance isn't about intuition — it's about data. Without clear visibility into system behavior, optimization becomes guessing. Observability is the ability to understand the internal state of a system through its external outputs.

This article explores the pillars of observability and how to build a solid foundation for performance work.

If you can't see it, you can't measure it. If you can't measure it, you can't improve it.

The Three Pillars

1. Metrics

Numerical values aggregated over time.

cpu_usage: 75%
request_latency_p95: 120ms
error_rate: 0.5%
active_connections: 234

Characteristics:

  • Compact and efficient
  • Ideal for dashboards and alerts
  • Show trends
  • Lose individual details

When to use:

  • Continuous monitoring
  • Threshold-based alerts
  • Capacity planning
  • SLO tracking
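
For instance, the active_connections value shown above would typically be a gauge that the application updates as connections open and close. A minimal sketch with prom-client (the library choice and metric name are assumptions, not requirements):

const net = require('net');
const client = require('prom-client');

const activeConnections = new client.Gauge({
  name: 'active_connections',
  help: 'Connections currently open',
});

const server = net.createServer();
server.on('connection', (socket) => {
  activeConnections.inc();                           // connection opened
  socket.on('close', () => activeConnections.dec()); // connection closed
});
server.listen(8080);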

2. Logs

Discrete records of events.

2024-01-15 10:23:45 INFO  [req-123] User 456 logged in
2024-01-15 10:23:46 ERROR [req-124] Database timeout after 5000ms
2024-01-15 10:23:47 WARN  [req-125] Retry attempt 2 for payment service

Characteristics:

  • Rich in context
  • Flexible (free text or structured)
  • Volume can be very high
  • Hard to aggregate

When to use:

  • Debugging specific problems
  • Auditing
  • Post-incident forensic analysis
  • Understanding event sequences

3. Traces

Request tracking through the system.

Request abc-123 (total: 250ms)
├── API Gateway (5ms)
├── Auth Service (20ms)
├── Product Service (180ms)
│   ├── Cache lookup (2ms)
│   ├── Database query (150ms)  ← Bottleneck!
│   └── Response serialization (28ms)
└── Response sent

Characteristics:

  • Show end-to-end flow
  • Identify bottlenecks
  • Connect distributed services
  • Instrumentation overhead

When to use:

  • Latency debugging
  • Understanding dependencies
  • Identifying problematic services
  • Distributed performance analysis

Essential Performance Metrics

RED Method (for services)

  • Rate: requests per second
  • Errors: error rate
  • Duration: request latency
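
As a concrete illustration, here is a minimal sketch of RED instrumentation as Express middleware with prom-client; the metric names, labels, and libraries are assumptions, not a fixed standard:

// Record Rate, Errors and Duration for every HTTP request (sketch).
const express = require('express');
const client = require('prom-client');

const requests = new client.Counter({
  name: 'http_requests_total',            // Rate: count requests, derive req/s at query time
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status'],
});

const duration = new client.Histogram({
  name: 'http_request_duration_seconds',  // Duration: latency distribution (p50/p95/p99)
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status'],
});

const app = express();
app.use((req, res, next) => {
  const end = duration.startTimer();
  res.on('finish', () => {
    const labels = { method: req.method, route: req.path, status: res.statusCode };
    requests.inc(labels);                  // Errors: filter on status >= 500 when querying
    end(labels);
  });
  next();
});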

USE Method (for resources)

  • Utilization: percentage of time in use
  • Saturation: queued work
  • Errors: resource errors
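
As a rough illustration, CPU utilization and saturation can be sampled from a Node process with the built-in os module; in practice these numbers usually come from an agent such as node_exporter, so treat this as a sketch of the idea only:

const os = require('os');

// Sum time spent in each CPU state across all cores.
function cpuSnapshot() {
  return os.cpus().reduce(
    (acc, cpu) => {
      acc.idle += cpu.times.idle;
      acc.total += cpu.times.user + cpu.times.nice + cpu.times.sys
                 + cpu.times.idle + cpu.times.irq;
      return acc;
    },
    { idle: 0, total: 0 }
  );
}

const before = cpuSnapshot();
setTimeout(() => {
  const after = cpuSnapshot();
  const idle = after.idle - before.idle;
  const total = after.total - before.total;
  const utilization = 1 - idle / total;                    // share of time the CPUs were busy
  const saturation = os.loadavg()[0] / os.cpus().length;   // load per core; > 1 means queued work
  console.log({ utilization, saturation });
}, 1000);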

The Four Golden Signals (Google SRE)

  1. Latency: time to serve requests
  2. Traffic: demand on the system
  3. Errors: rate of failing requests
  4. Saturation: how "full" the system is

Implementing Observability

Common stack

Application
    ↓ (metrics)
Prometheus / Datadog / New Relic
    ↓
Grafana (visualization)

Application
    ↓ (logs)
Elasticsearch / Loki / Splunk
    ↓
Kibana / Grafana

Application
    ↓ (traces)
Jaeger / Zipkin / Datadog APM
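
The "Application → Prometheus" step usually means the application exposes an endpoint that Prometheus scrapes. A minimal sketch with Express and prom-client (the /metrics path is the usual convention; the rest is an assumption about your setup):

const express = require('express');
const client = require('prom-client');

client.collectDefaultMetrics();     // process-level metrics: CPU, memory, event loop lag

const app = express();
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());   // current values of all registered metrics
});

app.listen(3000);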

Instrumentation

Metrics:

// Counter
requestsTotal.inc({ endpoint: '/api/users', status: 200 });

// Histogram for latency
const timer = requestDuration.startTimer();
// ... process ...
timer({ endpoint: '/api/users' });
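
The requestsTotal counter and requestDuration histogram used above have to be defined and registered somewhere; with prom-client that might look like the following (names, labels, and buckets are assumptions):

const client = require('prom-client');

const requestsTotal = new client.Counter({
  name: 'requests_total',
  help: 'Total number of requests',
  labelNames: ['endpoint', 'status'],
});

const requestDuration = new client.Histogram({
  name: 'request_duration_seconds',
  help: 'Request duration in seconds',
  labelNames: ['endpoint'],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5],   // pick buckets around your latency targets
});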

Structured logs:

logger.info({
    event: 'request_completed',
    requestId: 'abc-123',
    userId: 456,
    endpoint: '/api/users',
    duration: 120,
    status: 200
});
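
A common way to get the requestId into every line is a child logger created once per request. A sketch with pino (the snippet above is library-agnostic; pino is just one option):

const pino = require('pino');
const logger = pino();

// One child logger per request: every line it emits carries the requestId.
const reqLogger = logger.child({ requestId: 'abc-123' });

reqLogger.info({ event: 'request_received', endpoint: '/api/users' });
reqLogger.info({ event: 'request_completed', duration: 120, status: 200 });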

Traces:

const span = tracer.startSpan('database_query');
span.setTag('query', 'SELECT * FROM users');
// ... execute query ...
span.finish();
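
Nested spans are what produce the tree shown earlier. With the OpenTracing-style API used above, a child span might be created like this (operation names are illustrative, and tracer is assumed to be initialized as in the snippet above):

const parent = tracer.startSpan('handle_request');

// Link the child to its parent so both end up in the same trace tree.
const child = tracer.startSpan('database_query', { childOf: parent });
child.setTag('db.statement', 'SELECT * FROM users WHERE id = ?');
// ... execute query ...
child.finish();

parent.finish();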

Effective Dashboards

Principles

  1. Hierarchy: overview → detail
  2. Context: show what's normal
  3. Action: each chart should inform a decision
  4. Simplicity: less is more

Typical performance dashboard

Level 1 - Overview:

  • Request rate
  • Latency p50, p95, p99
  • Error rate
  • Throughput

Level 2 - Per service:

  • RED metrics per endpoint
  • Dependencies and their latencies
  • Resources (CPU, memory)

Level 3 - Detail:

  • Slowest queries
  • Traces of specific requests
  • Filtered logs

Performance Alerts

What to alert on

  • Latency p99: > 2x normal for 5 min
  • Error rate: > 1% for 2 min
  • CPU saturation: > 80% for 10 min
  • Availability: < 99.9% in a 1h window

Best practices

  1. Alert on symptoms, not causes

    • Good: "Checkout latency > 500ms"
    • Bad: "High CPU on server X"
  2. Avoid alert fatigue

    • Every alert should be actionable
    • If you regularly ignore it, remove it
  3. Context in alert

    ALERT: Latency p99 at 850ms (normal: 200ms)
    Dashboard: link
    Runbook: link
    Recent deploys: link
    

Data Correlation

Why correlate

Deploy at 14:00
    ↓
Latency rises at 14:05
    ↓
Logs show connection errors
    ↓
Trace reveals new code calling DB without index

Without correlation, each tool shows part of the story.

How to correlate

  1. Request ID in all logs and traces (see the sketch after this list)
  2. Synchronized timestamps (NTP)
  3. Consistent tags (environment, service, version)
  4. Integrated tools or exporting to same destination
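
A minimal sketch of point 1: Express middleware that assigns a request ID, binds it to a per-request logger, and echoes it back in a response header so other services and the client can reuse it. The libraries (Express, pino) and header name are assumptions; the same ID would also be set as a tag on the request's span:

const crypto = require('crypto');
const express = require('express');
const pino = require('pino');

const logger = pino();
const app = express();

app.use((req, res, next) => {
  // Reuse an upstream ID if present, otherwise generate one.
  const requestId = req.headers['x-request-id'] || crypto.randomUUID();

  req.log = logger.child({ requestId });      // every log line carries the ID
  res.setHeader('x-request-id', requestId);   // propagate it back to callers

  req.log.info({ event: 'request_received', endpoint: req.path });
  next();
});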

Observability Cost

Trade-offs

  • More metrics: more storage, higher cardinality
  • More logs: more storage, more processing
  • More traces: more instrumentation overhead

Optimization strategies

  1. Trace sampling (you don't need 100% of traces; see the sketch after this list)
  2. Metric aggregation (don't need 1s granularity)
  3. Differentiated retention (recent data detailed, old data aggregated)
  4. Appropriate log levels (DEBUG only when necessary)
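
To illustrate point 1: a head-based sampler decides once, when a trace starts, whether to record it, and every span in that trace inherits the decision. Real tracers (Jaeger, the OpenTelemetry SDKs) ship configurable samplers, so this standalone sketch only shows the idea:

const SAMPLE_RATE = 0.1;   // record roughly 10% of traces

function startTrace(name) {
  const sampled = Math.random() < SAMPLE_RATE;
  return {
    name,
    sampled,                        // downstream services should respect this flag
    startedAt: Date.now(),
    finish() {
      if (!this.sampled) return;    // unsampled traces cost almost nothing
      const durationMs = Date.now() - this.startedAt;
      console.log({ trace: this.name, durationMs });   // stand-in for exporting the span
    },
  };
}

// Only ~10% of requests produce exported trace data.
const trace = startTrace('GET /api/users');
// ... handle the request ...
trace.finish();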

Observability for Performance

Questions you should be able to answer

  1. What is the current system latency?
  2. What was the latency 1 week ago?
  3. Which endpoint is slowest?
  4. Where does a request spend its time?
  5. What changed when latency increased?
  6. Are we close to saturation?

If you can't answer...

...you don't have enough observability to work on performance seriously.

Conclusion

Observability isn't optional for performance — it's a prerequisite.

Invest in:

  1. Metrics for trends and alerts
  2. Logs for debugging
  3. Traces for understanding flows
  4. Correlation to connect the dots
  5. Dashboards that tell a story
  6. Alerts that are actionable

Modern systems are black boxes without observability. Open the box before trying to optimize it.


Want to understand your platform's limits?

Contact us for a performance assessment.
