Methodology · 9 min read

Observability: seeing system behavior

You can't improve what you can't see. Observability is the foundation of any serious performance work.

Performance isn't about intuition — it's about data. Without clear visibility into system behavior, optimization becomes guessing. Observability is the ability to understand the internal state of a system through its external outputs.

This article explores the pillars of observability and how to build a solid foundation for performance work.

If you can't see it, you can't measure it. If you can't measure it, you can't improve it.

The Three Pillars

1. Metrics

Numerical values aggregated over time.

cpu_usage: 75%
request_latency_p95: 120ms
error_rate: 0.5%
active_connections: 234

Characteristics:

  • Compact and efficient
  • Ideal for dashboards and alerts
  • Show trends
  • Lose individual details

When to use:

  • Continuous monitoring
  • Threshold-based alerts
  • Capacity planning
  • SLO tracking
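
For instance, the active_connections value shown above would typically be a gauge that the application updates as connections open and close. A minimal sketch with prom-client (the library choice and metric name are assumptions, not requirements):

const net = require('net');
const client = require('prom-client');

const activeConnections = new client.Gauge({
  name: 'active_connections',
  help: 'Connections currently open',
});

const server = net.createServer();
server.on('connection', (socket) => {
  activeConnections.inc();                           // connection opened
  socket.on('close', () => activeConnections.dec()); // connection closed
});
server.listen(8080);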

2. Logs

Discrete records of events.

2024-01-15 10:23:45 INFO  [req-123] User 456 logged in
2024-01-15 10:23:46 ERROR [req-124] Database timeout after 5000ms
2024-01-15 10:23:47 WARN  [req-125] Retry attempt 2 for payment service

Characteristics:

  • Rich in context
  • Flexible (free text or structured)
  • Volume can be very high
  • Hard to aggregate

When to use:

  • Debugging specific problems
  • Auditing
  • Post-incident forensic analysis
  • Understanding event sequences

3. Traces

Request tracking through the system.

Request abc-123 (total: 250ms)
├── API Gateway (5ms)
├── Auth Service (20ms)
├── Product Service (180ms)
│   ├── Cache lookup (2ms)
│   ├── Database query (150ms)  ← Bottleneck!
│   └── Response serialization (28ms)
└── Response sent

Characteristics:

  • Show end-to-end flow
  • Identify bottlenecks
  • Connect distributed services
  • Instrumentation overhead

When to use:

  • Latency debugging
  • Understanding dependencies
  • Identifying problematic services
  • Distributed performance analysis

Essential Performance Metrics

RED Method (for services)

  • Rate: requests per second
  • Errors: error rate
  • Duration: request latency
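
As a concrete illustration, here is a minimal sketch of RED instrumentation as Express middleware with prom-client; the metric names, labels, and libraries are assumptions, not a fixed standard:

// Record Rate, Errors and Duration for every HTTP request (sketch).
const express = require('express');
const client = require('prom-client');

const requests = new client.Counter({
  name: 'http_requests_total',            // Rate: count requests, derive req/s at query time
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status'],
});

const duration = new client.Histogram({
  name: 'http_request_duration_seconds',  // Duration: latency distribution (p50/p95/p99)
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status'],
});

const app = express();
app.use((req, res, next) => {
  const end = duration.startTimer();
  res.on('finish', () => {
    const labels = { method: req.method, route: req.path, status: res.statusCode };
    requests.inc(labels);                  // Errors: filter on status >= 500 when querying
    end(labels);
  });
  next();
});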

USE Method (for resources)

  • Utilization: percentage of time in use
  • Saturation: queued work
  • Errors: resource errors
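
As a rough illustration, CPU utilization and saturation can be sampled from a Node process with the built-in os module; in practice these numbers usually come from an agent such as node_exporter, so treat this as a sketch of the idea only:

const os = require('os');

// Sum time spent in each CPU state across all cores.
function cpuSnapshot() {
  return os.cpus().reduce(
    (acc, cpu) => {
      acc.idle += cpu.times.idle;
      acc.total += cpu.times.user + cpu.times.nice + cpu.times.sys
                 + cpu.times.idle + cpu.times.irq;
      return acc;
    },
    { idle: 0, total: 0 }
  );
}

const before = cpuSnapshot();
setTimeout(() => {
  const after = cpuSnapshot();
  const idle = after.idle - before.idle;
  const total = after.total - before.total;
  const utilization = 1 - idle / total;                    // share of time the CPUs were busy
  const saturation = os.loadavg()[0] / os.cpus().length;   // load per core; > 1 means queued work
  console.log({ utilization, saturation });
}, 1000);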

The Four Golden Signals (Google SRE)

  1. Latency: time to serve requests
  2. Traffic: demand on the system
  3. Errors: rate of failing requests
  4. Saturation: how "full" the system is

Implementing Observability

Common stack

Application
    ↓ (metrics)
Prometheus / Datadog / New Relic
    ↓
Grafana (visualization)

Application
    ↓ (logs)
Elasticsearch / Loki / Splunk
    ↓
Kibana / Grafana

Application
    ↓ (traces)
Jaeger / Zipkin / Datadog APM
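
The "Application → Prometheus" step usually means the application exposes an endpoint that Prometheus scrapes. A minimal sketch with Express and prom-client (the /metrics path is the usual convention; the rest is an assumption about your setup):

const express = require('express');
const client = require('prom-client');

client.collectDefaultMetrics();     // process-level metrics: CPU, memory, event loop lag

const app = express();
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());   // current values of all registered metrics
});

app.listen(3000);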

Instrumentation

Metrics:

// Counter
requestsTotal.inc({ endpoint: '/api/users', status: 200 });

// Histogram for latency
const timer = requestDuration.startTimer();
// ... process ...
timer({ endpoint: '/api/users' });
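
The requestsTotal counter and requestDuration histogram used above have to be defined and registered somewhere; with prom-client that might look like the following (names, labels, and buckets are assumptions):

const client = require('prom-client');

const requestsTotal = new client.Counter({
  name: 'requests_total',
  help: 'Total number of requests',
  labelNames: ['endpoint', 'status'],
});

const requestDuration = new client.Histogram({
  name: 'request_duration_seconds',
  help: 'Request duration in seconds',
  labelNames: ['endpoint'],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5],   // pick buckets around your latency targets
});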

Structured logs:

logger.info({
    event: 'request_completed',
    requestId: 'abc-123',
    userId: 456,
    endpoint: '/api/users',
    duration: 120,
    status: 200
});
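
A common way to get the requestId into every line is a child logger created once per request. A sketch with pino (the snippet above is library-agnostic; pino is just one option):

const pino = require('pino');
const logger = pino();

// One child logger per request: every line it emits carries the requestId.
const reqLogger = logger.child({ requestId: 'abc-123' });

reqLogger.info({ event: 'request_received', endpoint: '/api/users' });
reqLogger.info({ event: 'request_completed', duration: 120, status: 200 });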

Traces:

const span = tracer.startSpan('database_query');
span.setTag('query', 'SELECT * FROM users');
// ... execute query ...
span.finish();
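
Nested spans are what produce the tree shown earlier. With the OpenTracing-style API used above, a child span might be created like this (operation names are illustrative, and tracer is assumed to be initialized as in the snippet above):

const parent = tracer.startSpan('handle_request');

// Link the child to its parent so both end up in the same trace tree.
const child = tracer.startSpan('database_query', { childOf: parent });
child.setTag('db.statement', 'SELECT * FROM users WHERE id = ?');
// ... execute query ...
child.finish();

parent.finish();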

Effective Dashboards

Principles

  1. Hierarchy: overview → detail
  2. Context: show what's normal
  3. Action: each chart should inform a decision
  4. Simplicity: less is more

Typical performance dashboard

Level 1 - Overview:

  • Request rate
  • Latency p50, p95, p99
  • Error rate
  • Throughput

Level 2 - Per service:

  • RED metrics per endpoint
  • Dependencies and their latencies
  • Resources (CPU, memory)

Level 3 - Detail:

  • Slowest queries
  • Traces of specific requests
  • Filtered logs

Performance Alerts

What to alert on

  • Latency p99: > 2x normal for 5 min
  • Error rate: > 1% for 2 min
  • CPU saturation: > 80% for 10 min
  • Availability: < 99.9% in a 1h window

Best practices

  1. Alert on symptoms, not causes

    • Good: "Checkout latency > 500ms"
    • Bad: "High CPU on server X"
  2. Avoid alert fatigue

    • Every alert should be actionable
    • If you regularly ignore it, remove it
  3. Context in alert

    ALERT: Latency p99 at 850ms (normal: 200ms)
    Dashboard: link
    Runbook: link
    Recent deploys: link
    

Data Correlation

Why correlate

Deploy at 14:00
    ↓
Latency rises at 14:05
    ↓
Logs show connection errors
    ↓
Trace reveals new code calling DB without index

Without correlation, each tool shows part of the story.

How to correlate

  1. Request ID in all logs and traces (see the sketch after this list)
  2. Synchronized timestamps (NTP)
  3. Consistent tags (environment, service, version)
  4. Integrated tools or exporting to same destination
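
A minimal sketch of point 1: Express middleware that assigns a request ID, binds it to a per-request logger, and echoes it back in a response header so other services and the client can reuse it. The libraries (Express, pino) and header name are assumptions; the same ID would also be set as a tag on the request's span:

const crypto = require('crypto');
const express = require('express');
const pino = require('pino');

const logger = pino();
const app = express();

app.use((req, res, next) => {
  // Reuse an upstream ID if present, otherwise generate one.
  const requestId = req.headers['x-request-id'] || crypto.randomUUID();

  req.log = logger.child({ requestId });      // every log line carries the ID
  res.setHeader('x-request-id', requestId);   // propagate it back to callers

  req.log.info({ event: 'request_received', endpoint: req.path });
  next();
});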

Observability Cost

Trade-offs

  • More metrics: more storage, higher cardinality
  • More logs: more storage, more processing
  • More traces: more instrumentation overhead

Optimization strategies

  1. Trace sampling (you don't need 100% of traces; see the sketch after this list)
  2. Metric aggregation (don't need 1s granularity)
  3. Differentiated retention (recent data detailed, old data aggregated)
  4. Appropriate log levels (DEBUG only when necessary)
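
To illustrate point 1: a head-based sampler decides once, when a trace starts, whether to record it, and every span in that trace inherits the decision. Real tracers (Jaeger, the OpenTelemetry SDKs) ship configurable samplers, so this standalone sketch only shows the idea:

const SAMPLE_RATE = 0.1;   // record roughly 10% of traces

function startTrace(name) {
  const sampled = Math.random() < SAMPLE_RATE;
  return {
    name,
    sampled,                        // downstream services should respect this flag
    startedAt: Date.now(),
    finish() {
      if (!this.sampled) return;    // unsampled traces cost almost nothing
      const durationMs = Date.now() - this.startedAt;
      console.log({ trace: this.name, durationMs });   // stand-in for exporting the span
    },
  };
}

// Only ~10% of requests produce exported trace data.
const trace = startTrace('GET /api/users');
// ... handle the request ...
trace.finish();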

Observability for Performance

Questions you should be able to answer

  1. What is the current system latency?
  2. What was the latency 1 week ago?
  3. Which endpoint is slowest?
  4. Where does a request spend its time?
  5. What changed when latency increased?
  6. Are we close to saturation?

If you can't answer...

...you don't have enough observability to work on performance seriously.

Conclusion

Observability isn't optional for performance — it's a prerequisite.

Invest in:

  1. Metrics for trends and alerts
  2. Logs for debugging
  3. Traces for understanding flows
  4. Correlation to connect the dots
  5. Dashboards that tell a story
  6. Alerts that are actionable

Modern systems are black boxes without observability. Open the box before trying to optimize it.


Want to understand your platform's limits?

Contact us for a performance assessment.
