Performance isn't about intuition — it's about data. Without clear visibility into system behavior, optimization becomes guessing. Observability is the ability to understand the internal state of a system through its external outputs.
This article explores the pillars of observability and how to build a solid foundation for performance work.
If you can't see it, you can't measure it. If you can't measure it, you can't improve it.
The Three Pillars
1. Metrics
Numerical values aggregated over time.
cpu_usage: 75%
request_latency_p95: 120ms
error_rate: 0.5%
active_connections: 234
Characteristics:
- Compact and efficient
- Ideal for dashboards and alerts
- Show trends
- Lose individual details
When to use:
- Continuous monitoring
- Threshold-based alerts
- Capacity planning
- SLO tracking
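As a minimal sketch, here is how the active_connections example above could be tracked as a gauge on a Node.js server, assuming the prom-client library (the metric name is illustrative):

```js
const http = require('http');
const client = require('prom-client');   // assumed metrics library

const activeConnections = new client.Gauge({
  name: 'active_connections',
  help: 'Connections currently open',
});

const server = http.createServer((req, res) => res.end('ok'));
server.on('connection', (socket) => {
  activeConnections.inc();                            // connection opened
  socket.on('close', () => activeConnections.dec());  // connection closed
});
server.listen(3000);
```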
2. Logs
Discrete records of events.
2024-01-15 10:23:45 INFO [req-123] User 456 logged in
2024-01-15 10:23:46 ERROR [req-124] Database timeout after 5000ms
2024-01-15 10:23:47 WARN [req-125] Retry attempt 2 for payment service
Characteristics:
- Rich in context
- Flexible (free text or structured)
- Volume can be very high
- Hard to aggregate
When to use:
- Debugging specific problems
- Auditing
- Post-incident forensic analysis
- Understanding event sequences
3. Traces
Records that follow a single request end to end through the system.
Request abc-123 (total: 250ms)
├── API Gateway (5ms)
├── Auth Service (20ms)
├── Product Service (180ms)
│   ├── Cache lookup (2ms)
│   ├── Database query (150ms) ← Bottleneck!
│   └── Response serialization (28ms)
└── Response sent
Characteristics:
- Show end-to-end flow
- Identify bottlenecks
- Connect distributed services
- Instrumentation overhead
When to use:
- Latency debugging
- Understanding dependencies
- Identifying problematic services
- Distributed performance analysis
Essential Performance Metrics
RED Method (for services)
- Rate: requests per second
- Errors: error rate
- Duration: request latency
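A minimal sketch of recording all three signals in one place, assuming an Express service instrumented with prom-client (metric and label names are illustrative):

```js
const express = require('express');       // assumed web framework
const client = require('prom-client');    // assumed metrics library

const httpRequestsTotal = new client.Counter({
  name: 'http_requests_total',            // Rate (and Errors, via the status label)
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status'],
});

const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',  // Duration
  help: 'HTTP request latency in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5],
});

const app = express();
app.use((req, res, next) => {
  const stopTimer = httpRequestDuration.startTimer();
  res.on('finish', () => {
    const labels = {
      method: req.method,
      route: req.route ? req.route.path : req.path,
      status: res.statusCode,
    };
    httpRequestsTotal.inc(labels);  // Rate; error rate = share of 5xx statuses
    stopTimer(labels);              // Duration
  });
  next();
});

app.get('/api/users', (req, res) => res.json([]));
app.listen(3000);
```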
USE Method (for resources)
- Utilization: percentage of time in use
- Saturation: queued work
- Errors: resource errors
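A rough sketch of USE-style signals for a single Node.js process, using CPU time for utilization and event-loop delay as a saturation proxy; in practice these usually come from an exporter or the OS rather than hand-rolled code:

```js
const { monitorEventLoopDelay } = require('perf_hooks');

const loopDelay = monitorEventLoopDelay({ resolution: 20 });
loopDelay.enable();

let lastCpu = process.cpuUsage();
let lastTime = process.hrtime.bigint();

setInterval(() => {
  const cpu = process.cpuUsage(lastCpu);                    // µs of CPU since last sample
  const now = process.hrtime.bigint();
  const elapsedUs = Number(now - lastTime) / 1000;          // wall-clock µs since last sample

  const utilization = (cpu.user + cpu.system) / elapsedUs;  // Utilization: fraction of one core
  const saturationMs = loopDelay.percentile(99) / 1e6;      // Saturation: p99 event-loop delay in ms

  console.log({ utilization, saturationMs });               // export as gauges in a real setup

  lastCpu = process.cpuUsage();
  lastTime = now;
  loopDelay.reset();
}, 10_000);
```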
The Four Golden Signals (Google SRE)
- Latency: time to serve requests
- Traffic: demand on the system
- Errors: rate of failing requests
- Saturation: how "full" the system is
Implementing Observability
Common stack
Application
 ├─ metrics → Prometheus / Datadog / New Relic → Grafana (visualization)
 ├─ logs    → Elasticsearch / Loki / Splunk → Kibana / Grafana
 └─ traces  → Jaeger / Zipkin / Datadog APM
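The first hop in the metrics pipeline is usually the application exposing an endpoint for the collector to scrape. A minimal sketch assuming Express and prom-client (the port and path are conventions, not requirements):

```js
const express = require('express');      // assumed web framework
const client = require('prom-client');   // assumed metrics library

client.collectDefaultMetrics();          // process CPU, memory, event-loop lag, GC, etc.

const app = express();
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});
app.listen(3000);                        // Prometheus scrapes http://host:3000/metrics
```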
Instrumentation
Metrics:
```js
// Counter (a prom-client-style counter, declared once at startup)
requestsTotal.inc({ endpoint: '/api/users', status: 200 });

// Histogram for latency
const timer = requestDuration.startTimer();
// ... process the request ...
timer({ endpoint: '/api/users' });   // records elapsed time with labels
```
Structured logs:
```js
logger.info({
  event: 'request_completed',
  requestId: 'abc-123',
  userId: 456,
  endpoint: '/api/users',
  duration: 120,
  status: 200
});
```
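The snippet above assumes a JSON logger already exists. A minimal sketch of creating one with pino (library assumed), using a child logger so the request ID is attached to every line:

```js
const pino = require('pino');            // assumed logging library
const logger = pino({ level: 'info' });

function handleRequest(requestId, userId) {
  const log = logger.child({ requestId });  // requestId attached to every entry
  const start = Date.now();

  // ... handle the request ...

  log.info({
    event: 'request_completed',
    userId,
    endpoint: '/api/users',
    duration: Date.now() - start,
    status: 200,
  });
}
```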
Traces:
```js
const span = tracer.startSpan('database_query');
span.setTag('query', 'SELECT * FROM users');
// ... execute query ...
span.finish();
```
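Nesting child spans is what produces the tree shown in the Traces section. A sketch in the same OpenTracing-style API, where cache and db stand in for whatever clients the service actually uses:

```js
async function getProduct(tracer, parentSpan, id) {
  const span = tracer.startSpan('product_service', { childOf: parentSpan });

  const cacheSpan = tracer.startSpan('cache_lookup', { childOf: span });
  let product = await cache.get(id);                 // 'cache' is an assumed client
  cacheSpan.finish();

  if (!product) {
    const dbSpan = tracer.startSpan('database_query', { childOf: span });
    dbSpan.setTag('db.statement', 'SELECT * FROM products WHERE id = ?');
    product = await db.query('SELECT * FROM products WHERE id = ?', [id]);  // 'db' is assumed
    dbSpan.finish();
  }

  span.finish();
  return product;
}
```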
Effective Dashboards
Principles
- Hierarchy: overview → detail
- Context: show what's normal
- Action: each chart should inform a decision
- Simplicity: less is more
Typical performance dashboard
Level 1 - Overview:
- Request rate
- Latency p50, p95, p99
- Error rate
- Throughput
Level 2 - Per service:
- RED metrics per endpoint
- Dependencies and their latencies
- Resources (CPU, memory)
Level 3 - Detail:
- Slowest queries
- Traces of specific requests
- Filtered logs
Performance Alerts
What to alert on
| Metric | Alert condition |
|---|---|
| Latency p99 | > 2x normal for 5 min |
| Error rate | > 1% for 2 min |
| CPU saturation | > 80% for 10 min |
| Availability | < 99.9% in 1h window |
Best practices
Alert on symptoms, not causes
- Good: "Checkout latency > 500ms"
- Bad: "High CPU on server X"
Avoid alert fatigue
- Every alert should be actionable
- If you regularly ignore it, remove it
Include context in the alert
ALERT: Latency p99 at 850ms (normal: 200ms)
Dashboard: link
Runbook: link
Recent deploys: link
Data Correlation
Why correlate
Deploy at 14:00
↓
Latency rises at 14:05
↓
Logs show connection errors
↓
Trace reveals new code calling DB without index
Without correlation, each tool shows part of the story.
How to correlate
- Request ID in all logs and traces (see the sketch after this list)
- Synchronized timestamps (NTP)
- Consistent tags (environment, service, version)
- Integrated tools or exporting to same destination
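A minimal sketch of the first point, assuming Express and pino: accept or generate a request ID, bind it to the logger, and echo it onward so logs, traces, and downstream calls can be joined later. The X-Request-Id header is a common convention, not a standard:

```js
const crypto = require('crypto');
const express = require('express');      // assumed web framework
const pino = require('pino');            // assumed logging library

const logger = pino();
const app = express();

app.use((req, res, next) => {
  const requestId = req.headers['x-request-id'] || crypto.randomUUID();
  req.log = logger.child({ requestId });   // every log line carries the ID
  req.requestId = requestId;               // reuse it for downstream calls and span tags
  res.set('X-Request-Id', requestId);      // echo it back to the caller
  next();
});

app.get('/api/users', (req, res) => {
  req.log.info({ event: 'request_started' });
  // e.g. fetch('http://inventory/stock', { headers: { 'X-Request-Id': req.requestId } })
  res.json([]);
});
app.listen(3000);
```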
Observability Cost
Trade-offs
| More data | Cost |
|---|---|
| More metrics | More storage, higher cardinality |
| More logs | More storage, more processing |
| More traces | Instrumentation overhead |
Optimization strategies
- Trace sampling (you don't need 100%; see the sketch after this list)
- Metric aggregation (don't need 1s granularity)
- Differentiated retention (recent data detailed, old data aggregated)
- Appropriate log levels (DEBUG only when necessary)
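As a sketch of the first strategy, a naive head-based sampler decides once at the edge of the system and propagates that decision; the header name and the 10% rate are illustrative, and real tracer SDKs ship samplers like this out of the box:

```js
const SAMPLE_RATE = 0.1;   // keep ~10% of traces; tune to traffic and budget

function shouldSample(req) {
  // Honor an upstream decision so a distributed trace is never half-recorded.
  const upstream = req.headers['x-trace-sampled'];   // illustrative header name
  if (upstream === '1') return true;
  if (upstream === '0') return false;
  return Math.random() < SAMPLE_RATE;
}

function handle(req) {
  if (shouldSample(req)) {
    // start real spans and propagate 'x-trace-sampled: 1' downstream
  } else {
    // skip span recording and propagate 'x-trace-sampled: 0'
  }
}
```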
Observability for Performance
Questions you should be able to answer
- What is the current system latency?
- What was the latency 1 week ago?
- Which endpoint is slowest?
- Where does a request spend its time?
- What changed when latency increased?
- Are we close to saturation?
If you can't answer...
...you don't have enough observability to work on performance seriously.
Conclusion
Observability isn't optional for performance — it's a prerequisite.
Invest in:
- Metrics for trends and alerts
- Logs for debugging
- Traces for understanding flows
- Correlation to connect the dots
- Dashboards that tell a story
- Alerts that are actionable
Modern systems are black boxes without observability. Open the box before trying to optimize it.