"The system is slow." Where do you look first? Logs? Metrics? Traces? The right answer depends on the question you're asking. The three pillars of observability are complementary, not competing. This article teaches you when to use each one.
Observability isn't about having data. It's about being able to answer any question about your system.
## The Three Pillars

### Overview

Metrics:
- What: Aggregated numbers over time
- Answers: "What is happening?"
- Example: "p95 latency = 200ms"

Logs:
- What: Discrete events with context
- Answers: "What happened, specifically?"
- Example: "Request X failed with error Y at time Z"

Traces:
- What: The journey of a request through the system
- Answers: "Where did it go and how long did it take?"
- Example: "Request went through A (50ms) → B (200ms) → C (30ms)"
### Analogy

Metrics = Thermometer
- "You have a fever of 102°F"
- You know something is wrong, but not what

Logs = Medical history
- "Patient reported a sore throat 3 days ago"
- Specific events that help the diagnosis

Traces = Imaging scan
- "Infection located in the right tonsil"
- A complete visualization of where the problem is
## Metrics

### Characteristics

- Format: Numeric values with timestamps
- Granularity: Aggregated (averages, percentiles, counts)
- Volume: Low (compressed data points)
- Retention: Long (months to years)
- Cost: Low per data point stored
### Types of metrics
Counter:
- Always increments
- Ex: total requests, errors, bytes
Gauge:
- Can go up or down
- Ex: temperature, active connections, memory
Histogram:
- Distribution of values
- Ex: latency, payload size
Summary:
- Pre-calculated percentiles
- Ex: p50, p95, p99 of latency
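
To make the four types concrete, here is a minimal sketch of how they can be declared with the Node.js prom-client library; the library choice, metric names, and bucket boundaries are illustrative assumptions rather than anything this article prescribes.

```js
// Minimal sketch with prom-client (assumed library; names are illustrative)
const client = require('prom-client');

// Counter: always increments
const ordersTotal = new client.Counter({
  name: 'orders_processed_total',
  help: 'Total number of processed orders',
});

// Gauge: can go up or down
const dbConnectionsActive = new client.Gauge({
  name: 'db_connections_active',
  help: 'Currently active database connections',
});

// Histogram: distribution of observed values in buckets
const checkoutLatency = new client.Histogram({
  name: 'checkout_duration_seconds',
  help: 'Checkout latency in seconds',
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2],
});

// Summary: percentiles calculated in the client process
const payloadSize = new client.Summary({
  name: 'request_payload_bytes',
  help: 'Request payload size in bytes',
  percentiles: [0.5, 0.95, 0.99],
});

ordersTotal.inc();              // one more order
dbConnectionsActive.set(42);    // current pool size
checkoutLatency.observe(0.18);  // a 180ms checkout
payloadSize.observe(2048);      // a 2 KB payload
```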
### Examples in Prometheus

```promql
# Counter: requests per second (rate)
rate(http_requests_total[5m])

# Gauge: active connections right now
db_connections_active

# Histogram: 95th percentile latency
histogram_quantile(0.95, rate(http_duration_seconds_bucket[5m]))

# Aggregation: error rate
  sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))
```
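
For these queries to return anything, the application has to record and expose the underlying series. Here is a minimal sketch with Express and prom-client that reuses the metric names from the PromQL examples; the middleware shape and port are assumptions for illustration.

```js
const express = require('express');
const client = require('prom-client');

// Standard process metrics (CPU, memory, event loop) in addition to our own
client.collectDefaultMetrics();

const requestsTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['status'],
});
const requestDuration = new client.Histogram({
  name: 'http_duration_seconds',
  help: 'HTTP request duration in seconds',
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2, 5],
});

const app = express();

// Time every request and count it by status code
app.use((req, res, next) => {
  const endTimer = requestDuration.startTimer();
  res.on('finish', () => {
    endTimer();
    requestsTotal.inc({ status: String(res.statusCode) });
  });
  next();
});

// Prometheus scrapes this endpoint on its own schedule
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});

app.listen(3000);
```

Prometheus would then be configured to scrape /metrics on this port, and the queries above work unchanged.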
### When to use metrics
✅ Use for:
- Alerts (threshold violated)
- Health dashboards
- Trends over time
- Period comparisons
- Capacity planning
❌ Don't use for:
- Investigating specific request
- Understanding error context
- Individual case debug
## Logs

### Characteristics

- Format: Text or JSON with a timestamp
- Granularity: Individual events
- Volume: High (every event is recorded)
- Retention: Medium (days to weeks)
- Cost: High (grows with volume)
### Log levels
DEBUG:
- Details for development
- Normally disabled in production
INFO:
- Important normal events
- "User X logged in"
WARN:
- Unusual but not critical situations
- "Retry needed for service Y"
ERROR:
- Failures that need attention
- "Timeout connecting to database"
FATAL:
- System cannot continue
- "Invalid configuration, shutting down"
### Structured logs

```
// ❌ Unstructured log
"2024-01-15 10:30:45 ERROR Failed to process order 12345 for user john@example.com"

// ✅ Structured log
{
  "timestamp": "2024-01-15T10:30:45Z",
  "level": "ERROR",
  "message": "Failed to process order",
  "order_id": "12345",
  "user_email": "john@example.com",
  "error_type": "PaymentDeclined",
  "payment_provider": "stripe",
  "trace_id": "abc123",
  "duration_ms": 1523
}
```
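
Emitting a log like this is a one-liner with a structured logger. A minimal sketch with pino (the library choice is an assumption; the field names match the example above):

```js
const pino = require('pino');
const logger = pino();

logger.error(
  {
    order_id: '12345',
    user_email: 'john@example.com',
    error_type: 'PaymentDeclined',
    payment_provider: 'stripe',
    trace_id: 'abc123',
    duration_ms: 1523,
  },
  'Failed to process order'
);
// pino adds the timestamp, level, pid, and hostname automatically
// and writes one JSON object per line
```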
### When to use logs
✅ Use for:
- Investigating specific errors
- Audit and compliance
- Business flow debugging
- Failure context
❌ Don't use for:
- Aggregated metrics (use metrics)
- Visualizing request flow (use traces)
- Threshold alerts
### Logging best practices
1. Always structured:
- JSON with standardized fields
- Facilitates queries and analysis
2. Include context:
- trace_id for correlation
- user_id for investigation
- request_id for tracking
3. Avoid unnecessary PII:
- Don't log passwords, tokens
- Mask sensitive data
4. Log at the right level:
- Production: INFO and above
- Debug only when necessary
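
A sketch of practices 2 to 4 with pino: level per environment, masking of sensitive fields, and a child logger that carries correlation context. The field names and the environment switch are illustrative assumptions.

```js
const pino = require('pino');

const logger = pino({
  // Production: INFO and above; DEBUG only outside production
  level: process.env.NODE_ENV === 'production' ? 'info' : 'debug',
  // Mask sensitive data instead of writing it in clear text
  redact: {
    paths: ['password', 'token', 'user_email', '*.card_number'],
    censor: '[REDACTED]',
  },
});

// A child logger stamps every line it writes with the same correlation context
const requestLogger = logger.child({
  trace_id: 'abc123',
  request_id: 'req-42',
  user_id: 'u-7',
});
requestLogger.info('Order accepted');
```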
## Traces

### Characteristics

- Format: Spans connected by a trace_id
- Granularity: Per request, across services
- Volume: Medium (sampled in high traffic)
- Retention: Short (hours to days)
- Cost: Medium to high
### Anatomy of a trace

```
Trace ID: abc-123-xyz
├─ Span: API Gateway (0-50ms)
│   └─ Span: Auth Service (10-30ms)
│       └─ Span: Redis Cache (15-20ms)
│
├─ Span: Order Service (50-250ms)
│   ├─ Span: PostgreSQL Query (60-150ms)
│   └─ Span: Inventory Check (160-200ms)
│
└─ Span: Payment Service (250-400ms)
    └─ Span: Stripe API (280-390ms)

Total: 400ms
Longest individual spans: Stripe API (110ms) and PostgreSQL Query (90ms)
```
### Implementing tracing

```js
// Example with OpenTelemetry (@opentelemetry/api)
// Assumes the OpenTelemetry SDK has been initialized elsewhere in the application
const { trace, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('order-service');

async function processOrder(orderId, customerId) {
  return tracer.startActiveSpan(
    'process_order',
    { attributes: { 'order.id': orderId, 'customer.id': customerId } },
    async (span) => {
      try {
        // Child span for the DB operation; it is parented to process_order
        // because it starts while that span is active
        const dbSpan = tracer.startSpan('db.query');
        const order = await db.getOrder(orderId);
        dbSpan.end();

        // Child span for the payment call
        const paymentSpan = tracer.startSpan('payment.process');
        await processPayment(order);
        paymentSpan.end();

        span.setStatus({ code: SpanStatusCode.OK });
        return order;
      } catch (error) {
        span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
        throw error;
      } finally {
        span.end();
      }
    }
  );
}
```
### When to use traces
✅ Use for:
- Identifying bottlenecks in slow requests
- Understanding flow between microservices
- Distributed latency debugging
- Visualizing dependencies
❌ Don't use for:
- Alerts (too granular)
- Long-term trends (use metrics)
- Detailed auditing (use logs)
## Integrating the Three Pillars

### Investigation flow

1. Alert (metric): "Error rate rose to 5%"
2. Context (metric): "Increase correlated with DB latency"
3. Investigation (trace): "Slow requests go through query X"
4. Detail (log): "Query X failing with timeout due to lock"
5. Root cause: "Deploy Y introduced lock contention"
### Correlation by trace_id

The trace_id connects all three pillars. Note that a high-cardinality value like trace_id should not be a regular metric label; in Prometheus it is attached to individual observations as an exemplar.

Metric (exemplar on the latency histogram):
- http_request_duration = 2.5s, exemplar trace_id="abc123"

Trace:
- trace_id: abc123
- spans: [gateway, auth, order, payment]
- duration: 2500ms
- bottleneck: payment (2000ms)

Log:

```json
{
  "trace_id": "abc123",
  "service": "payment",
  "message": "Stripe timeout after 2000ms",
  "error_code": "TIMEOUT"
}
```
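
A common way to get this correlation in practice is to stamp every log line with the currently active trace_id. A minimal sketch with @opentelemetry/api and pino; the logger choice and the logWithTrace helper are illustrative assumptions:

```js
const { trace } = require('@opentelemetry/api');
const pino = require('pino');
const logger = pino();

function logWithTrace(level, fields, message) {
  // Attach the current trace_id, if a span is active, so logs, traces,
  // and metric exemplars can all be joined on the same identifier
  const activeSpan = trace.getActiveSpan();
  const trace_id = activeSpan ? activeSpan.spanContext().traceId : undefined;
  logger[level]({ ...fields, trace_id }, message);
}

logWithTrace(
  'error',
  { service: 'payment', error_code: 'TIMEOUT' },
  'Stripe timeout after 2000ms'
);
```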
## Practical investigation example

Scenario: "The system is slow at 10 AM"
### 1. Metrics (Grafana)
- p95 latency: 2s (normal: 200ms)
- Throughput: normal
- Error rate: 3% (normal: 0.1%)
→ Problem confirmed; it's not just perception
### 2. Drill-down in metrics
- Latency by endpoint: /api/checkout 10x slower
- Latency by service: Payment service degraded
→ Problem located in payment
### 3. Traces (Jaeger)
- Trace of slow request
- 90% of time in span "stripe_api_call"
→ Bottleneck is external call to Stripe
### 4. Logs (Elasticsearch)
```json
{
"timestamp": "2024-01-15T10:05:00Z",
"service": "payment",
"message": "Stripe API retry attempt 3",
"trace_id": "xyz789",
"response_time_ms": 1800,
"stripe_error": "rate_limited"
}
```
→ Root cause: Stripe rate limiting
### 5. Resolution
- Implement circuit breaker
- Add validation cache
- Negotiate rate limit with Stripe
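
As one possible shape for the first mitigation, here is a minimal circuit-breaker sketch using the opossum library. The library choice, the thresholds, and the callStripe placeholder are assumptions for illustration, not the article's prescribed fix.

```js
const CircuitBreaker = require('opossum');

// Placeholder for the real Stripe call (illustrative)
async function callStripe(paymentRequest) {
  // ... real HTTP call to the Stripe API goes here
  return { status: 'charged' };
}

const breaker = new CircuitBreaker(callStripe, {
  timeout: 2000,                 // fail fast instead of hanging on a rate-limited API
  errorThresholdPercentage: 50,  // open the circuit when half the calls fail
  resetTimeout: 30000,           // probe again after 30 seconds
});

// While the circuit is open, degrade gracefully instead of piling up retries
breaker.fallback(() => ({ status: 'queued', reason: 'payment_provider_unavailable' }));

async function chargeCustomer(paymentRequest) {
  return breaker.fire(paymentRequest);
}
```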
## Tools by Pillar
### Open source stack
```yaml
Metrics:
  Collection: Prometheus, Victoria Metrics
  Visualization: Grafana
  Alerts: Alertmanager

Logs:
  Collection: Fluentd, Fluent Bit, Vector
  Storage: Elasticsearch, Loki
  Visualization: Kibana, Grafana

Traces:
  Collection: OpenTelemetry, Jaeger Agent
  Storage: Jaeger, Tempo, Zipkin
  Visualization: Jaeger UI, Grafana
```
### Managed stack

All-in-one:
- Datadog
- New Relic
- Dynatrace
- Splunk

By pillar:
- Metrics: CloudWatch, Datadog
- Logs: CloudWatch Logs, Papertrail
- Traces: X-Ray, Honeycomb
## Conclusion
The three pillars are complementary:
- Metrics: Detect that something is wrong (alerts)
- Traces: Show where the problem is (location)
- Logs: Explain why it happened (context)
Use all three together:
- Correlate by trace_id
- Start with metrics for overview
- Use traces to locate
- Use logs for detail
A truly observable system allows you to answer questions you didn't think to ask before having the problem.
This article is part of the series on the OCTOPUS Performance Engineering methodology.