"The system is slow, it must be the database." How many times have you heard this? And how many times was it wrong? Finding the real bottleneck is one of the most valuable skills in performance engineering. This article teaches you how to do it systematically.
A real bottleneck is the point that limits the throughput of the entire system. Everything else is noise.
What Is a Bottleneck
Definition
Bottleneck = The component that, if improved,
would increase the total system throughput
What is NOT a bottleneck
❌ Code that seems slow (but isn't in the critical path)
❌ Function that consumes CPU (but runs rarely)
❌ Query that takes 500ms (but runs once a day)
❌ Service with high latency (but outside any user-facing path)
Characteristics of a real bottleneck
✅ Is in the request's critical path
✅ Is accessed with significant frequency
✅ Limits the system's maximum throughput
✅ Improving it improves user experience
Amdahl's Law
The concept
Maximum speedup = 1 / ((1 - P) + P/S)
Where:
P = fraction of total time spent in the component
S = speedup factor applied to that component
Example:
If component A represents 10% of total time, even improving A by 10x
yields only ~10% overall speedup, and the ceiling is ~11% even if A
became instantaneous
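A minimal sketch of the formula in Python, using the 10% / 10x example above (the function name is just illustrative):

```python
def amdahl_speedup(p: float, s: float) -> float:
    """Overall speedup when a fraction p of total time is sped up by factor s."""
    return 1 / ((1 - p) + p / s)

print(amdahl_speedup(0.10, 10))  # ~1.10 -> roughly a 10% overall gain
print(1 / (1 - 0.10))            # ~1.11 -> the ceiling, if the component became instantaneous
```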
Practical implication
Scenario: 1000ms request
Components:
- API Gateway: 50ms (5%)
- Auth: 100ms (10%)
- Business Logic: 150ms (15%)
- Database: 600ms (60%)
- Response: 100ms (10%)
If we optimize Auth by 50%:
Savings: 50ms
New total: 950ms
Improvement: 5%
If we optimize the Database by 50%:
Savings: 300ms
New total: 700ms
Improvement: 30%
→ Always attack the largest contributor first
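A quick sketch that reproduces the arithmetic of this scenario and ranks components by how much a 50% optimization of each would save (component names and times are the ones listed above):

```python
# Latency breakdown of the 1000 ms request from the scenario above
components_ms = {
    "API Gateway": 50,
    "Auth": 100,
    "Business Logic": 150,
    "Database": 600,
    "Response": 100,
}
total_ms = sum(components_ms.values())  # 1000 ms

# Effect of optimizing each component by 50%, largest contributor first
for name, ms in sorted(components_ms.items(), key=lambda kv: kv[1], reverse=True):
    saved = ms * 0.5
    print(f"{name:15s} saves {saved:4.0f} ms -> new total {total_ms - saved:.0f} ms "
          f"({saved / total_ms:.0%} improvement)")
```

Running it shows the Database line dominating (300 ms saved, 30% improvement), exactly as in the table.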
Methodology for Identifying Bottlenecks
Step 1: Map the complete flow
User request
↓
┌─────────────────┐
│ Load Balancer │ → Metrics: latency, connections
└────────┬────────┘
↓
┌─────────────────┐
│ API Gateway │ → Metrics: rate, errors, duration
└────────┬────────┘
↓
┌─────────────────┐
│ Auth Service │ → Metrics: cache hit, token time
└────────┬────────┘
↓
┌─────────────────┐
│ Order Service │ → Metrics: processing time
├────────┬────────┤
│ ↓ │ ↓ │
│ DB │ Cache │ → Metrics: query time, hit rate
└────────┴────────┘
↓
Response
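One way to get this map automatically is distributed tracing. A minimal OpenTelemetry sketch in Python (the span and service names mirror the hops in the diagram and are illustrative; it assumes the SDK and an exporter are configured elsewhere):

```python
from opentelemetry import trace

tracer = trace.get_tracer("order-service")  # service name is an assumption

def handle_checkout(request):
    with tracer.start_as_current_span("checkout"):       # the whole request
        with tracer.start_as_current_span("auth"):        # Auth Service hop
            ...  # call the auth service here
        with tracer.start_as_current_span("db.query"):    # Database hop
            ...  # run the order query here
        with tracer.start_as_current_span("cache.get"):   # Cache hop
            ...  # read from the cache here
```

Each span's duration then shows up per hop in the tracing backend, which is exactly the breakdown used in the next steps.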
Step 2: Measure each component
# Latency by component
histogram_quantile(0.95,
sum by(component, le) (rate(component_duration_seconds_bucket[5m]))
)
# Relative time by component
sum by(component) (rate(component_duration_seconds_sum[5m]))
/ scalar(sum(rate(request_duration_seconds_sum[5m])))
# Calls by component
sum by(component) (rate(component_calls_total[5m]))
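These queries assume each service exports a per-component duration histogram and a call counter. A minimal way to emit them with the Python prometheus_client library (metric and label names match the queries above; the handler is illustrative):

```python
from prometheus_client import Counter, Histogram, start_http_server

COMPONENT_DURATION = Histogram(
    "component_duration_seconds",           # exposes _bucket, _sum and _count
    "Time spent in each internal component",
    ["component"],
)
COMPONENT_CALLS = Counter(
    "component_calls",                      # exposed as component_calls_total
    "Calls to each internal component",
    ["component"],
)

def run_query():
    ...  # placeholder for the real database call

def handle_request():
    COMPONENT_CALLS.labels(component="db").inc()
    with COMPONENT_DURATION.labels(component="db").time():
        run_query()

if __name__ == "__main__":
    start_http_server(8000)  # /metrics endpoint for Prometheus to scrape
    handle_request()
```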
Step 3: Identify the largest contributor
Trace analysis (example):
Total request: 450ms
Breakdown:
├─ Gateway: 15ms (3%)
├─ Auth: 25ms (6%)
├─ Validation: 10ms (2%)
├─ DB Query 1: 180ms (40%) ← BOTTLENECK
├─ DB Query 2: 120ms (27%) ← SECOND LARGEST
├─ External API: 80ms (18%)
└─ Serialization: 20ms (4%)
Focus: DB Query 1 (40% of time)
Step 4: Validate hypothesis
Before optimizing, confirm:
1. Is it consistent?
→ Does bottleneck appear in multiple traces?
2. Is it frequent?
→ How many requests go through this path?
3. Does it impact the user?
→ Is it in the critical path of the journey?
4. Is it optimizable?
→ Is there realistic improvement potential?
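For the "is it frequent?" question, you can ask Prometheus directly. A small sketch against its HTTP query API (the Prometheus URL and the component label value are assumptions for illustration):

```python
import requests

PROM_URL = "http://prometheus:9090"  # assumption: adjust to your environment

# Requests per second flowing through the suspected component
query = 'sum(rate(component_calls_total{component="db_query_1"}[5m]))'
resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=5)
result = resp.json()["data"]["result"]
rate = float(result[0]["value"][1]) if result else 0.0
print(f"~{rate:.1f} calls/s go through the suspected bottleneck")
```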
Types of Bottlenecks
1. CPU bottleneck
Symptoms:
- CPU at 100%
- Latency increases with load
- Throughput plateaus at fixed point
Common causes:
- Inefficient algorithms (O(n²))
- Excessive serialization/deserialization
- CPU-bound cryptography
- Complex regex
Diagnosis:
- CPU profiler (async-profiler, py-spy)
- top/htop to identify process
- perf for flame graphs
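As a quick first pass in Python before reaching for py-spy or perf, the built-in cProfile module already shows where CPU time goes (suspect_path is a placeholder for the real code path):

```python
import cProfile
import pstats

def suspect_path():
    # placeholder for the request handler or job under suspicion
    return sum(i * i for i in range(1_000_000))

profiler = cProfile.Profile()
profiler.enable()
suspect_path()
profiler.disable()

# Top 10 functions by cumulative CPU time
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```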
2. I/O bottleneck
Symptoms:
- Low CPU, high latency
- Many connections in WAIT
- High iowait
Common causes:
- Queries without index
- Slow disk
- Network latency
- Connection starvation
Diagnosis:
- iostat for disk
- netstat for connections
- Database slow query log
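On the application side, a cheap complement to the slow query log is timing your own I/O calls. A sketch of a decorator that warns when a call exceeds a threshold (the 100 ms threshold and fetch_orders are illustrative):

```python
import functools
import logging
import time

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("slow-io")

def log_if_slow(threshold_ms=100.0):
    """Warn whenever the wrapped I/O call takes longer than threshold_ms."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                if elapsed_ms > threshold_ms:
                    log.warning("%s took %.0f ms", fn.__name__, elapsed_ms)
        return wrapper
    return decorator

@log_if_slow(threshold_ms=100)
def fetch_orders(customer_id):
    time.sleep(0.15)  # placeholder standing in for a slow query

fetch_orders(42)  # logs: fetch_orders took 150 ms
```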
3. Memory bottleneck
Symptoms:
- OOM events
- Frequent GC
- High swap usage
Common causes:
- Memory leaks
- Unbounded caches
- Large objects in memory
Diagnosis:
- Heap dumps
- GC logs
- Memory profilers
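In Python specifically, the standard-library tracemalloc module is a lightweight way to see which lines allocate the most memory (the list comprehension below is just a stand-in workload):

```python
import tracemalloc

tracemalloc.start()

# Run the suspect workload; this allocation only illustrates the idea
data = [str(i) * 100 for i in range(100_000)]

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:5]:
    print(stat)  # top 5 allocation sites by total size
```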
4. Concurrency bottleneck
Symptoms:
- Low CPU utilization
- High latency under load
- Threads in WAIT
Common causes:
- Lock contention (hot locks)
- Undersized connection pool
- Exhausted thread pool
Diagnosis:
- Thread dumps
- Lock profiler
- Pool metrics
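For a quick thread dump in Python without external tools, the standard library can print every thread's stack; blocked threads point straight at the contended resource (the stuck worker below is simulated for illustration):

```python
import faulthandler
import sys
import threading
import time

# Simulate a worker stuck waiting on a lock held by someone else
lock = threading.Lock()
lock.acquire()
threading.Thread(target=lock.acquire, daemon=True).start()
time.sleep(0.1)

# Dump the stack of every thread: the blocked one shows up in lock.acquire
faulthandler.dump_traceback(file=sys.stderr, all_threads=True)
```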
Diagnostic Tools
For code
Java:
- async-profiler (CPU, allocation)
- JFR (Java Flight Recorder)
- VisualVM
Python:
- py-spy (CPU profiler)
- memory_profiler
- cProfile
Node.js:
- clinic.js
- 0x (flame graphs)
- v8-profiler
Go:
- pprof
- trace
- bench
For system
Linux:
- perf (CPU profiling)
- strace (system calls)
- eBPF/bcc tools
- sar (historical)
Containers:
- cAdvisor
- Prometheus node_exporter
- kubectl top
For database
PostgreSQL:
- pg_stat_statements
- EXPLAIN ANALYZE
- pg_stat_user_tables
MySQL:
- slow query log
- EXPLAIN
- performance_schema
Redis:
- SLOWLOG
- INFO stats
- MEMORY DOCTOR
Practical Example: Complete Investigation
Scenario
Problem: Slow checkout (p95 = 3s, SLO = 1s)
Investigation
## Step 1: General metrics
- p95 checkout: 3.2s
- Throughput: 50 req/s
- Error rate: 0.5%
- CPU: 35%, Memory: 60%
→ Not an infrastructure resource problem
## Step 2: Breakdown by service
- API Gateway: 100ms (3%)
- Cart Service: 200ms (6%)
- Inventory: 300ms (10%)
- Payment: 2400ms (75%) ← SUSPECT
- Notification: 200ms (6%)
→ Payment service dominates time
## Step 3: Drill-down into Payment
- Auth: 50ms
- Validation: 100ms
- Stripe API: 2200ms ← BOTTLENECK
- Logging: 50ms
→ External call to Stripe is the bottleneck
## Step 4: Stripe call analysis
- Configured timeout: 30s (too high)
- Retries: 3 (with a 30s timeout each, a failing call can block for 90s or more)
- p50: 400ms, p99: 8s
- Very high variance
→ Stripe has variable latency, no circuit breaker
## Step 5: Validation
- 95% of checkouts go through Stripe
- It's the critical path (no payment, no sale)
- Directly impacts conversion
## Root cause:
Variable latency from Stripe API + inadequate
timeout/retry configuration
## Solution:
1. Timeout: 30s → 5s
2. Retry: exponential backoff
3. Circuit breaker: fail fast if Stripe unstable
4. Cache: card validations when possible
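A hedged sketch of what items 1-3 could look like in Python with requests and tenacity; the endpoint, thresholds, and the minimal hand-rolled circuit breaker are illustrative, not the team's actual fix or Stripe's official client:

```python
import time
import requests
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

FAILURES, OPENED_AT = 0, 0.0   # minimal circuit-breaker state (illustrative)
FAIL_MAX, COOL_DOWN_S = 5, 30  # open after 5 failures, try again after 30 s

@retry(retry=retry_if_exception_type(requests.RequestException),
       stop=stop_after_attempt(3),
       wait=wait_exponential(multiplier=0.5, max=5),
       reraise=True)
def charge(payload):
    global FAILURES, OPENED_AT
    if FAILURES >= FAIL_MAX and time.time() - OPENED_AT < COOL_DOWN_S:
        raise RuntimeError("circuit open: failing fast instead of waiting on Stripe")
    try:
        # 1) tight 5 s timeout instead of 30 s; 2) exponential backoff via @retry
        resp = requests.post("https://api.stripe.com/v1/charges", data=payload, timeout=5)
        resp.raise_for_status()
        FAILURES = 0
        return resp.json()
    except requests.RequestException:
        FAILURES += 1
        OPENED_AT = time.time()
        raise
```

In production you would reach for a proper resilience library instead of module-level globals; the point is only that timeout, backoff, and fail-fast live together around the external call.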
Common Pitfalls
1. Optimizing what's easy, not what matters
❌ "I'll cache this endpoint because I know how"
→ Endpoint represents 0.1% of traffic
✅ "I'll investigate what really matters"
→ Identify the 20% causing 80% of the problem
2. Confusing latency with bottleneck
❌ "This function takes 500ms, it's the bottleneck"
→ But it's called 1x per hour
✅ "This function takes 5ms but is called 10K/s"
→ That is 50 seconds of aggregate CPU time per wall-clock second
3. Ignoring cascade effects
❌ "Service A is slow"
→ But A depends on B which depends on C
✅ "A is slow because C is saturated"
→ Fixing C fixes A
Conclusion
Identifying real bottlenecks requires method:
- Map the complete flow of the request
- Measure each component in the critical path
- Identify the largest contributor (Amdahl's Law)
- Validate the hypothesis before optimizing
- Attack the real bottleneck, not the perceived one
The biggest waste in performance work is optimizing something that isn't the bottleneck.
This article is part of the series on the OCTOPUS Performance Engineering methodology.