"Let's increase server memory." How much? "Double it." Why? "It seems like it will help." This isn't tuning — it's guessing. Data-driven tuning means using real metrics to identify what to adjust, how much to adjust, and validate if it worked.
Every adjustment must have a hypothesis, a metric, and an expected result.
The Scientific Tuning Process
The method
1. Observe: Collect metrics of current state
2. Hypothesize: Formulate what to adjust and why
3. Predict: Estimate expected impact
4. Test: Apply change in controlled environment
5. Measure: Collect post-change metrics
6. Validate: Compare prediction with result
7. Document: Record for future reference
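A lightweight way to keep yourself honest about these steps is to capture each experiment as a structured record. A minimal sketch in Python; the field names are illustrative assumptions, not part of the methodology itself:

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class TuningExperiment:
    """One pass through the observe -> document loop (illustrative field names)."""
    observation: str          # what the current metrics show
    hypothesis: str           # what to adjust and why
    prediction: str           # expected, quantified impact
    baseline: dict = field(default_factory=dict)  # metrics before the change
    result: dict = field(default_factory=dict)    # metrics after the change
    validated: Optional[bool] = None              # did the result match the prediction?
    notes: str = ""                               # record for future reference
    recorded_on: date = field(default_factory=date.today)

exp = TuningExperiment(
    observation="p99 of /api/orders is 350ms while p50 is 45ms",
    hypothesis="Index on orders(user_id, created_at) removes full scans",
    prediction="p99 drops below 100ms",
)
```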
Why it works
Benefits:
- Avoids ineffective changes
- Prioritizes by real impact
- Creates documented knowledge
- Enables data-based rollback
- Communicates value to stakeholders
Collecting Data Before Tuning
Essential metrics
Performance:
- Latency (p50, p95, p99)
- Throughput (req/s, tx/s)
- Error rate
Resources:
- CPU utilization
- Memory usage
- Disk I/O
- Network bandwidth
Application:
- Connection pool usage
- Thread pool usage
- Cache hit rate
- GC time/frequency
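If you already export raw latency samples, the percentiles above are easy to compute yourself. A small sketch with NumPy, assuming `samples_ms` comes from your load tool or APM export:

```python
import numpy as np

def latency_summary(samples_ms):
    """Summarize raw latency samples (in milliseconds) into baseline percentiles."""
    samples = np.asarray(samples_ms, dtype=float)
    return {
        "p50": float(np.percentile(samples, 50)),
        "p95": float(np.percentile(samples, 95)),
        "p99": float(np.percentile(samples, 99)),
        "count": int(samples.size),
    }

# Example: samples collected over a steady-state window
print(latency_summary([42, 45, 47, 51, 120, 44, 350, 46, 48, 95]))
```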
Documented baseline
## Baseline - System XYZ
Date: 2024-01-15
Load: 500 req/s (steady state)
### Latency
| Endpoint | p50 | p95 | p99 |
|----------|-----|-----|-----|
| /api/orders | 45ms | 120ms | 350ms |
| /api/products | 30ms | 80ms | 200ms |
### Resources
| Component | Usage | Limit |
|-----------|-------|-------|
| App CPU | 65% | 100% |
| App Memory | 4.2GB | 8GB |
| DB Connections | 45 | 100 |
| Redis Memory | 2.1GB | 4GB |
### Identified Bottlenecks
1. /api/orders p99 high (350ms)
2. DB connections already at 45% of the limit under normal load (limited headroom for peaks)
Formulating Hypotheses
Hypothesis template
## Hypothesis #1
**Observation**:
p99 of /api/orders is 350ms, while p50 is 45ms
**Analysis**:
- Traces show high variance in the order-history query
- The query does a full scan when the history exceeds 1000 items
**Hypothesis**:
Adding index on orders(user_id, created_at) will reduce p99
**Prediction**:
- Current p99: 350ms
- Expected p99: < 100ms
- Reduction: > 70%
**Risk**:
- Index adds ~5% overhead on writes
- Additional space: ~500MB
**Decision**: Proceed (write overhead acceptable)
Prioritizing hypotheses
Prioritization criteria:
Impact:
- How many users affected?
- How much time saved?
- What business value?
Effort:
- How long to implement?
- What complexity?
- What risk?
Confidence:
- How certain is the hypothesis?
- Do we have enough data?
Decision matrix:
High impact + Low effort + High confidence → DO FIRST
High impact + High effort + High confidence → PLAN
Low impact + Any effort → IGNORE
Any + Low confidence → INVESTIGATE MORE
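The matrix can be reduced to a small scoring helper so triage stays consistent across the team. A sketch; the tie-break order (confidence first) is my assumption, adjust it to your own rules:

```python
def triage(impact, effort, confidence):
    """Map impact/effort/confidence ('high' or 'low') onto the decision matrix."""
    if confidence == "low":
        return "INVESTIGATE MORE"   # low confidence wins over everything else
    if impact == "low":
        return "IGNORE"
    if effort == "low":
        return "DO FIRST"
    return "PLAN"

assert triage("high", "low", "high") == "DO FIRST"
assert triage("high", "high", "high") == "PLAN"
assert triage("low", "high", "high") == "IGNORE"
assert triage("high", "low", "low") == "INVESTIGATE MORE"
```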
Executing Controlled Tests
Test environment
Requirements:
- Similar to production (data, load)
- Isolated (no interference)
- Monitored (all metrics)
- Reproducible (same conditions)
Process:
1. Capture baseline in test environment
2. Apply single change
3. Execute same load
4. Collect metrics
5. Compare with baseline
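The process is easiest to keep honest when the harness itself enforces "same load, single change, same metrics". A sketch below; `apply_change`, `run_load`, and `collect_metrics` are hypothetical hooks you would wire to your own load tool and monitoring:

```python
def controlled_test(change, apply_change, run_load, collect_metrics, duration_s=1800):
    """Run baseline and change under identical load; return both metric sets."""
    run_load(duration_s)               # 1. capture baseline under the target load profile
    baseline = collect_metrics()
    apply_change(change)               # 2. apply a single change
    run_load(duration_s)               # 3. execute the same load
    changed = collect_metrics()        # 4. collect the same metrics
    return {"baseline": baseline, "change": changed}  # 5. compare with baseline
```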
One change at a time
❌ Wrong:
"I'll increase memory, threads and timeout at once"
→ Don't know which change caused the effect
✅ Correct:
Test 1: Increase memory → Measure
Test 2: Increase threads → Measure
Test 3: Increase timeout → Measure
→ Know the impact of each change
Adequate duration
Minimum time:
- Smoke test: 5 minutes (validate it works)
- Baseline: 30 minutes (stabilize)
- Change test: 30 minutes (same time)
Why:
- JIT needs to warm up
- Caches need to populate
- Metrics need to stabilize
- Variance needs to be captured
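One practical consequence: discard the warm-up window before computing statistics, otherwise the cold JIT and empty caches skew the comparison. A minimal sketch, assuming timestamped samples:

```python
def drop_warmup(samples, warmup_s=300):
    """Discard samples taken during the warm-up window (first `warmup_s` seconds).

    `samples` is a list of (timestamp_seconds, latency_ms) tuples.
    """
    if not samples:
        return []
    start = samples[0][0]
    return [latency for ts, latency in samples if ts - start >= warmup_s]
```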
Analyzing Results
Structured comparison
## Result - Hypothesis #1 (Index on orders)
### Before/After Metrics
| Metric | Before | After | Δ |
|--------|--------|-------|---|
| p50 | 45ms | 42ms | -7% |
| p95 | 120ms | 65ms | -46% |
| p99 | 350ms | 85ms | -76% ✓ |
| Write latency | 5ms | 5.2ms | +4% |
| DB CPU | 35% | 32% | -9% |
### Hypothesis Validation
- Prediction: p99 < 100ms
- Result: p99 = 85ms
- Status: ✅ Confirmed
### Side Effects
- Write overhead: +4% (acceptable)
- Disk space: +450MB (acceptable)
### Decision
Apply to production ✓
Statistical significance
Cautions:
- Natural variance exists
- Small sample can deceive
- Compare distributions, not just averages
Techniques:
- t-test for comparing means
- Mann-Whitney for distributions
- Bootstrap for confidence intervals
Rule of thumb:
- Difference > 10%: probably significant
- Difference < 5%: may be noise
- Between 5-10%: needs more data
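With SciPy these checks take a few lines. A sketch comparing two latency distributions, assuming you kept the per-request samples from both runs; Mann-Whitney is used because latency is rarely normally distributed:

```python
from scipy import stats

def compare_runs(before_ms, after_ms, alpha=0.05):
    """Compare two latency distributions without assuming normality."""
    u_stat, p_value = stats.mannwhitneyu(before_ms, after_ms, alternative="two-sided")
    verdict = "significant" if p_value < alpha else "may be noise"
    return {"U": float(u_stat), "p_value": float(p_value), "verdict": verdict}

# Example with synthetic samples
before = [120, 130, 115, 140, 350, 125, 118, 133]
after = [60, 65, 58, 70, 85, 62, 59, 66]
print(compare_runs(before, after))
```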
Common Types of Tuning
JVM Tuning
Common parameters:
Heap size:
-Xms, -Xmx
Hypothesis: "Increasing heap reduces GC"
Metric: GC time, GC frequency
GC algorithm:
-XX:+UseG1GC, -XX:+UseZGC
Hypothesis: "G1 better for latency"
Metric: p99, GC pause time
Thread pools:
-XX:ParallelGCThreads
Hypothesis: "More GC threads reduces pause"
Metric: GC pause duration
Test example:
Baseline: -Xmx4g -XX:+UseG1GC
Test 1: -Xmx8g -XX:+UseG1GC
Test 2: -Xmx4g -XX:+UseZGC
→ Compare p99 and throughput
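One way to keep flag experiments honest is to script the sweep so every configuration runs the exact same workload. A sketch using only the standard library; `benchmark.jar` is a placeholder for whatever workload driver you use:

```python
import subprocess

# Each test changes exactly one thing relative to the baseline.
CONFIGS = {
    "baseline": ["-Xmx4g", "-XX:+UseG1GC"],
    "bigger-heap": ["-Xmx8g", "-XX:+UseG1GC"],
    "zgc": ["-Xmx4g", "-XX:+UseZGC"],
}

for name, flags in CONFIGS.items():
    # Run the same benchmark workload under each JVM configuration.
    subprocess.run(["java", *flags, "-jar", "benchmark.jar"], check=True)
    print(f"finished run: {name}")  # p99 and throughput come from the benchmark's own report
```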
Database Tuning
PostgreSQL:
shared_buffers:
Hypothesis: "More buffer = less disk I/O"
Metric: buffer hit ratio, disk reads
work_mem:
Hypothesis: "More work_mem = in-memory sorts"
Metric: temp file usage, query time
max_connections:
Hypothesis: "More connections = more concurrency"
Caution: Can have inverse effect!
Test example:
Baseline: shared_buffers = 1GB
Test 1: shared_buffers = 2GB
Test 2: shared_buffers = 4GB
→ Measure buffer hit ratio and latency
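The buffer hit ratio itself can be read straight from PostgreSQL's `pg_stat_database` view. A sketch with psycopg2; the connection string is an assumption for illustration:

```python
import psycopg2

QUERY = """
SELECT round(100.0 * sum(blks_hit) / nullif(sum(blks_hit) + sum(blks_read), 0), 2)
FROM pg_stat_database;
"""

def buffer_hit_ratio(dsn="dbname=app"):
    """Return the cluster-wide buffer hit ratio (%) from PostgreSQL statistics."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(QUERY)
            return cur.fetchone()[0]
```

Capture this number before and after each shared_buffers change, under the same load, so the comparison is apples to apples.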
Connection Pool Tuning
Parameters:
Pool size:
Formula: connections = (cores * 2) + disk_spindles
Hypothesis: "Pool too large causes contention"
Timeout:
Hypothesis: "Short timeout fails fast"
Idle timeout:
Hypothesis: "Keeping connections avoids overhead"
Metrics:
- Connection wait time
- Pool utilization
- Timeout errors
- DB connection count
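The sizing formula above is easy to encode and sanity-check against your hardware. A sketch; treat the result as a starting point for the test, not a final answer:

```python
import os

def suggested_pool_size(disk_spindles=1, cores=None):
    """Starting point for pool size: (cores * 2) + disk_spindles."""
    cores = cores or os.cpu_count() or 1
    return cores * 2 + disk_spindles

# e.g. 8 cores and one SSD treated as a single spindle -> 17 connections
print(suggested_pool_size(disk_spindles=1, cores=8))
```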
Documenting Tuning
Documentation template
# Tuning Log - System XYZ
## Entry #1 - 2024-01-15
### Change
Parameter: PostgreSQL shared_buffers
Previous value: 1GB
New value: 4GB
### Motivation
Buffer hit ratio at 85%, target is > 95%
### Test
- Environment: Staging
- Load: 500 req/s for 1 hour
- Baseline captured: yes
### Results
| Metric | Before | After |
|--------|--------|-------|
| Buffer hit ratio | 85% | 96% |
| Avg query time | 15ms | 8ms |
| p99 query time | 150ms | 45ms |
### Decision
Applied to production on 2024-01-16
### Follow-up (1 week later)
Results maintained in production ✓
Tuning library
Keep record of:
- Changes that worked
- Changes that didn't work
- Context (version, load, hardware)
- Observed side effects
Benefits:
- Avoids repeating mistakes
- Accelerates troubleshooting
- Eases onboarding of new members
- Serves as a base for automation
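The library does not need special tooling; an append-only log that every experiment writes to is enough. A sketch using JSON Lines, with hypothetical field names:

```python
import json
from datetime import date

def record_tuning_entry(path, parameter, before, after, worked, context, side_effects=""):
    """Append one tuning result to a JSON Lines log for future reference."""
    entry = {
        "date": date.today().isoformat(),
        "parameter": parameter,
        "before": before,
        "after": after,
        "worked": worked,            # also record what did NOT work
        "context": context,          # version, load, hardware
        "side_effects": side_effects,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

record_tuning_entry(
    "tuning-log.jsonl",
    parameter="shared_buffers",
    before="1GB", after="4GB", worked=True,
    context="PostgreSQL 15, 500 req/s, staging",
    side_effects="none observed",
)
```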
Tuning Anti-Patterns
1. Cargo cult tuning
❌ "I read that 4GB heap is good"
→ Applies without measuring
✅ "I'll test 2GB, 4GB and 8GB"
→ Measures each configuration in real context
2. Tuning without baseline
❌ "I increased threads and it seems faster"
→ No data to compare
✅ "Baseline: 100ms p95. After change: 60ms p95"
→ Quantified improvement
3. Multiple simultaneous changes
❌ "I changed heap, threads and timeout. It got better!"
→ Which change helped? Which might have hurt?
✅ One change at a time
→ Understands the impact of each adjustment
4. Ignoring side effects
❌ "p99 improved 50%!"
→ But throughput dropped 30%
✅ Evaluate all relevant metrics
→ Net positive improvement
Conclusion
Data-driven tuning means:
- Measure before - documented baseline
- Formulate hypothesis - quantified prediction
- Test isolated - one change at a time
- Validate result - compare with prediction
- Document - create reusable knowledge
The result: changes that demonstrably improve performance, not guesses that might make things worse.
Tuning without data is gambling. Tuning with data is engineering.
This article is part of the series on the OCTOPUS Performance Engineering methodology.