"Let's increase server memory." How much? "Double it." Why? "It seems like it will help." This isn't tuning — it's guessing. Data-driven tuning means using real metrics to identify what to adjust, how much to adjust, and validate if it worked.
Every adjustment must have a hypothesis, a metric, and an expected result.
The Scientific Tuning Process
The method
1. Observe: Collect metrics of current state
2. Hypothesize: Formulate what to adjust and why
3. Predict: Estimate expected impact
4. Test: Apply change in controlled environment
5. Measure: Collect post-change metrics
6. Validate: Compare prediction with result
7. Document: Record for future reference
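A lightweight way to keep yourself honest about these steps is to capture each experiment as a structured record. A minimal sketch in Python; the field names are illustrative assumptions, not part of the methodology itself:

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class TuningExperiment:
    """One pass through the observe -> document loop (illustrative field names)."""
    observation: str          # what the current metrics show
    hypothesis: str           # what to adjust and why
    prediction: str           # expected, quantified impact
    baseline: dict = field(default_factory=dict)  # metrics before the change
    result: dict = field(default_factory=dict)    # metrics after the change
    validated: Optional[bool] = None              # did the result match the prediction?
    notes: str = ""                               # record for future reference
    recorded_on: date = field(default_factory=date.today)

exp = TuningExperiment(
    observation="p99 of /api/orders is 350ms while p50 is 45ms",
    hypothesis="Index on orders(user_id, created_at) removes full scans",
    prediction="p99 drops below 100ms",
)
```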
Why it works
Benefits:
- Avoids ineffective changes
- Prioritizes by real impact
- Creates documented knowledge
- Enables data-based rollback
- Communicates value to stakeholders
Collecting Data Before Tuning
Essential metrics
Performance:
- Latency (p50, p95, p99)
- Throughput (req/s, tx/s)
- Error rate
Resources:
- CPU utilization
- Memory usage
- Disk I/O
- Network bandwidth
Application:
- Connection pool usage
- Thread pool usage
- Cache hit rate
- GC time/frequency
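If you already export raw latency samples, the percentiles above are easy to compute yourself. A small sketch with NumPy, assuming `samples_ms` comes from your load tool or APM export:

```python
import numpy as np

def latency_summary(samples_ms):
    """Summarize raw latency samples (in milliseconds) into baseline percentiles."""
    samples = np.asarray(samples_ms, dtype=float)
    return {
        "p50": float(np.percentile(samples, 50)),
        "p95": float(np.percentile(samples, 95)),
        "p99": float(np.percentile(samples, 99)),
        "count": int(samples.size),
    }

# Example: samples collected over a steady-state window
print(latency_summary([42, 45, 47, 51, 120, 44, 350, 46, 48, 95]))
```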
Documented baseline
## Baseline - System XYZ
Date: 2024-01-15
Load: 500 req/s (steady state)
### Latency
| Endpoint | p50 | p95 | p99 |
|----------|-----|-----|-----|
| /api/orders | 45ms | 120ms | 350ms |
| /api/products | 30ms | 80ms | 200ms |
### Resources
| Component | Usage | Limit |
|-----------|-------|-------|
| App CPU | 65% | 100% |
| App Memory | 4.2GB | 8GB |
| DB Connections | 45 | 100 |
| Redis Memory | 2.1GB | 4GB |
### Identified Bottlenecks
1. /api/orders p99 high (350ms)
2. DB connections already at 45% of the limit under normal load (limited headroom for peaks)
Formulating Hypotheses
Hypothesis template
## Hypothesis #1
**Observation**:
p99 of /api/orders is 350ms, while p50 is 45ms
**Analysis**:
- Traces show high variance in the order-history query
- The query does a full scan when the history exceeds 1000 items
**Hypothesis**:
Adding index on orders(user_id, created_at) will reduce p99
**Prediction**:
- Current p99: 350ms
- Expected p99: < 100ms
- Reduction: > 70%
**Risk**:
- Index adds ~5% overhead on writes
- Additional space: ~500MB
**Decision**: Proceed (write overhead acceptable)
Prioritizing hypotheses
Prioritization criteria:
Impact:
- How many users affected?
- How much time saved?
- What business value?
Effort:
- How long to implement?
- What complexity?
- What risk?
Confidence:
- How certain is the hypothesis?
- Do we have enough data?
Decision matrix:
High impact + Low effort + High confidence → DO FIRST
High impact + High effort + High confidence → PLAN
Low impact + Any effort → IGNORE
Any + Low confidence → INVESTIGATE MORE
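The matrix can be reduced to a small scoring helper so triage stays consistent across the team. A sketch; the tie-break order (confidence first) is my assumption, adjust it to your own rules:

```python
def triage(impact, effort, confidence):
    """Map impact/effort/confidence ('high' or 'low') onto the decision matrix."""
    if confidence == "low":
        return "INVESTIGATE MORE"   # low confidence wins over everything else
    if impact == "low":
        return "IGNORE"
    if effort == "low":
        return "DO FIRST"
    return "PLAN"

assert triage("high", "low", "high") == "DO FIRST"
assert triage("high", "high", "high") == "PLAN"
assert triage("low", "high", "high") == "IGNORE"
assert triage("high", "low", "low") == "INVESTIGATE MORE"
```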
Executing Controlled Tests
Test environment
Requirements:
- Similar to production (data, load)
- Isolated (no interference)
- Monitored (all metrics)
- Reproducible (same conditions)
Process:
1. Capture baseline in test environment
2. Apply single change
3. Execute same load
4. Collect metrics
5. Compare with baseline
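The process is easiest to keep honest when the harness itself enforces "same load, single change, same metrics". A sketch below; `apply_change`, `run_load`, and `collect_metrics` are hypothetical hooks you would wire to your own load tool and monitoring:

```python
def controlled_test(change, apply_change, run_load, collect_metrics, duration_s=1800):
    """Run baseline and change under identical load; return both metric sets."""
    run_load(duration_s)               # 1. capture baseline under the target load profile
    baseline = collect_metrics()
    apply_change(change)               # 2. apply a single change
    run_load(duration_s)               # 3. execute the same load
    changed = collect_metrics()        # 4. collect the same metrics
    return {"baseline": baseline, "change": changed}  # 5. compare with baseline
```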
One change at a time
❌ Wrong:
"I'll increase memory, threads and timeout at once"
→ Don't know which change caused the effect
✅ Correct:
Test 1: Increase memory → Measure
Test 2: Increase threads → Measure
Test 3: Increase timeout → Measure
→ Know the impact of each change
Adequate duration
Minimum time:
- Smoke test: 5 minutes (validate it works)
- Baseline: 30 minutes (stabilize)
- Change test: 30 minutes (same time)
Why:
- JIT needs to warm up
- Caches need to populate
- Metrics need to stabilize
- Variance needs to be captured
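One practical consequence: discard the warm-up window before computing statistics, otherwise the cold JIT and empty caches skew the comparison. A minimal sketch, assuming timestamped samples:

```python
def drop_warmup(samples, warmup_s=300):
    """Discard samples taken during the warm-up window (first `warmup_s` seconds).

    `samples` is a list of (timestamp_seconds, latency_ms) tuples.
    """
    if not samples:
        return []
    start = samples[0][0]
    return [latency for ts, latency in samples if ts - start >= warmup_s]
```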
Analyzing Results
Structured comparison
## Result - Hypothesis #1 (Index on orders)
### Before/After Metrics
| Metric | Before | After | Δ |
|--------|--------|-------|---|
| p50 | 45ms | 42ms | -7% |
| p95 | 120ms | 65ms | -46% |
| p99 | 350ms | 85ms | -76% ✓ |
| Write latency | 5ms | 5.2ms | +4% |
| DB CPU | 35% | 32% | -9% |
### Hypothesis Validation
- Prediction: p99 < 100ms
- Result: p99 = 85ms
- Status: ✅ Confirmed
### Side Effects
- Write overhead: +4% (acceptable)
- Disk space: +450MB (acceptable)
### Decision
Apply to production ✓
Statistical significance
Cautions:
- Natural variance exists
- Small sample can deceive
- Compare distributions, not just averages
Techniques:
- t-test for comparing means
- Mann-Whitney for distributions
- Bootstrap for confidence intervals
Rule of thumb:
- Difference > 10%: probably significant
- Difference < 5%: may be noise
- Between 5-10%: needs more data
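With SciPy these checks take a few lines. A sketch comparing two latency distributions, assuming you kept the per-request samples from both runs; Mann-Whitney is used because latency is rarely normally distributed:

```python
from scipy import stats

def compare_runs(before_ms, after_ms, alpha=0.05):
    """Compare two latency distributions without assuming normality."""
    u_stat, p_value = stats.mannwhitneyu(before_ms, after_ms, alternative="two-sided")
    verdict = "significant" if p_value < alpha else "may be noise"
    return {"U": float(u_stat), "p_value": float(p_value), "verdict": verdict}

# Example with synthetic samples
before = [120, 130, 115, 140, 350, 125, 118, 133]
after = [60, 65, 58, 70, 85, 62, 59, 66]
print(compare_runs(before, after))
```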
Common Types of Tuning
JVM Tuning
Common parameters:
Heap size:
-Xms, -Xmx
Hypothesis: "Increasing heap reduces GC"
Metric: GC time, GC frequency
GC algorithm:
-XX:+UseG1GC, -XX:+UseZGC
Hypothesis: "G1 better for latency"
Metric: p99, GC pause time
Thread pools:
-XX:ParallelGCThreads
Hypothesis: "More GC threads reduces pause"
Metric: GC pause duration
Test example:
Baseline: -Xmx4g -XX:+UseG1GC
Test 1: -Xmx8g -XX:+UseG1GC
Test 2: -Xmx4g -XX:+UseZGC
→ Compare p99 and throughput
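One way to keep flag experiments honest is to script the sweep so every configuration runs the exact same workload. A sketch using only the standard library; `benchmark.jar` is a placeholder for whatever workload driver you use:

```python
import subprocess

# Each test changes exactly one thing relative to the baseline.
CONFIGS = {
    "baseline": ["-Xmx4g", "-XX:+UseG1GC"],
    "bigger-heap": ["-Xmx8g", "-XX:+UseG1GC"],
    "zgc": ["-Xmx4g", "-XX:+UseZGC"],
}

for name, flags in CONFIGS.items():
    # Run the same benchmark workload under each JVM configuration.
    subprocess.run(["java", *flags, "-jar", "benchmark.jar"], check=True)
    print(f"finished run: {name}")  # p99 and throughput come from the benchmark's own report
```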
Database Tuning
PostgreSQL:
shared_buffers:
Hypothesis: "More buffer = less disk I/O"
Metric: buffer hit ratio, disk reads
work_mem:
Hypothesis: "More work_mem = in-memory sorts"
Metric: temp file usage, query time
max_connections:
Hypothesis: "More connections = more concurrency"
Caution: Can have inverse effect!
Test example:
Baseline: shared_buffers = 1GB
Test 1: shared_buffers = 2GB
Test 2: shared_buffers = 4GB
→ Measure buffer hit ratio and latency
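The buffer hit ratio itself can be read straight from PostgreSQL's `pg_stat_database` view. A sketch with psycopg2; the connection string is an assumption for illustration:

```python
import psycopg2

QUERY = """
SELECT round(100.0 * sum(blks_hit) / nullif(sum(blks_hit) + sum(blks_read), 0), 2)
FROM pg_stat_database;
"""

def buffer_hit_ratio(dsn="dbname=app"):
    """Return the cluster-wide buffer hit ratio (%) from PostgreSQL statistics."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(QUERY)
            return cur.fetchone()[0]
```

Capture this number before and after each shared_buffers change, under the same load, so the comparison is apples to apples.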
Connection Pool Tuning
Parameters:
Pool size:
Formula: connections = (cores * 2) + disk_spindles
Hypothesis: "Pool too large causes contention"
Timeout:
Hypothesis: "Short timeout fails fast"
Idle timeout:
Hypothesis: "Keeping connections avoids overhead"
Metrics:
- Connection wait time
- Pool utilization
- Timeout errors
- DB connection count
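The sizing formula above is easy to encode and sanity-check against your hardware. A sketch; treat the result as a starting point for the test, not a final answer:

```python
import os

def suggested_pool_size(disk_spindles=1, cores=None):
    """Starting point for pool size: (cores * 2) + disk_spindles."""
    cores = cores or os.cpu_count() or 1
    return cores * 2 + disk_spindles

# e.g. 8 cores and one SSD treated as a single spindle -> 17 connections
print(suggested_pool_size(disk_spindles=1, cores=8))
```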
Documenting Tuning
Documentation template
# Tuning Log - System XYZ
## Entry #1 - 2024-01-15
### Change
Parameter: PostgreSQL shared_buffers
Previous value: 1GB
New value: 4GB
### Motivation
Buffer hit ratio at 85%, target is > 95%
### Test
- Environment: Staging
- Load: 500 req/s for 1 hour
- Baseline captured: yes
### Results
| Metric | Before | After |
|--------|--------|-------|
| Buffer hit ratio | 85% | 96% |
| Avg query time | 15ms | 8ms |
| p99 query time | 150ms | 45ms |
### Decision
Applied to production on 2024-01-16
### Follow-up (1 week later)
Results maintained in production ✓
Tuning library
Keep record of:
- Changes that worked
- Changes that didn't work
- Context (version, load, hardware)
- Observed side effects
Benefits:
- Avoids repeating mistakes
- Accelerates troubleshooting
- Eases onboarding of new members
- Serves as a base for automation
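The library does not need special tooling; an append-only log that every experiment writes to is enough. A sketch using JSON Lines, with hypothetical field names:

```python
import json
from datetime import date

def record_tuning_entry(path, parameter, before, after, worked, context, side_effects=""):
    """Append one tuning result to a JSON Lines log for future reference."""
    entry = {
        "date": date.today().isoformat(),
        "parameter": parameter,
        "before": before,
        "after": after,
        "worked": worked,            # also record what did NOT work
        "context": context,          # version, load, hardware
        "side_effects": side_effects,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

record_tuning_entry(
    "tuning-log.jsonl",
    parameter="shared_buffers",
    before="1GB", after="4GB", worked=True,
    context="PostgreSQL 15, 500 req/s, staging",
    side_effects="none observed",
)
```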
Tuning Anti-Patterns
1. Cargo cult tuning
❌ "I read that 4GB heap is good"
→ Applies without measuring
✅ "I'll test 2GB, 4GB and 8GB"
→ Measures each configuration in real context
2. Tuning without baseline
❌ "I increased threads and it seems faster"
→ No data to compare
✅ "Baseline: 100ms p95. After change: 60ms p95"
→ Quantified improvement
3. Multiple simultaneous changes
❌ "I changed heap, threads and timeout. It got better!"
→ Which change helped? Which might have hurt?
✅ One change at a time
→ Understands the impact of each adjustment
4. Ignoring side effects
❌ "p99 improved 50%!"
→ But throughput dropped 30%
✅ Evaluate all relevant metrics
→ Net positive improvement
Conclusion
Data-driven tuning means:
- Measure before - documented baseline
- Formulate hypothesis - quantified prediction
- Test isolated - one change at a time
- Validate result - compare with prediction
- Document - create reusable knowledge
The result: changes that demonstrably improve performance, not guesses that might make things worse.
Tuning without data is gambling. Tuning with data is engineering.
This article is part of the series on the OCTOPUS Performance Engineering methodology.