Performance modeling is the art of predicting system behavior without testing exhaustively. With the right models, you can answer questions like "how many users can this system handle?" without running expensive tests.
A model is the map, not the territory: useful for navigation, dangerous to trust blindly.
Why Model
Testing costs
Real load test:
- Infrastructure: $500-5000
- Preparation time: 2-5 days
- Execution: 1-4 hours
- Analysis: 1-2 days
Mathematical model:
- Calculation: minutes
- Cost: $0
- Iterations: unlimited
When modeling is worth it
✓ Capacity planning before buying infra
✓ Quick estimates for stakeholders
✓ Compare hypothetical scenarios
✓ Understand theoretical limits
✓ Validate intuition before testing
Fundamentals: Little's Law
The most useful law in performance modeling:
L = λ × W
L = average number of items in system
λ = arrival rate (throughput)
W = average time in system (latency)
Practical examples
1. Database connections
Throughput: 100 queries/s
Average latency: 50ms = 0.05s
Active connections = 100 × 0.05 = 5 connections
2. Server capacity
Each server handles 100 simultaneous connections
Latency per request: 200ms
Throughput per server = 100 / 0.2 = 500 req/s
3. Sizing infrastructure
Goal: 10,000 req/s
Expected latency: 100ms
Simultaneous connections = 10,000 × 0.1 = 1,000
If each server handles 200 connections:
Servers needed = 1,000 / 200 = 5
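All three calculations are the same identity rearranged. A minimal helper sketch (littles_law is a name invented here):
def littles_law(throughput=None, latency=None, concurrency=None):
    """Solve L = λ × W for whichever of the three variables is missing."""
    if concurrency is None:
        return throughput * latency      # L = λ × W
    if throughput is None:
        return concurrency / latency     # λ = L / W
    return concurrency / throughput      # W = L / λ

print(littles_law(throughput=100, latency=0.05))    # 5.0 connections
print(littles_law(concurrency=100, latency=0.2))    # 500.0 req/s
print(littles_law(throughput=10_000, latency=0.1))  # 1000.0 connections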
Queuing Theory
M/M/1 Model
System with one queue and one server:
λ = arrival rate
μ = service rate
ρ = λ/μ (utilization)
For a stable system: ρ < 1
Average time in system:
W = 1 / (μ - λ)
Example:
Arrivals: 80 req/s (λ)
Service: 100 req/s (μ)
Utilization: 80/100 = 80%
Time in system: 1/(100-80) = 50ms
Utilization vs Latency:
ρ = 50%: W = 1/(100-50) = 20ms
ρ = 80%: W = 1/(100-80) = 50ms
ρ = 90%: W = 1/(100-90) = 100ms
ρ = 95%: W = 1/(100-95) = 200ms
ρ = 99%: W = 1/(100-99) = 1000ms
→ Latency explodes near 100% utilization
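The table above comes straight from the formula. A minimal sketch (mm1_time_in_system is a name invented here):
def mm1_time_in_system(lam, mu):
    """W = 1/(μ - λ); only defined for a stable system (λ < μ)."""
    if lam >= mu:
        raise ValueError("unstable: arrival rate >= service rate")
    return 1 / (mu - lam)

mu = 100  # service rate, req/s
for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
    lam = rho * mu
    print(f"ρ = {rho:.0%}: W = {mm1_time_in_system(lam, mu) * 1000:.0f}ms")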
M/M/c Model
Multiple servers:
c = number of servers
For a stable system: ρ = λ/(c×μ) < 1
Example: Connection pool
Requests: 200 req/s
Time per query: 20ms (μ = 50/s)
With 1 connection: ρ = 200/50 = 4 (unstable!)
With 4 connections: ρ = 200/(4×50) = 1 (at the limit!)
With 5 connections: ρ = 200/(5×50) = 0.8 (stable)
With 10 connections: ρ = 200/(10×50) = 0.4 (comfortable margin)
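The same arithmetic makes a handy pool-sizing check. A minimal sketch combining the stability condition with the 80% rule discussed later (both function names are invented here):
from math import ceil

def pool_utilization(lam, mu, c):
    """ρ = λ / (c × μ) for an M/M/c system."""
    return lam / (c * mu)

def min_pool_size(lam, mu, target_rho=0.8):
    """Smallest pool that keeps utilization at or below target_rho."""
    return ceil(lam / (mu * target_rho))

lam, mu = 200, 50  # 200 req/s, 20ms per query
for c in (1, 4, 5, 10):
    print(f"c={c}: ρ={pool_utilization(lam, mu, c)}")
print(f"Pool size for ρ ≤ 0.8: {min_pool_size(lam, mu)}")  # 5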
Universal Scalability Law (USL)
Model that captures scalability limits:
C(N) = N / (1 + σ(N-1) + κN(N-1))
N = number of processors/servers
σ = contention coefficient
κ = coherence coefficient
C(N) = relative capacity
Interpretation
σ (sigma): serialization overhead
- Critical sections, locks
- The higher σ, the worse the scaling
κ (kappa): coordination overhead
- Communication between nodes
- The higher κ, the worse the scaling
Scalability profiles
σ = 0, κ = 0: Linear (ideal)
N servers = N× capacity
σ > 0, κ = 0: Sublinear
Scales, but with diminishing returns
σ > 0, κ > 0: Retrograde
Maximum point exists, then degrades
Practical example:
System with σ = 0.1, κ = 0.01
N=1: C = 1.00
N=2: C = 1.79 (89% efficiency)
N=4: C = 2.82 (70% efficiency)
N=8: C = 3.54 (44% efficiency)
N=16: C = 3.27 (worse than 8!)
→ Capacity peaks at N* = √((1−σ)/κ) = √90 ≈ 9-10 servers for this system
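The curve and its peak are quick to reproduce. A minimal sketch (usl_capacity is a name invented here):
from math import sqrt

def usl_capacity(n, sigma, kappa):
    """Relative capacity C(N) under the Universal Scalability Law."""
    return n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

sigma, kappa = 0.1, 0.01
for n in (1, 2, 4, 8, 16):
    print(f"N={n}: C={usl_capacity(n, sigma, kappa):.2f}")

# The peak sits at N* = sqrt((1 - σ) / κ)
print(f"Peak at N ≈ {sqrt((1 - sigma) / kappa):.1f}")  # ≈ 9.5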
Applying Models
Step 1: Measure parameters
from statistics import mean

# Measure service rate (μ): use service times collected under low load,
# so queueing delay doesn't inflate them.
# collect_latencies is a placeholder for your metrics pipeline.
latencies = collect_latencies(sample_size=1000)
mu = 1 / mean(latencies)

# Measure arrival rate (λ)
arrivals = count_arrivals(window='1 minute')
lambda_ = arrivals / 60

# Calculate utilization
rho = lambda_ / mu
Step 2: Validate model
# Calculate the model's prediction
predicted_latency = 1 / (mu - lambda_)

# Compare with the observed latency
observed_latency = mean(latencies)
error = abs(predicted_latency - observed_latency) / observed_latency

if error > 0.2:
    print("Model doesn't apply well (>20% error)")
Step 3: Project
# How much load can it take while keeping latency < 100ms?
target_latency = 0.1  # 100ms
# W = 1/(μ-λ)  →  λ = μ - 1/W
max_lambda = mu - (1 / target_latency)
print(f"Maximum capacity: {max_lambda:.0f} req/s")
Model Limitations
1. Assumed distributions
M/M/c models assume:
- Poisson arrivals (exponential inter-arrival times)
- Exponentially distributed service times
Reality often differs:
- Bursty arrivals
- High-variance service times (see the sketch below)
- Dependencies between requests
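To see how much the exponential-service assumption matters, push the same arrival rate and the same mean service time through a single-server queue simulation and vary only the variance. A minimal sketch using the Lindley recursion (all names here are invented; outputs are approximate):
import random

def mean_wait(inter_arrival, service, n=200_000, seed=42):
    """Mean queueing delay via the Lindley recursion: W = max(0, W + S - A)."""
    rng = random.Random(seed)
    w = total = 0.0
    for _ in range(n):
        w = max(0.0, w + service(rng) - inter_arrival(rng))
        total += w
    return total / n

lam, mu = 80.0, 100.0                       # same rates as the M/M/1 example
poisson = lambda rng: rng.expovariate(lam)  # exponential inter-arrivals
exp_svc = lambda rng: rng.expovariate(mu)   # exponential service → M/M/1
# Same 10ms mean, but 90% fast / 10% very slow requests:
bursty_svc = lambda rng: 0.001 if rng.random() < 0.9 else 0.091

print(f"M/M/1 queueing delay:         ~{mean_wait(poisson, exp_svc) * 1000:.0f}ms")    # ~40ms
print(f"High-variance queueing delay: ~{mean_wait(poisson, bursty_svc) * 1000:.0f}ms")  # ~165ms
Same 80% utilization, same mean service time, roughly four times the queueing delay: the M/M/1 numbers turn optimistic once service times get heavy-tailed.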
2. Closed vs open system
Open system: arrivals are independent of how loaded the system is
Closed system: a fixed population of N users, each waiting for a response before issuing the next request
Wrong model = wrong prediction
3. Warm-up and transient states
The models assume steady state
Ignores:
- Cold start
- Cache warming
- JIT compilation
- Connection pooling
4. Dependencies
A single-component model ignores:
- Network latency
- External dependencies
- Contention on shared resources
Practical Simplified Models
When formal models are too complex, fall back on rules of thumb:
80% Rule
Don't operate above 80% utilization
Margin for spikes and variance
Safety factor
Needed capacity = Peak × 1.5
If expected peak = 1000 req/s
Provision for 1500 req/s
Linear extrapolation with margin
Current: 2 servers, 500 req/s
Goal: 2000 req/s
Linear: 2000/500 × 2 = 8 servers
With 50% margin: 12 servers
With coordination overhead: 15 servers
Tool: Quick Calculator
from math import ceil

def capacity_estimate(
    current_throughput: float,
    current_servers: int,
    target_throughput: float,
    efficiency: float = 0.7,  # assume 70% scaling efficiency
) -> int:
    """Estimate the number of servers needed to reach a target throughput."""
    throughput_per_server = current_throughput / current_servers
    ideal_servers = target_throughput / throughput_per_server
    real_servers = ideal_servers / efficiency
    return ceil(real_servers)

# Example
servers = capacity_estimate(
    current_throughput=1000,
    current_servers=4,
    target_throughput=5000,
)
print(f"Servers needed: {servers}")
# Output: Servers needed: 29
Conclusion
Performance models are powerful tools, but with limitations:
Use models for:
- Quick estimates before tests
- Initial capacity planning
- Identifying theoretical limits
- Comparing hypothetical scenarios
Don't use models for:
- Final decisions without validation
- Complex systems with many dependencies
- Performance guarantees
Recommended workflow:
1. Model → Initial estimate
2. Test → Model validation
3. Adjust → Refine parameters
4. Monitor → Continuous validation
"All models are wrong, but some are useful." — George Box