Performance modeling is the art of predicting system behavior without testing exhaustively. With the right models, you can answer questions like "how many users can this system handle?" without running expensive tests.
A model is the map, not the territory: useful for navigation, dangerous to trust blindly.
Why Model
Testing costs
Real load test:
- Infrastructure: $500-5000
- Preparation time: 2-5 days
- Execution: 1-4 hours
- Analysis: 1-2 days
Mathematical model:
- Calculation: minutes
- Cost: $0
- Iterations: unlimited
When modeling is worth it
✓ Capacity planning before buying infra
✓ Quick estimates for stakeholders
✓ Compare hypothetical scenarios
✓ Understand theoretical limits
✓ Validate intuition before testing
Fundamentals: Little's Law
The most useful law in performance modeling:
L = λ × W
L = average number of items in system
λ = arrival rate (throughput)
W = average time in system (latency)
Practical examples
1. Database connections
Throughput: 100 queries/s
Average latency: 50ms = 0.05s
Active connections = 100 × 0.05 = 5 connections
2. Server capacity
Each server handles 100 simultaneous connections
Latency per request: 200ms
Throughput per server = 100 / 0.2 = 500 req/s
3. Sizing infrastructure
Goal: 10,000 req/s
Expected latency: 100ms
Simultaneous connections = 10,000 × 0.1 = 1,000
If each server handles 200 connections:
Servers needed = 1,000 / 200 = 5
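All three calculations are the same identity rearranged. A minimal helper sketch (littles_law is a name invented here):
def littles_law(throughput=None, latency=None, concurrency=None):
    """Solve L = λ × W for whichever of the three variables is missing."""
    if concurrency is None:
        return throughput * latency      # L = λ × W
    if throughput is None:
        return concurrency / latency     # λ = L / W
    return concurrency / throughput      # W = L / λ

print(littles_law(throughput=100, latency=0.05))    # 5.0 connections
print(littles_law(concurrency=100, latency=0.2))    # 500.0 req/s
print(littles_law(throughput=10_000, latency=0.1))  # 1000.0 connections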
Queuing Theory
M/M/1 Model
System with one queue and one server:
λ = arrival rate
μ = service rate
ρ = λ/μ (utilization)
For a stable system: ρ < 1
Average time in system:
W = 1 / (μ - λ)
Example:
Arrivals: 80 req/s (λ)
Service: 100 req/s (μ)
Utilization: 80/100 = 80%
Time in system: 1/(100-80) = 50ms
Utilization vs Latency:
ρ = 50%: W = 1/(100-50) = 20ms
ρ = 80%: W = 1/(100-80) = 50ms
ρ = 90%: W = 1/(100-90) = 100ms
ρ = 95%: W = 1/(100-95) = 200ms
ρ = 99%: W = 1/(100-99) = 1000ms
→ Latency explodes near 100% utilization
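The table above comes straight from the formula. A minimal sketch (mm1_time_in_system is a name invented here):
def mm1_time_in_system(lam, mu):
    """W = 1/(μ - λ); only defined for a stable system (λ < μ)."""
    if lam >= mu:
        raise ValueError("unstable: arrival rate >= service rate")
    return 1 / (mu - lam)

mu = 100  # service rate, req/s
for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
    lam = rho * mu
    print(f"ρ = {rho:.0%}: W = {mm1_time_in_system(lam, mu) * 1000:.0f}ms")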
M/M/c Model
Multiple servers:
c = number of servers
For a stable system: ρ = λ/(c×μ) < 1
Example: Connection pool
Requests: 200 req/s
Time per query: 20ms (μ = 50/s)
With 1 connection: ρ = 200/50 = 4 (unstable!)
With 4 connections: ρ = 200/(4×50) = 1 (at the limit!)
With 5 connections: ρ = 200/(5×50) = 0.8 (stable)
With 10 connections: ρ = 200/(10×50) = 0.4 (comfortable margin)
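The same arithmetic makes a handy pool-sizing check. A minimal sketch combining the stability condition with the 80% rule discussed later (both function names are invented here):
from math import ceil

def pool_utilization(lam, mu, c):
    """ρ = λ / (c × μ) for an M/M/c system."""
    return lam / (c * mu)

def min_pool_size(lam, mu, target_rho=0.8):
    """Smallest pool that keeps utilization at or below target_rho."""
    return ceil(lam / (mu * target_rho))

lam, mu = 200, 50  # 200 req/s, 20ms per query
for c in (1, 4, 5, 10):
    print(f"c={c}: ρ={pool_utilization(lam, mu, c)}")
print(f"Pool size for ρ ≤ 0.8: {min_pool_size(lam, mu)}")  # 5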
Universal Scalability Law (USL)
Model that captures scalability limits:
C(N) = N / (1 + σ(N-1) + κN(N-1))
N = number of processors/servers
σ = contention coefficient
κ = coherence coefficient
C(N) = relative capacity
Interpretation
σ (sigma): serialization overhead
- Critical sections, locks
- The higher σ, the worse the scaling
κ (kappa): coordination overhead
- Communication between nodes
- The higher κ, the worse the scaling
Scalability profiles
σ = 0, κ = 0: Linear (ideal)
N servers = N× capacity
σ > 0, κ = 0: Sublinear
Scales, but with diminishing returns
σ > 0, κ > 0: Retrograde
Maximum point exists, then degrades
Practical example:
System with σ = 0.1, κ = 0.01
N=1: C = 1.00
N=2: C = 1.79 (89% efficiency)
N=4: C = 2.82 (70% efficiency)
N=8: C = 3.54 (44% efficiency)
N=16: C = 3.27 (worse than 8!)
→ Capacity peaks at N* = √((1−σ)/κ) = √90 ≈ 9-10 servers for this system
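The curve and its peak are quick to reproduce. A minimal sketch (usl_capacity is a name invented here):
from math import sqrt

def usl_capacity(n, sigma, kappa):
    """Relative capacity C(N) under the Universal Scalability Law."""
    return n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

sigma, kappa = 0.1, 0.01
for n in (1, 2, 4, 8, 16):
    print(f"N={n}: C={usl_capacity(n, sigma, kappa):.2f}")

# The peak sits at N* = sqrt((1 - σ) / κ)
print(f"Peak at N ≈ {sqrt((1 - sigma) / kappa):.1f}")  # ≈ 9.5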
Applying Models
Step 1: Measure parameters
from statistics import mean

# Measure service rate (μ): use service times collected under low load,
# so queueing delay doesn't inflate them.
# collect_latencies is a placeholder for your metrics pipeline.
latencies = collect_latencies(sample_size=1000)
mu = 1 / mean(latencies)

# Measure arrival rate (λ)
arrivals = count_arrivals(window='1 minute')
lambda_ = arrivals / 60

# Calculate utilization
rho = lambda_ / mu
Step 2: Validate model
# Calculate the model's prediction
predicted_latency = 1 / (mu - lambda_)

# Compare with the observed latency
observed_latency = mean(latencies)
error = abs(predicted_latency - observed_latency) / observed_latency

if error > 0.2:
    print("Model doesn't apply well (>20% error)")
Step 3: Project
# How much load can it take while keeping latency < 100ms?
target_latency = 0.1  # 100ms
# W = 1/(μ-λ)  →  λ = μ - 1/W
max_lambda = mu - (1 / target_latency)
print(f"Maximum capacity: {max_lambda:.0f} req/s")
Model Limitations
1. Assumed distributions
M/M/c models assume:
- Poisson arrivals (exponential inter-arrival times)
- Exponentially distributed service times
Reality often differs:
- Bursty arrivals
- High-variance service times (see the sketch below)
- Dependencies between requests
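To see how much the exponential-service assumption matters, push the same arrival rate and the same mean service time through a single-server queue simulation and vary only the variance. A minimal sketch using the Lindley recursion (all names here are invented; outputs are approximate):
import random

def mean_wait(inter_arrival, service, n=200_000, seed=42):
    """Mean queueing delay via the Lindley recursion: W = max(0, W + S - A)."""
    rng = random.Random(seed)
    w = total = 0.0
    for _ in range(n):
        w = max(0.0, w + service(rng) - inter_arrival(rng))
        total += w
    return total / n

lam, mu = 80.0, 100.0                       # same rates as the M/M/1 example
poisson = lambda rng: rng.expovariate(lam)  # exponential inter-arrivals
exp_svc = lambda rng: rng.expovariate(mu)   # exponential service → M/M/1
# Same 10ms mean, but 90% fast / 10% very slow requests:
bursty_svc = lambda rng: 0.001 if rng.random() < 0.9 else 0.091

print(f"M/M/1 queueing delay:         ~{mean_wait(poisson, exp_svc) * 1000:.0f}ms")    # ~40ms
print(f"High-variance queueing delay: ~{mean_wait(poisson, bursty_svc) * 1000:.0f}ms")  # ~165ms
Same 80% utilization, same mean service time, roughly four times the queueing delay: the M/M/1 numbers turn optimistic once service times get heavy-tailed.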
2. Closed vs open system
Open system: arrivals are independent of how loaded the system is
Closed system: a fixed population of N users, each waiting for a response before issuing the next request
Wrong model = wrong prediction
3. Warm-up and transient states
The models assume steady state
Ignores:
- Cold start
- Cache warming
- JIT compilation
- Connection pooling
4. Dependencies
A single-component model ignores:
- Network latency
- External dependencies
- Contention on shared resources
Practical Simplified Models
When formal models are too complex, fall back on rules of thumb:
80% Rule
Don't operate above 80% utilization
Margin for spikes and variance
Safety factor
Needed capacity = Peak × 1.5
If expected peak = 1000 req/s
Provision for 1500 req/s
Linear extrapolation with margin
Current: 2 servers, 500 req/s
Goal: 2000 req/s
Linear: 2000/500 × 2 = 8 servers
With 50% margin: 12 servers
With coordination overhead: 15 servers
Tool: Quick Calculator
from math import ceil

def capacity_estimate(
    current_throughput: float,
    current_servers: int,
    target_throughput: float,
    efficiency: float = 0.7,  # assume 70% scaling efficiency
) -> int:
    """Estimate the number of servers needed to reach a target throughput."""
    throughput_per_server = current_throughput / current_servers
    ideal_servers = target_throughput / throughput_per_server
    real_servers = ideal_servers / efficiency
    return ceil(real_servers)

# Example
servers = capacity_estimate(
    current_throughput=1000,
    current_servers=4,
    target_throughput=5000,
)
print(f"Servers needed: {servers}")
# Output: Servers needed: 29
Conclusion
Performance models are powerful tools, but with limitations:
Use models for:
- Quick estimates before tests
- Initial capacity planning
- Identifying theoretical limits
- Comparing hypothetical scenarios
Don't use models for:
- Final decisions without validation
- Complex systems with many dependencies
- Performance guarantees
Recommended workflow:
1. Model → Initial estimate
2. Test → Model validation
3. Adjust → Refine parameters
4. Monitor → Continuous validation
"All models are wrong, but some are useful." — George Box