"The test showed 150ms latency." Ok, but what does that mean? Is it good? Bad? Compared to what? Performance test numbers without context and interpretation are just digits. This article teaches how to extract real meaning from results.
Data isn't insight. Interpretation transforms data into knowledge.
The Problem with Raw Numbers
Numbers without context
Test result:
- Latency p95: 250ms
- Throughput: 1500 req/s
- Error rate: 0.5%
Questions numbers don't answer:
- Does this meet requirements?
- Compared to baseline, did it improve or worsen?
- Which endpoints contributed?
- Was it stable or did it vary during the test?
- Was the test environment representative?
The risk of superficial interpretation
Scenario:
"Average latency: 100ms. Test passed!"
Reality:
- p50: 50ms
- p95: 200ms
- p99: 2000ms
- Max: 30s
Real conclusion:
The average hides that 1% of users have a terrible experience (>2s)
Interpretation Framework
1. Context first
Before looking at numbers, ask:
- What was the test objective?
- What load was applied vs expected?
- What environment was used?
- How long did it run?
- Was the test data realistic?
2. Comparison with baseline
Isolated result:
p95 = 200ms
With baseline:
Baseline: p95 = 180ms
Test: p95 = 200ms
Δ: +11%
Interpretation:
11% regression - investigate or accept?
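A minimal sketch of this comparison in Python (the function name and the 10% regression threshold are illustrative assumptions, not part of the methodology):

```python
def compare_to_baseline(baseline_ms: float, current_ms: float,
                        threshold_pct: float = 10.0) -> str:
    """Compare a current p95 against a baseline p95 and flag regressions."""
    delta_pct = (current_ms - baseline_ms) / baseline_ms * 100
    verdict = "REGRESSION" if delta_pct > threshold_pct else "OK"
    return f"baseline={baseline_ms}ms current={current_ms}ms delta={delta_pct:+.1f}% -> {verdict}"

print(compare_to_baseline(180, 200))
# baseline=180ms current=200ms delta=+11.1% -> REGRESSION
```

Whether an 11% delta is acceptable is still a human decision; the code only makes the delta visible and consistent across runs.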
3. Distribution, not average
Always use:
- p50 (median) - typical experience
- p95 - most users
- p99 - worst case that still occurs regularly
- Max - outliers
Never trust only:
- Average (hides variance)
- Min (irrelevant)
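As a sketch, the percentile summary can be computed with the standard library alone; the toy sample below is invented to show how the average hides the tail:

```python
import statistics

def summarize(latencies_ms: list[float]) -> dict[str, float]:
    """Summarize a latency sample with percentiles instead of a single average."""
    cuts = statistics.quantiles(latencies_ms, n=100)  # 99 cut points: cuts[k-1] ~ p(k)
    return {
        "p50": statistics.median(latencies_ms),
        "p95": cuts[94],
        "p99": cuts[98],
        "max": max(latencies_ms),
        "avg": statistics.fmean(latencies_ms),  # reported only for contrast
    }

# Toy distribution with a heavy tail: most requests are fast, a few are terrible.
sample = [50.0] * 90 + [200.0] * 9 + [2000.0]
print(summarize(sample))  # avg is ~83ms while p99 is close to 2s
```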
4. Trend over time
Constant latency:
───────────────────
Good: Stable system
Increasing latency:
╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱
Problem: Memory leak? Queue buildup?
Latency with spikes:
─╲─╱─╲─╱─╲─╱─╲─╱─
Investigate: GC? Background jobs? External API?
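A rough sketch of classifying these shapes automatically, assuming you already have a per-minute p95 series (the slope and spread thresholds are arbitrary illustrations, and statistics.linear_regression requires Python 3.10+):

```python
import statistics

def latency_trend(p95_per_minute: list[float]) -> str:
    """Classify a latency series as stable, increasing, or spiky with simple heuristics."""
    xs = range(len(p95_per_minute))
    slope = statistics.linear_regression(xs, p95_per_minute).slope  # ms per minute
    spread = statistics.pstdev(p95_per_minute) / statistics.fmean(p95_per_minute)
    if slope > 1.0:    # sustained growth: suspect a leak or queue buildup
        return f"increasing (+{slope:.1f} ms/min)"
    if spread > 0.25:  # high relative variation: suspect GC, background jobs, external APIs
        return "spiky"
    return "stable"

print(latency_trend([120, 122, 119, 121, 125, 118]))  # stable
print(latency_trend([120, 140, 165, 190, 220, 250]))  # increasing
```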
Analyzing by Layer
Endpoint analysis
Aggregated result:
p95 = 300ms
By endpoint:
GET /api/products: p95 = 100ms ✓
GET /api/product/:id: p95 = 150ms ✓
POST /api/checkout: p95 = 2s ✗ ← Problem!
Insight:
The checkout endpoint needs focused attention
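A sketch of producing that breakdown from raw samples, assuming results are available as (endpoint, latency_ms) pairs; the endpoints and numbers below are made up:

```python
from collections import defaultdict
import statistics

def p95_by_endpoint(samples: list[tuple[str, float]]) -> dict[str, float]:
    """Group raw (endpoint, latency_ms) samples and compute p95 per endpoint."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for endpoint, latency in samples:
        grouped[endpoint].append(latency)
    return {ep: statistics.quantiles(lat, n=100)[94] for ep, lat in grouped.items()}

# Toy data: checkout is the slow endpoint hidden inside an "acceptable" aggregate.
samples = [("GET /api/products", 90.0 + i % 20) for i in range(200)]
samples += [("POST /api/checkout", 1500.0 + i * 5) for i in range(200)]
for endpoint, p95 in sorted(p95_by_endpoint(samples).items(), key=lambda kv: -kv[1]):
    print(f"{endpoint:25s} p95 = {p95:.0f} ms")
```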
Component analysis
End-to-end result:
p95 = 500ms
Breakdown:
API Gateway: 50ms (10%)
Auth Service: 80ms (16%)
Order Service: 120ms (24%)
DB Query: 200ms (40%) ← Largest contributor
Serialization: 50ms (10%)
Insight:
DB optimization will have greatest impact
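A minimal sketch that ranks components by their share of the end-to-end time; the figures are the ones from the example above:

```python
def contribution_report(components_ms: dict[str, float]) -> None:
    """Print each component's share of the end-to-end time, largest first."""
    total = sum(components_ms.values())
    for name, ms in sorted(components_ms.items(), key=lambda kv: -kv[1]):
        print(f"{name:15s} {ms:6.0f} ms  ({ms / total:5.1%})")

contribution_report({
    "API Gateway": 50, "Auth Service": 80, "Order Service": 120,
    "DB Query": 200, "Serialization": 50,
})
# DB Query comes out on top -> optimizing it has the greatest impact.
```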
Resource analysis
Performance metrics OK, but:
CPU: 95%
Memory: 7.8GB / 8GB
DB Connections: 95 / 100
Interpretation:
The system is operating at its limit
No headroom for growth
The next spike may cause failures
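A sketch of an automated headroom check; the 80% warning threshold and the metric names are assumptions, not a standard:

```python
def headroom_check(utilization: dict[str, float], warn_at: float = 0.80) -> list[str]:
    """Flag resources whose utilization leaves little room for growth."""
    return [
        f"{name}: {value:.0%} used, only {1 - value:.0%} headroom"
        for name, value in utilization.items()
        if value >= warn_at
    ]

for warning in headroom_check({"cpu": 0.95, "memory": 7.8 / 8, "db_connections": 95 / 100}):
    print("WARNING:", warning)
```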
Identifying Patterns
Important correlations
Latency rises when:
- CPU > 80%? → CPU-bound
- Memory > 90%? → GC stress
- Connections > 80%? → Connection starvation
- Throughput increases? → Saturation
Errors appear when:
- Timeout? → Slow dependency
- 5xx? → Application failure
- Connection refused? → Pool exhausted
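One way to sketch these correlations is Pearson correlation over aligned per-interval series (statistics.correlation requires Python 3.10+; the series and the 0.7 threshold are invented for illustration):

```python
import statistics

def correlated_resources(latency: list[float], resources: dict[str, list[float]],
                         threshold: float = 0.7) -> list[str]:
    """Return resource series whose correlation with latency exceeds the threshold."""
    flagged = []
    for name, series in resources.items():
        r = statistics.correlation(latency, series)  # Pearson's r
        if abs(r) >= threshold:
            flagged.append(f"{name} (r={r:+.2f})")
    return flagged

p95_per_min = [110, 130, 180, 240, 320, 410]
print(correlated_resources(p95_per_min, {
    "cpu_pct":      [45, 55, 70, 82, 91, 97],   # rises with latency -> CPU-bound suspect
    "disk_io_mbps": [30, 31, 29, 30, 31, 29],   # flat -> unlikely culprit
}))
```

Correlation only points at suspects; confirming causation still needs a targeted follow-up test.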
Red flags
Watch for:
- High variance (p99/p50 > 10x)
- Errors increasing over time
- Latency that doesn't stabilize
- Throughput dropping during test
- Resources at 100% (any of them)
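A tiny sketch of the variance red flag; the 10x ratio is the rule of thumb above, nothing more:

```python
def variance_red_flag(p50_ms: float, p99_ms: float, max_ratio: float = 10.0) -> bool:
    """Flag suspiciously high tail variance: p99 more than `max_ratio` times the median."""
    return p99_ms / p50_ms > max_ratio

print(variance_red_flag(p50_ms=50, p99_ms=2000))  # True  -> 40x spread, investigate the tail
print(variance_red_flag(p50_ms=80, p99_ms=400))   # False -> 5x spread, within tolerance
```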
False positives
Be careful with:
- First minute (warm-up)
- Single spikes (may be outliers)
- Non-representative environment
- Cache that is unrealistically warm or cold
- Unrealistic test data
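A sketch of discarding warm-up before any analysis, assuming samples arrive as (elapsed_seconds, latency_ms) pairs; the 2-minute cutoff is an assumption to adjust per system:

```python
def drop_warmup(samples: list[tuple[float, float]], warmup_s: float = 120.0) -> list[float]:
    """Keep only steady-state latencies by discarding the first `warmup_s` seconds."""
    return [latency for elapsed, latency in samples if elapsed >= warmup_s]

# Synthetic run: slow responses while caches and JIT warm up, then steady state.
raw = [(t, 800.0 if t < 120 else 150.0) for t in range(0, 600, 10)]
steady = drop_warmup(raw)
print(len(raw), "raw samples ->", len(steady), "steady-state samples")
```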
Results Reporting
Recommended structure
# Test Report - [Name]
## 1. Executive Summary
- Status: PASSED / FAILED / WITH CAVEATS
- Validated capacity: X req/s
- Key findings: [bullets]
- Recommended actions: [bullets]
## 2. Test Context
- Date: [when]
- Environment: [where]
- Load: [how much]
- Duration: [time]
- Scenario: [description]
## 3. Results vs Criteria
| Metric | Criterion | Result | Status |
|--------|-----------|--------|--------|
| p95 latency | < 500ms | 380ms | ✓ |
| Error rate | < 1% | 0.3% | ✓ |
| Throughput | > 1000 req/s | 1250 | ✓ |
## 4. Detailed Analysis
### By Endpoint
[Table with breakdown]
### By Component
[Table with breakdown]
### Temporal Trend
[Graph and observations]
## 5. Observations and Risks
- [Observation 1]
- [Identified risk]
## 6. Recommendations
- [Action 1 - high priority]
- [Action 2 - medium priority]
## 7. Next Steps
- [What to do with these results]
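A sketch of generating section 3 of that report automatically; the helper name and the example rows are illustrative:

```python
def criteria_table(rows: list[tuple[str, str, str, bool]]) -> str:
    """Render the 'Results vs Criteria' section as a markdown table.

    Each row is (metric, criterion, result, passed).
    """
    lines = ["| Metric | Criterion | Result | Status |",
             "|--------|-----------|--------|--------|"]
    lines += [f"| {metric} | {criterion} | {result} | {'✓' if ok else '✗'} |"
              for metric, criterion, result, ok in rows]
    return "\n".join(lines)

print(criteria_table([
    ("p95 latency", "< 500ms", "380ms", True),
    ("Error rate", "< 1%", "0.3%", True),
    ("Throughput", "> 1000 req/s", "1250 req/s", True),
]))
```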
Essential visualizations
Graphs that help:
1. Latency over time:
- Shows stability
- Reveals trends
2. Latency distribution (histogram):
- Shows dispersion
- Identifies multimodality
3. Throughput vs Latency:
- Shows saturation point
- Identifies correlation
4. Resources vs Time:
- CPU, Memory, Connections
- Correlates with performance
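A sketch of the first two charts with matplotlib and synthetic data (any plotting stack works; nothing here is specific to a tool):

```python
import random
import matplotlib.pyplot as plt

random.seed(7)
minutes = list(range(60))
p95_ms = [150 + m * 1.5 + random.gauss(0, 10) for m in minutes]        # slow upward drift
latencies_ms = [random.lognormvariate(4.5, 0.6) for _ in range(5000)]  # right-skewed sample

fig, (ax_trend, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))
ax_trend.plot(minutes, p95_ms)
ax_trend.set(title="p95 latency over time", xlabel="minute", ylabel="ms")
ax_hist.hist(latencies_ms, bins=50)
ax_hist.set(title="Latency distribution", xlabel="ms", ylabel="requests")
fig.tight_layout()
plt.show()
```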
Common Interpretations
"The test passed"
Follow-up question:
- With what margin?
- What's the headroom?
- Close to the limit?
Example:
Criterion: p95 < 500ms
Result: p95 = 480ms
Technically passed, but:
- 4% margin
- Any degradation = failure
- Recommendation: optimize before production
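A small sketch that makes the margin explicit for both outcomes (the function name is illustrative); it answers "with what margin?" for a pass and "by how much?" for a failure:

```python
def pass_margin(criterion_ms: float, result_ms: float) -> str:
    """Quantify how close a result is to its latency criterion (lower is better)."""
    margin_pct = (criterion_ms - result_ms) / criterion_ms * 100
    if margin_pct >= 0:
        return f"passed with {margin_pct:.0f}% margin ({result_ms}ms vs < {criterion_ms}ms)"
    return f"failed, exceeding the criterion by {-margin_pct:.0f}% ({result_ms}ms vs < {criterion_ms}ms)"

print(pass_margin(500, 480))  # passed with 4% margin  -> any degradation breaks the criterion
print(pass_margin(500, 650))  # failed, exceeding the criterion by 30%
```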
"The test failed"
Don't stop at failure:
- Where did it fail first?
- How much to pass?
- Was it consistent or intermittent?
Example:
Criterion: p95 < 500ms
Result: p95 = 650ms
Analysis:
- Failed by 30%
- Bottleneck: DB (contributes 400ms)
- Optimizing the DB may resolve it
"Inconsistent results"
High variance indicates:
- Unstable environment
- Non-deterministic GC
- External dependencies
- Non-uniform load
Action:
- Run more times
- Increase duration
- Isolate variables
- Investigate spikes
Interpretation Anti-Patterns
1. Cherry-picking
❌ "The best result was 100ms"
✅ "Median was 150ms, best was 100ms, worst was 2s"
2. Ignoring warm-up
❌ Including first 2 minutes in analysis
✅ Discard warm-up, analyze steady-state
3. Comparing incomparables
❌ "Prod has 200ms, test has 300ms, 50% regression"
(if the test environment is different)
✅ "Comparing same environment, before/after the change"
4. Average as truth
❌ "100ms average, excellent!"
(When p99 is 5s)
✅ "p50=80ms, p95=200ms, p99=5s - outliers are a problem"
Conclusion
Interpreting results correctly requires:
- Context - compare with baseline and requirements
- Distribution - percentiles, not averages
- Trend - behavior over time
- Breakdown - by endpoint and component
- Correlation - performance vs resources
Numbers are the beginning, not the end. The value is in the insight you extract.
The test generates data. You generate knowledge.
This article is part of the series on the OCTOPUS Performance Engineering methodology.