"The system handles 2000 users. We expect 5000 on Black Friday." This isn't a technical problem — it's a business decision. Do we invest to scale? Accept degradation risk? Limit sales? This article teaches how to transform technical results into executive options.
Capacity is a business decision, not a technical constant.
The Executive Nature of Stress Testing
What engineering delivers
Technical result:
- Current capacity: 2000 req/s
- Breaking point: 2800 req/s
- Bottleneck: Database connections
- Failure mode: Graceful degradation
This answers: "What can the system do?"
What business decides
Executive decision:
- What capacity do we need?
- How much to invest to achieve it?
- What risk is acceptable?
- What trade-off to choose?
This answers: "What do we want the system to do?"
Decision Framework
Mapping options
Situation: Current capacity 2000 users, need 5000 users
Option A - Scale:
Cost: $50K in infra + 2 sprints
Result: Capacity for 6000
Risk: Low
Trade-off: Upfront investment
Option B - Optimize:
Cost: 2 engineering sprints
Result: Capacity for 4000
Risk: Medium (may not be enough)
Trade-off: Engineering time
Option C - Accept degradation:
Cost: Implement rate limiting + queues
Result: Serve 2000 with quality, rest in queue
Risk: High (degraded experience)
Trade-off: Impaired UX
Option D - Limit demand:
Cost: Less marketing promotion
Result: Demand within capacity
Risk: Low
Trade-off: Potential revenue lost
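The four options above can be captured as data, which makes it easy to check each one against the forecast. This is a minimal sketch: the `Option` record and its field names are hypothetical, and the capacity of 2000 for options C and D is an assumption based on the fact that neither adds capacity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    capacity: int    # users served with quality after the option is applied
    risk: str        # "Low", "Medium", or "High"
    trade_off: str   # what the business gives up


# Values copied from the option list. Capacity for C and D is assumed to
# stay at the current 2000, because neither option adds capacity.
OPTIONS = [
    Option("A - Scale", 6000, "Low", "Upfront investment"),
    Option("B - Optimize", 4000, "Medium", "Engineering time"),
    Option("C - Accept degradation", 2000, "High", "Impaired UX"),
    Option("D - Limit demand", 2000, "Low", "Potential revenue lost"),
]


def options_meeting_demand(options, required_capacity):
    """Return the options whose resulting capacity covers the forecast."""
    return [o for o in options if o.capacity >= required_capacity]


if __name__ == "__main__":
    for option in options_meeting_demand(OPTIONS, required_capacity=5000):
        print(f"{option.name}: capacity {option.capacity}, risk {option.risk}")
```

Only Option A covers a forecast of 5000 on capacity alone. The other options stay on the table because they change the question: B narrows the gap, C manages it, D reduces the demand.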
Decision criteria
Questions for stakeholders:
1. Cost of failure:
"If the system goes down for 2 hours at peak,
how much do we lose?"
→ Defines maximum budget for mitigation
2. Peak probability:
"How confident are we in the 5000 user forecast?"
→ Adjusts risk tolerance
3. Degradation tolerance:
"Will users accept 30s wait at checkout?"
→ Enables graceful degradation options
4. Brand impact:
"What's the reputational damage of public failure?"
→ May justify larger investment
Communicating Results to Executives
What doesn't work
❌ "P95 latency rises to 2s above 2500 req/s"
→ Executive doesn't know what to do with this
❌ "We need 4 more EC2 c5.2xlarge instances"
→ No context of why or alternatives
❌ "The system will crash if there's too much load"
→ Alarmist and vague
What works
✅ "With expected Black Friday load, 60% of customers
will have degraded experience (checkout > 10s)
or won't be able to complete purchase."
✅ "We have 3 options:
- Invest $50K now to guarantee capacity
- Spend 2 sprints optimizing (risk of not solving)
- Accept ~$200K loss in sales at peak"
✅ "We recommend option A for ROI: $50K to avoid
potential loss of $200K+. Decision needed by
[date] for implementation."
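Moving from technical numbers to customer impact is mostly arithmetic. The sketch below assumes that every user above capacity is affected, which is a simplification, since real degradation is rarely that linear. The function names are hypothetical.

```python
def share_of_users_affected(expected_users: int, capacity: int) -> float:
    """Fraction of expected users that exceed what the system can serve."""
    if expected_users <= 0:
        raise ValueError("expected_users must be positive")
    overflow = max(0, expected_users - capacity)
    return overflow / expected_users


def executive_summary(expected_users: int, capacity: int) -> str:
    """Phrase the capacity gap as customer impact instead of latency numbers."""
    share = share_of_users_affected(expected_users, capacity)
    if share == 0:
        return "Expected load fits within current capacity."
    return (
        f"With expected load, about {share:.0%} of customers may have a "
        "degraded experience or fail to complete a purchase."
    )


if __name__ == "__main__":
    print(executive_summary(expected_users=5000, capacity=2000))
```

With the article's figures (5000 expected, 2000 capacity) the share comes out at 60%.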
Executive presentation template
# Capacity Decision Brief - Black Friday 2024
## Situation
- Expected load: 5000 simultaneous users
- Current capacity: 2000 users
- Gap: 3000 users (60% of expected load)
## Impact if no action
- 60% of users with degraded experience
- Estimated loss: $200-400K in sales
- Reputational risk: Social media complaints
## Options
| Option | Investment | Result | Risk |
|--------|------------|--------|------|
| A: Scale infra | $50K | 100% capacity | Low |
| B: Optimization | 2 sprints ($30K) | 80% capacity | Medium |
| C: Rate limiting | 1 sprint ($15K) | Controlled queue | High |
| D: Reduce marketing | $0 | Lower demand | Low |
## Recommendation
Option A: Best ROI, a $50K investment against an estimated $200-400K in lost sales
## Timeline
- Decision needed: [date - 30 days]
- Implementation: 2 weeks
- Validation test: 1 week
- Safety margin: 1 week
## Next step
Budget approval to start provisioning
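The decision deadline in the brief comes from working backwards from the event date. Here is a minimal sketch using Python's standard `datetime` module and the phase durations from the timeline above. The event date in the usage example is a placeholder assumption, not a figure from the brief.

```python
from datetime import date, timedelta

# Phase durations from the brief's timeline, in weeks.
IMPLEMENTATION_WEEKS = 2
VALIDATION_WEEKS = 1
SAFETY_MARGIN_WEEKS = 1


def decision_deadline(event_date: date) -> date:
    """Latest decision date that still leaves room for every phase."""
    total_weeks = IMPLEMENTATION_WEEKS + VALIDATION_WEEKS + SAFETY_MARGIN_WEEKS
    return event_date - timedelta(weeks=total_weeks)


if __name__ == "__main__":
    # Placeholder event date, used only for illustration.
    event = date(2024, 11, 29)
    print("Decision needed by:", decision_deadline(event).isoformat())
```

The three phases add up to four weeks, which is where the brief's "roughly 30 days before the event" comes from.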
Quantifying Risk
Impact model
Scenario: Black Friday without scaling
Probability of peak > capacity: 70%
If it occurs:
- Users affected: 3000 per hour (60% of peak)
- Normal conversion rate: 3%
- Degraded rate: 0.5%
- Average ticket: $100
- Peak duration: 4 hours
Calculation:
Lost conversions per hour: 3000 × (3% - 0.5%) = 75
Revenue lost per hour: 75 × $100 = $7,500
Total revenue lost: $7,500 × 4 = $30,000
Adjusted by probability: $30,000 × 70% = $21,000
Conclusion:
An investment of $15K to avoid an expected loss of $21,000 is justified
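The same calculation can be written as a small function, so the inputs can be changed as forecasts are revised. This sketch assumes the 3000 affected users are per hour of peak, which is how the calculation above treats them. The function and parameter names are illustrative.

```python
def expected_revenue_loss(
    users_affected_per_hour: int,
    normal_conversion: float,
    degraded_conversion: float,
    average_ticket: float,
    peak_hours: float,
    probability: float,
) -> float:
    """Probability-weighted revenue lost when a peak exceeds capacity."""
    lost_conversions_per_hour = users_affected_per_hour * (
        normal_conversion - degraded_conversion
    )
    loss_per_hour = lost_conversions_per_hour * average_ticket
    return loss_per_hour * peak_hours * probability


if __name__ == "__main__":
    loss = expected_revenue_loss(
        users_affected_per_hour=3000,
        normal_conversion=0.03,
        degraded_conversion=0.005,
        average_ticket=100,
        peak_hours=4,
        probability=0.70,
    )
    print(f"Expected loss: ${loss:,.0f}")
```

Running the model with a range of probabilities or peak durations shows stakeholders how sensitive the result is to the forecast, which is usually more persuasive than a single number.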
Risk matrix
           │ Low Impact    │ High Impact
───────────┼───────────────┼──────────────
High       │ Monitor       │ ACT NOW
Probability│               │
───────────┼───────────────┼──────────────
Low        │ Accept        │ Contingency
Probability│               │ plan
Example:
- Black Friday: High prob. + High impact → ACT
- Viral bug: Low prob. + High impact → Contingency
- Daily peak: High prob. + Low impact → Monitor
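The matrix is a lookup from two yes/no judgments to an action. This minimal sketch treats probability and impact as booleans, which is an assumption: in practice each team has to agree on where "high" starts.

```python
def risk_action(high_probability: bool, high_impact: bool) -> str:
    """Map a (probability, impact) pair to the matrix's recommended action."""
    matrix = {
        (True, True): "Act now",
        (True, False): "Monitor",
        (False, True): "Contingency plan",
        (False, False): "Accept",
    }
    return matrix[(high_probability, high_impact)]


if __name__ == "__main__":
    # The three examples from the text: (high probability?, high impact?)
    scenarios = {
        "Black Friday": (True, True),
        "Viral bug": (False, True),
        "Daily peak": (True, False),
    }
    for name, (probability, impact) in scenarios.items():
        print(f"{name}: {risk_action(probability, impact)}")
```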
Common Trade-offs
Performance vs Cost
Option A: 10 small servers
- Cost: $500/month
- Capacity: 3000 req/s
- Latency: p95 = 500ms
Option B: 3 large servers
- Cost: $600/month
- Capacity: 3000 req/s
- Latency: p95 = 200ms
Decision: How much is 300ms of latency worth?
- If it increases conversion by 0.5% and that uplift is worth more than the extra $100/month: choose B
- If it doesn't affect conversion: choose A
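The comparison comes down to whether the revenue gained from lower latency exceeds the price difference between the two options. This sketch makes two assumptions: the 0.5% is read as a relative uplift in revenue, and `monthly_revenue` is a hypothetical input that does not appear in the article.

```python
def faster_option_pays_off(
    monthly_revenue: float,
    relative_uplift: float,
    cost_a: float = 500.0,   # Option A: 10 small servers, $/month
    cost_b: float = 600.0,   # Option B: 3 large servers, $/month
) -> bool:
    """True when the extra revenue from lower latency exceeds the extra cost."""
    extra_cost = cost_b - cost_a
    extra_revenue = monthly_revenue * relative_uplift
    return extra_revenue > extra_cost


if __name__ == "__main__":
    # Hypothetical revenue figure, used only to illustrate the comparison.
    print(faster_option_pays_off(monthly_revenue=50_000, relative_uplift=0.005))
    print(faster_option_pays_off(monthly_revenue=50_000, relative_uplift=0.0))
```

Because the two options differ by only $100/month, a small uplift is enough to justify B. The harder part is measuring whether the latency change moves conversion at all.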
Capacity vs Development time
Option A: Scale hardware
- Time: 1 week
- Recurring cost: $2K/month
- Sustainable: limited
Option B: Optimize code
- Time: 4 weeks
- Recurring cost: $0
- Sustainable: indefinitely
Decision:
- If the event is in 2 weeks: Option A
- If the event is in 2 months: Option B
- If the event is close and load will keep growing afterwards: A now, B later
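The first filter is lead time: an option that cannot be finished before the event is not an option. This minimal sketch uses the durations above. It only checks feasibility and leaves the final choice (including "A now, B later") to the decision maker.

```python
from typing import List

# Lead times from the comparison above, in weeks.
SCALE_WEEKS = 1
OPTIMIZE_WEEKS = 4


def feasible_options(weeks_until_event: float) -> List[str]:
    """List the options that can be completed before the event."""
    options = []
    if weeks_until_event >= SCALE_WEEKS:
        options.append("A: Scale hardware")
    if weeks_until_event >= OPTIMIZE_WEEKS:
        options.append("B: Optimize code")
    return options


if __name__ == "__main__":
    print(feasible_options(2))   # event in 2 weeks
    print(feasible_options(8))   # event in about 2 months
```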
Availability vs Cost
99.9% availability:
- Downtime: about 8.8 hours/year
- Cost: $10K/month
- Complexity: Low
99.99% availability:
- Downtime: 52 minutes/year
- Cost: $50K/month
- Complexity: High
Decision:
- Cost of downtime/hour: $X
- Extra cost: $40K/month = $480K/year
- Downtime avoided: about 7.9 hours/year
- Break-even: when $480K/year < 7.9h × $X
- If $X > about $61K/hour: 99.99% is worth it
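Costs here are monthly and downtime is annual, so the break-even is easiest to check with both sides converted to a yearly basis. This sketch uses the figures above; the function names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760


def downtime_hours_per_year(availability: float) -> float:
    """Expected downtime for an availability target such as 0.999."""
    return (1 - availability) * HOURS_PER_YEAR


def break_even_cost_per_hour(
    availability_low: float,
    monthly_cost_low: float,
    availability_high: float,
    monthly_cost_high: float,
) -> float:
    """Downtime cost per hour above which the higher tier pays for itself."""
    extra_cost_per_year = (monthly_cost_high - monthly_cost_low) * 12
    hours_avoided = downtime_hours_per_year(
        availability_low
    ) - downtime_hours_per_year(availability_high)
    return extra_cost_per_year / hours_avoided


if __name__ == "__main__":
    threshold = break_even_cost_per_hour(0.999, 10_000, 0.9999, 50_000)
    print(f"99.9%:  {downtime_hours_per_year(0.999):.2f} hours/year")
    print(f"99.99%: {downtime_hours_per_year(0.9999) * 60:.1f} minutes/year")
    print(f"Break-even downtime cost: ${threshold:,.0f}/hour")
```

The extra $480K/year buys back a little under 8 hours of downtime, so the higher tier only pays off when an hour of downtime costs on the order of $60K.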
Documenting Decisions
Decision template
# Decision: Black Friday 2024 Capacity
## Context
- Date: 2024-10-15
- Decision maker: [Name, Title]
- Participants: [Engineering, Product, Finance]
## Situation
[Problem summary]
## Options Considered
1. [Option A]: [Description]
2. [Option B]: [Description]
3. [Option C]: [Description]
## Decision
Chosen: [Option X]
Reason: [Justification]
## Accepted Trade-offs
- [What we're giving up]
- [Risks we're accepting]
## Success Metrics
- [How we'll know if it worked]
## Review
- Date: [When to reevaluate]
- Triggers: [What would change the decision]
Conclusion
Stress testing informs business decisions:
- Translate technical to impact - numbers executives understand
- Present options - not just problems
- Quantify risk - probability × impact
- Recommend with ROI - justify investment
- Document decision - for accountability
Capacity isn't determined by engineering — it's chosen by business based on trade-offs that engineering clarifies.
Engineering's job is to give options. Executives' job is to choose between them.
This article is part of the series on the OCTOPUS Performance Engineering methodology.