Methodology · 7 min read

Stress Testing as an Executive Decision: When to Risk, When to Invest

Stress testing results aren't just technical. They inform business decisions about risk, investment, and trade-offs.

"The system handles 2000 users. We expect 5000 on Black Friday." This isn't a technical problem — it's a business decision. Do we invest to scale? Accept degradation risk? Limit sales? This article teaches how to transform technical results into executive options.

Capacity is a business decision, not a technical constant.

The Executive Nature of Stress Testing

What engineering delivers

Technical result:
  - Current capacity: 2000 req/s
  - Breaking point: 2800 req/s
  - Bottleneck: Database connections
  - Failure mode: Graceful degradation

This answers: "What can the system do?"

What business decides

Executive decision:
  - What capacity do we need?
  - How much to invest to achieve it?
  - What risk is acceptable?
  - What trade-off to choose?

This answers: "What do we want the system to do?"

Decision Framework

Mapping options

Situation: Current capacity 2000, need 5000

Option A - Scale:
  Cost: $50K in infra + 2 sprints
  Result: Capacity for 6000
  Risk: Low
  Trade-off: Upfront investment

Option B - Optimize:
  Cost: 2 engineering sprints
  Result: Capacity for 4000
  Risk: Medium (may not be enough)
  Trade-off: Engineering time

Option C - Accept degradation:
  Cost: Implement rate limiting + queues
  Result: Serve 2000 with quality, rest in queue
  Risk: High (degraded experience)
  Trade-off: Impaired UX

Option D - Limit demand:
  Cost: Less marketing promotion
  Result: Demand within capacity
  Risk: Low
  Trade-off: Potential revenue lost
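
As a minimal sketch, the four options above can be held in a small data structure and checked against the required load. The `Option` class and the `covers_demand` and `shortfall` helpers are hypothetical names introduced for illustration; the capacities and risk levels are copied from the list above, and Option D is modeled with no capacity figure because it reduces demand instead.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Option:
    name: str
    capacity: Optional[int]  # resulting capacity; None when the option lowers demand instead
    risk: str                # "Low", "Medium" or "High", as listed above


OPTIONS = [
    Option("A - Scale", 6000, "Low"),
    Option("B - Optimize", 4000, "Medium"),
    Option("C - Accept degradation", 2000, "High"),
    Option("D - Limit demand", None, "Low"),
]


def covers_demand(option: Option, required: int) -> bool:
    """True when the option's resulting capacity meets the required load."""
    return option.capacity is not None and option.capacity >= required


def shortfall(option: Option, required: int) -> Optional[int]:
    """Load left unserved at full quality, or None for demand-side options."""
    if option.capacity is None:
        return None
    return max(0, required - option.capacity)


for opt in OPTIONS:
    print(opt.name, covers_demand(opt, 5000), shortfall(opt, 5000))
```

Only Option A fully covers the 5000 target; the shortfall for B and C is what the stakeholders are being asked to accept or mitigate.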

Decision criteria

Questions for stakeholders:

1. Cost of failure:
   "If the system goes down for 2 hours at peak,
    how much do we lose?"
   → Defines maximum budget for mitigation

2. Peak probability:
   "How confident are we in the 5000 user forecast?"
   → Adjusts risk tolerance

3. Degradation tolerance:
   "Will users accept a 30s wait at checkout?"
   → Enables graceful degradation options

4. Brand impact:
   "What's the reputational damage of a public failure?"
   → May justify larger investment

Communicating Results to Executives

What doesn't work

❌ "P95 latency rises to 2s above 2500 req/s"
   → The executive doesn't know what to do with this

❌ "We need 4 more EC2 c5.2xlarge instances"
   → No context on why, and no alternatives

❌ "The system will crash if there's too much load"
   → Alarmist and vague

What works

✅ "With expected Black Friday load, 40% of customers
    will have a degraded experience (checkout > 10s)
    or won't be able to complete their purchase."

✅ "We have 3 options:
    - Invest $50K now to guarantee capacity
    - Spend 2 sprints optimizing (risk of not solving)
    - Accept ~$200K loss in sales at peak"

✅ "We recommend option A for ROI: $50K to avoid
    potential loss of $200K+. Decision needed by
    [date] for implementation."

Executive presentation template

# Capacity Decision Brief - Black Friday 2024

## Situation
- Expected load: 5000 simultaneous users
- Current capacity: 2000 users
- Gap: 3000 users (60% of expected load)

## Impact if no action
- 60% of users with a degraded experience
- Estimated loss: $200-400K in sales
- Reputational risk: Social media complaints

## Options

| Option | Investment | Result | Risk |
|--------|------------|--------|------|
| A: Scale infra | $50K | Covers 100% of expected load | Low |
| B: Optimization | 2 sprints ($30K) | Covers 80% of expected load | Medium |
| C: Rate limiting | 1 sprint ($15K) | Controlled queue | High |
| D: Reduce marketing | $0 | Lower demand | Low |

## Recommendation
Option A: best ROI given the revenue at risk

## Timeline
- Decision needed: [date - 30 days]
- Implementation: 2 weeks
- Validation test: 1 week
- Safety margin: 1 week

## Next step
Budget approval to start provisioning
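
The Situation figures in the brief follow directly from the expected load and the measured capacity. A minimal sketch, assuming the gap is expressed as a share of expected load; `capacity_gap` is a hypothetical helper name, not part of any tool:

```python
def capacity_gap(expected_users: int, current_capacity: int) -> tuple:
    """Return (users over capacity, share of expected load over capacity)."""
    if expected_users <= 0:
        raise ValueError("expected_users must be positive")
    over = max(0, expected_users - current_capacity)
    return over, over / expected_users


over, share = capacity_gap(expected_users=5000, current_capacity=2000)
print(f"Gap: {over} users ({share:.0%} of expected load)")
# prints: Gap: 3000 users (60% of expected load)
```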

Quantifying Risk

Impact model

Scenario: Black Friday without scaling

Probability of peak > capacity: 70%
If it occurs:
  - Users affected: 3000 (60% of peak)
  - Normal conversion rate: 3%
  - Degraded rate: 0.5%
  - Average ticket: $100
  - Peak duration: 4 hours

Calculation (assuming 3000 affected users in each peak hour):
  Lost conversions per hour: 3000 × (3% - 0.5%) = 75
  Revenue lost per hour: 75 × $100 = $7,500
  Total revenue lost: $7,500 × 4 = $30,000
  Adjusted by probability: $30,000 × 70% = $21,000

Conclusion:
  Expected loss is $21,000, so a $15K investment to avoid it is justified
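
The same calculation can be written as a small function so that stakeholders can vary the inputs. This is a sketch under one stated assumption: the 3000 affected users are treated as an hourly figure, which is what the per-hour revenue line implies. The function and parameter names are illustrative.

```python
def expected_loss(
    affected_users_per_hour: float,
    normal_rate: float,
    degraded_rate: float,
    average_ticket: float,
    peak_hours: float,
    probability: float,
) -> float:
    """Probability-weighted revenue lost during a peak that exceeds capacity."""
    lost_conversions_per_hour = affected_users_per_hour * (normal_rate - degraded_rate)
    loss_per_hour = lost_conversions_per_hour * average_ticket
    total_loss = loss_per_hour * peak_hours
    return total_loss * probability


loss = expected_loss(
    affected_users_per_hour=3000,
    normal_rate=0.03,
    degraded_rate=0.005,
    average_ticket=100,
    peak_hours=4,
    probability=0.70,
)
print(f"Expected loss: ${loss:,.0f}")
# prints: Expected loss: $21,000
```

Any mitigation that costs less than this expected loss has a positive expected return, which is the comparison behind the $15K conclusion.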

Risk matrix

           │ Low Impact    │ High Impact
───────────┼───────────────┼──────────────
High       │ Monitor       │ ACT NOW
Probability│               │
───────────┼───────────────┼──────────────
Low        │ Accept        │ Contingency
Probability│               │ plan

Example:
  - Black Friday: High prob. + High impact → ACT
  - Viral bug: Low prob. + High impact → Contingency
  - Daily peak: High prob. + Low impact → Monitor
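
The matrix is a lookup from (probability, impact) to an action. A minimal sketch, assuming each dimension has already been judged "high" or "low" by the team; how that judgment is made is outside the matrix, and `recommended_action` is a hypothetical name:

```python
RISK_MATRIX = {
    ("high", "high"): "ACT NOW",
    ("high", "low"): "Monitor",
    ("low", "high"): "Contingency plan",
    ("low", "low"): "Accept",
}


def recommended_action(probability: str, impact: str) -> str:
    """Look up the action for a (probability, impact) pair of 'high'/'low' levels."""
    key = (probability.lower(), impact.lower())
    if key not in RISK_MATRIX:
        raise ValueError(f"levels must be 'high' or 'low', got {key}")
    return RISK_MATRIX[key]


print(recommended_action("high", "high"))  # Black Friday
print(recommended_action("low", "high"))   # Viral bug
print(recommended_action("high", "low"))   # Daily peak
```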

Common Trade-offs

Performance vs Cost

Option A: 10 small servers
  - Cost: $500/month
  - Capacity: 3000 req/s
  - Latency: p95 = 500ms

Option B: 3 large servers
  - Cost: $600/month
  - Capacity: 3000 req/s
  - Latency: p95 = 200ms

Decision: How much is 300ms of latency worth?
  - If it increases conversion by 0.5%: Option B is worth its $600/month (an extra $100/month)
  - If it doesn't affect conversion: choose A
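
One way to make the latency question concrete is to ask how much monthly revenue the business needs before the faster option pays for itself. The sketch below rests on two assumptions not stated in the comparison: the 0.5% is a relative uplift in revenue, and margin is ignored. `break_even_monthly_revenue` is a hypothetical helper.

```python
def break_even_monthly_revenue(extra_cost_per_month: float, relative_uplift: float) -> float:
    """Monthly revenue above which the uplift covers the extra infrastructure cost."""
    if relative_uplift <= 0:
        raise ValueError("relative_uplift must be positive")
    return extra_cost_per_month / relative_uplift


extra_cost = 600 - 500  # Option B minus Option A, in $/month
threshold = break_even_monthly_revenue(extra_cost, relative_uplift=0.005)
print(f"Option B pays off above ${threshold:,.0f}/month in revenue")
```

If the faster servers do not move conversion at all, there is no break-even point and Option A remains the cheaper choice.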

Capacity vs Development time

Option A: Scale hardware
  - Time: 1 week
  - Recurring cost: $2K/month
  - Sustainable: limited

Option B: Optimize code
  - Time: 4 weeks
  - Recurring cost: $0
  - Sustainable: indefinitely

Decision:
  - If the event is in 2 weeks: Option A
  - If the event is in 2 months: Option B
  - If both apply (one event in 2 weeks, another in 2 months): A now, B later
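
The rule above depends only on whether each option can be finished before the event. A minimal sketch, assuming the lead times from the comparison (1 week to scale, 4 weeks to optimize) and a list of upcoming event horizons in weeks; the function name and the fallback message are illustrative:

```python
from typing import Sequence


def choose_strategy(
    weeks_until_events: Sequence[float],
    scale_weeks: float = 1,
    optimize_weeks: float = 4,
) -> str:
    """Pick scaling (A), optimization (B), or both, based on available lead time."""
    if not weeks_until_events:
        raise ValueError("at least one event horizon is required")
    if any(w < scale_weeks for w in weeks_until_events):
        return "Neither fits in time"
    needs_scaling = any(w < optimize_weeks for w in weeks_until_events)
    can_optimize = any(w >= optimize_weeks for w in weeks_until_events)
    if needs_scaling and can_optimize:
        return "A now, B later"
    return "A" if needs_scaling else "B"


print(choose_strategy([2]))     # event in 2 weeks
print(choose_strategy([8]))     # event in about 2 months
print(choose_strategy([2, 8]))  # both
```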

Availability vs Cost

99.9% availability:
  - Downtime: 8.76 hours/year
  - Cost: $10K/month
  - Complexity: Low

99.99% availability:
  - Downtime: ~52.6 minutes/year
  - Cost: $50K/month
  - Complexity: High

Decision:
  - Cost of downtime/hour: $X
  - Extra cost: $40K/month = $480K/year
  - Downtime avoided: ~7.9 hours/year
  - Break-even: when $480K/year < 7.9h × $X
  - If $X > ~$61K/hour: 99.99% is worth it
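
The break-even only holds when cost and downtime are compared over the same period. The sketch below annualizes both sides (monthly cost times 12, downtime in hours per year) for the two tiers above; the helper names are hypothetical and a 365-day year is assumed.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, assuming a non-leap year


def downtime_hours_per_year(availability: float) -> float:
    """Expected downtime in hours per year for an availability such as 0.999."""
    return (1 - availability) * HOURS_PER_YEAR


def break_even_downtime_cost(
    low_availability: float,
    low_cost_per_month: float,
    high_availability: float,
    high_cost_per_month: float,
) -> float:
    """Hourly downtime cost above which the higher tier pays for itself."""
    extra_cost_per_year = (high_cost_per_month - low_cost_per_month) * 12
    hours_avoided = downtime_hours_per_year(low_availability) - downtime_hours_per_year(
        high_availability
    )
    return extra_cost_per_year / hours_avoided


threshold = break_even_downtime_cost(0.999, 10_000, 0.9999, 50_000)
print(f"99.9%:  {downtime_hours_per_year(0.999):.2f} hours/year")
print(f"99.99%: {downtime_hours_per_year(0.9999) * 60:.1f} minutes/year")
print(f"Break-even downtime cost: ${threshold:,.0f}/hour")
```

With these inputs the higher tier is justified only when an hour of downtime costs roughly $61K or more.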

Documenting Decisions

Decision template

# Decision: Black Friday 2024 Capacity

## Context
- Date: 2024-10-15
- Decision maker: [Name, Title]
- Participants: [Engineering, Product, Finance]

## Situation
[Problem summary]

## Options Considered
1. [Option A]: [Description]
2. [Option B]: [Description]
3. [Option C]: [Description]

## Decision
Chosen: [Option X]
Reason: [Justification]

## Accepted Trade-offs
- [What we're giving up]
- [Risks we're accepting]

## Success Metrics
- [How we'll know if it worked]

## Review
- Date: [When to reevaluate]
- Triggers: [What would change the decision]

Conclusion

Stress testing informs business decisions:

  1. Translate technical to impact - numbers executives understand
  2. Present options - not just problems
  3. Quantify risk - probability × impact
  4. Recommend with ROI - justify investment
  5. Document decision - for accountability

Capacity isn't determined by engineering — it's chosen by business based on trade-offs that engineering clarifies.

Engineering's job is to give options. Executives' job is to choose between them.


This article is part of the series on the OCTOPUS Performance Engineering methodology.
