
Stress Testing: discovering your system's limits

Stress testing goes beyond load testing. Learn when to use it, how to execute it, and what to do with the results.

Load testing validates whether the system handles the expected load. Stress testing discovers where it breaks. Different objectives, different techniques, different insights. This article explains when and how to stress-test your system in a controlled way.

Load testing asks "can it handle it?". Stress testing asks "how much can it handle?".

The Difference Between Load and Stress Testing

Load Testing

Objective: Validate behavior under expected load
Load: Normal to predicted peak
Duration: Hours (steady state)
Result: Pass/Fail on SLOs

Example:
  - Load: 1000 req/s (expected peak)
  - Duration: 2 hours
  - Criterion: p95 < 500ms, error rate < 1%
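
This kind of load test spec translates fairly directly into a k6 configuration: the pass/fail criteria become thresholds and the fixed request rate becomes a constant-arrival-rate scenario. A minimal sketch, where the executor sizing (preAllocatedVUs, maxVUs) is an assumption to tune per system:

  export const options = {
    scenarios: {
      expected_peak: {
        executor: 'constant-arrival-rate',
        rate: 1000,            // 1000 iterations per second (≈ req/s with one request per iteration)
        timeUnit: '1s',
        duration: '2h',
        preAllocatedVUs: 500,  // illustrative sizing
        maxVUs: 2000,
      },
    },
    thresholds: {
      http_req_duration: ['p(95)<500'],  // p95 < 500ms
      http_req_failed: ['rate<0.01'],    // error rate < 1%
    },
  };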

Stress Testing

Objective: Find limits and failure points
Load: Above expected, increasing
Duration: Until degradation or failure
Result: Maximum capacity, failure mode

Example:
  - Load: 1000 → 2000 → 3000 → ... req/s
  - Duration: Until first bottleneck
  - Result: "System handles 2500 req/s,
             fails at connection pool at 2800"

Why Run Stress Tests

1. Know the real capacity

Without stress test:
  "We think it handles 1000 users"

With stress test:
  "We validated it handles 3200 users,
   bottleneck is DB connections,
   overload behavior: graceful degradation"

2. Understand the failure mode

Questions a stress test answers:
  - What fails first?
  - Does it fail gradually or catastrophically?
  - Does it recover automatically?
  - How long does recovery take?

3. Validate protection mechanisms

A stress test verifies that these mechanisms actually kick in (see the sketch after this list):
  - Rate limiting
  - Circuit breakers
  - Autoscaling
  - Graceful degradation
  - Queue backpressure
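
For example, if the API rate-limits with HTTP 429 under overload, a k6 check can assert that requests are either served or cleanly rejected, never failed with 5xx. A minimal sketch; the endpoint URL and the custom metric name are assumptions:

  import http from 'k6/http';
  import { check } from 'k6';
  import { Rate } from 'k6/metrics';

  // Fraction of requests rejected by the rate limiter (illustrative metric name)
  const rateLimited = new Rate('rate_limited');

  export default function () {
    const res = http.get('https://staging.example.com/api/orders'); // placeholder URL
    rateLimited.add(res.status === 429);
    check(res, {
      // Under overload we expect success or a clean 429, not a 5xx from an exhausted backend
      'handled or rate limited': (r) => r.status === 200 || r.status === 429,
    });
  }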

4. Prepare for the unexpected

Events that exceed prediction:
  - Unexpected viral campaign
  - TV/media mention
  - Bot flood (accidental or intentional)
  - Outage recovery (thundering herd)

Types of Stress Tests

1. Step-Up Stress

Profile:
  ┌─────────────────────────────────┐
  │ Load                            │
  │    ▲                   ┌───┐    │
  │    │             ┌─────┘   │    │
  │    │       ┌─────┘         │    │
  │    │ ┌─────┘               │    │
  │    └─┴─────────────────────┴──▶ │
  │                            Time │
  └─────────────────────────────────┘

Implementation (k6):
  stages: [
    { duration: '10m', target: 1000 },  // Step 1
    { duration: '10m', target: 2000 },  // Step 2
    { duration: '10m', target: 3000 },  // Step 3
    { duration: '10m', target: 4000 },  // Step 4
  ]

Use: Find degradation point gradually
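
The stages fragment above drops into a complete k6 script like the sketch below (the target URL is a placeholder). Note that k6 stage targets are virtual users; to step the request rate directly, a ramping-arrival-rate scenario would be used instead. The later profiles differ only in their stages array.

  import http from 'k6/http';
  import { check, sleep } from 'k6';

  export const options = {
    stages: [
      { duration: '10m', target: 1000 },
      { duration: '10m', target: 2000 },
      { duration: '10m', target: 3000 },
      { duration: '10m', target: 4000 },
    ],
  };

  export default function () {
    const res = http.get('https://staging.example.com/api/health'); // placeholder URL
    check(res, { 'status is 200': (r) => r.status === 200 });
    sleep(1);
  }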

2. Spike Test

Profile:
  ┌─────────────────────────────────┐
  │ Load                            │
  │    ▲        ┌───┐               │
  │    │        │   │               │
  │    │        │   │               │
  │    │ ───────┘   └───────────    │
  │    └────────────────────────▶   │
  │                            Time │
  └─────────────────────────────────┘

Implementation (k6):
  stages: [
    { duration: '5m', target: 500 },   // Normal
    { duration: '1m', target: 5000 },  // Spike
    { duration: '5m', target: 5000 },  // Sustain
    { duration: '1m', target: 500 },   // Drop
    { duration: '10m', target: 500 },  // Recovery
  ]

Use: Validate response to sudden spike and recovery

3. Sustained Overload

Profile:
  ┌─────────────────────────────────┐
  │ Load                            │
  │    ▲                            │
  │    │ ┌─────────────────────┐    │
  │    │ │                     │    │
  │    │ │  Above capacity     │    │
  │    └─┴─────────────────────┴──▶ │
  │                            Time │
  └─────────────────────────────────┘

Implementation:
  stages: [
    { duration: '5m', target: 3000 },  // Ramp
    { duration: '60m', target: 3000 }, // Sustain overload
  ]

Use: Observe prolonged degradation, memory leaks

4. Breaking Point (Destructive)

Profile:
  ┌─────────────────────────────────┐
  │ Load                            │
  │    ▲                      ╱     │
  │    │                    ╱       │
  │    │                  ╱         │
  │    │                ╱           │
  │    └──────────────╱───────────▶ │
  │                            Time │
  └─────────────────────────────────┘

Implementation:
  stages: [
    { duration: '60m', target: 10000 }, // Continuous ramp
  ]

Use: Find absolute limit (until OOM, timeout, crash)

Metrics During a Stress Test

What to observe

Performance:
  - Latency (p50, p95, p99)
  - Effective throughput
  - Error rate
  - Timeout rate

Resources:
  - CPU (all pods/instances)
  - Memory (and GC if applicable)
  - Connections (DB, cache, external)
  - I/O (disk, network)

Application:
  - Queue depths
  - Thread pool usage
  - Connection pool usage
  - Active requests

System:
  - Pod/instance health
  - Autoscaling events
  - Circuit breaker states
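
In k6, latency percentiles and error rate come from the built-in http_req_duration and http_req_failed metrics; timeouts can be surfaced with a custom metric. A minimal sketch, where the endpoint and the 5-second client timeout are assumptions:

  import http from 'k6/http';
  import { Rate, Trend } from 'k6/metrics';

  // Custom metrics (names are illustrative)
  const timeoutRate = new Rate('timeout_rate');
  const ttfb = new Trend('time_to_first_byte', true);

  export default function () {
    const res = http.get('https://staging.example.com/api/search', { timeout: '5s' }); // placeholder
    timeoutRate.add(res.status === 0); // k6 reports status 0 on timeouts and connection errors
    ttfb.add(res.timings.waiting);     // time to first byte
  }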

Identifying the bottleneck

Symptoms by bottleneck type:

CPU-bound:
  - CPU at 100%
  - Latency rises linearly
  - Throughput plateaus

Memory-bound:
  - Memory growing
  - Frequent/long GC
  - Eventual OOM

Connection-bound:
  - Pool at 100%
  - Timeouts increasing
  - Requests queued

I/O-bound:
  - Low CPU
  - High disk IOPS
  - Network saturated

Executing a Stress Test

Preparation

Checklist:
  - [ ] Isolated environment (doesn't affect production)
  - [ ] Complete monitoring active
  - [ ] Production alerts silenced
  - [ ] Team aware (SRE, infra)
  - [ ] Rollback plan (if in shared staging)
  - [ ] Stop criteria defined

During the test

Monitor in real-time:
  - Metrics dashboard
  - Error logs
  - Pod status
  - Autoscaling

Document:
  - Event timestamps
  - First sign of degradation
  - Behavior under stress
  - Observed errors

Stop criteria

Stop when:
  - Error rate > 50%
  - Latency > 30s
  - OOM detected
  - Critical component down
  - Data corrupted

Don't stop just for:
  - Gradual degradation
  - Moderate error rate
  - High latency but responsive
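
The hard stop criteria above can be wired into k6 itself so the run aborts automatically; a minimal sketch, with the evaluation delay chosen as an assumption:

  export const options = {
    thresholds: {
      // Abort the run if more than 50% of requests fail
      http_req_failed: [{ threshold: 'rate<0.5', abortOnFail: true, delayAbortEval: '2m' }],
      // Abort if p95 latency exceeds 30s (30000 ms)
      http_req_duration: [{ threshold: 'p(95)<30000', abortOnFail: true, delayAbortEval: '2m' }],
    },
  };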

Interpreting Results

Saturation curve

           ┌─────────────────────────────────┐
           │                                 │
 Latency   │             ╱╱╱╱╱               │
    or     │           ╱╱                    │
 Error %   │         ╱                       │
           │       ╱                         │
           │  ────╱   "Knee" (inflection     │
           │          point)                 │
           └─────────────────────────────────┘
                    Throughput →

Green zone: Stable performance
Knee: Saturation start
Red zone: Rapid degradation

Documenting results

## Stress Test Report - 2024-01-20

### Configuration
- Environment: Staging (3x prod)
- Baseline: 1000 req/s
- Test: Step-up until failure

### Results

| Load | p95 | Error % | Observation |
|------|-----|---------|-------------|
| 1000 | 120ms | 0.1% | Normal |
| 1500 | 150ms | 0.2% | OK |
| 2000 | 200ms | 0.5% | OK |
| 2500 | 350ms | 1.2% | Degradation start |
| 3000 | 800ms | 5% | Visible degradation |
| 3500 | 2s | 15% | Severe |
| 4000 | Timeout | 40% | Failure |

### Analysis
- Maximum sustainable capacity: 2000 req/s
- Knee point: 2500 req/s
- Primary bottleneck: DB connection pool
- Failure mode: Graceful degradation until 3500,
                then timeout cascade

### Recommendations
1. Increase connection pool from 50 to 100
2. Implement circuit breaker for DB
3. Consider read replica for read queries
4. Retest after adjustments

Stress Testing in CI/CD

When to include

Not on every PR:
  - Too slow
  - Expensive resources

Include on:
  - Production releases
  - Infra changes
  - New critical endpoints
  - Connection handling changes

Nightly:
  - Complete stress test
  - Report for morning review

Automation

# Pipeline example (GitLab CI): runs on main, release branches,
# and the nightly schedule
stress_test:
  stage: test
  script:
    - k6 run stress-test.js
    - python analyze_results.py
  artifacts:
    paths:
      - stress-report.html
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule"
    - if: $CI_COMMIT_BRANCH == "main"
    - if: $CI_COMMIT_BRANCH =~ /^release-.*/
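
The report artifact can also be produced by k6 itself: handleSummary receives the end-of-test summary and can write it to a file for a later step to render as HTML. A minimal sketch:

  // stress-test.js (excerpt)
  export function handleSummary(data) {
    return {
      'stress-report.json': JSON.stringify(data, null, 2), // raw summary for post-processing
      stdout: '\nStress test finished; summary written to stress-report.json\n',
    };
  }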

Conclusion

Stress testing is essential to:

  1. Know limits - don't guess, know
  2. Understand failures - how, where, when
  3. Validate protections - do circuit breakers work?
  4. Prepare for incidents - know what to expect

The difference between a resilient system and a fragile one is knowing where the limits are before finding them in production.

Every system breaks under enough pressure. The question is: do you know where and how?


This article is part of the series on the OCTOPUS Performance Engineering methodology.


Want to understand your platform's limits?

Contact us for a performance assessment.
