Every system has a limit. The question isn't "if" it will break, but "where" and "how". Knowing the breaking point lets you prepare the system to fail gracefully instead of catastrophically. This article shows you how to find, document, and use that information.
Systems don't fail randomly. They fail at specific points, in predictable ways.
What Is a Breaking Point?
Definition
Breaking point: the load or condition at which the system stops meeting its minimum functioning criteria.
Can be defined by:
- Error rate > X%
- Latency > X seconds
- Throughput drops below X
- Critical component fails
- Data corrupted
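As a rough illustration, these criteria can be expressed as an explicit check against measured metrics. The thresholds below are hypothetical placeholders, not values from the analysis later in this article.

```python
from dataclasses import dataclass

@dataclass
class Metrics:
    error_rate: float      # fraction of failed requests, e.g. 0.05 = 5%
    p95_latency_s: float   # 95th percentile latency in seconds
    throughput_rps: float  # successful requests per second

# Hypothetical minimum functioning criteria for an example service.
MAX_ERROR_RATE = 0.05
MAX_P95_LATENCY_S = 2.0
MIN_THROUGHPUT_RPS = 1500

def breaking_criteria_violations(m: Metrics) -> list[str]:
    """Return the criteria currently violated (empty list = still functioning)."""
    violations = []
    if m.error_rate > MAX_ERROR_RATE:
        violations.append(f"error rate {m.error_rate:.1%} > {MAX_ERROR_RATE:.0%}")
    if m.p95_latency_s > MAX_P95_LATENCY_S:
        violations.append(f"p95 latency {m.p95_latency_s:.1f}s > {MAX_P95_LATENCY_S:.1f}s")
    if m.throughput_rps < MIN_THROUGHPUT_RPS:
        violations.append(f"throughput {m.throughput_rps:.0f} < {MIN_THROUGHPUT_RPS} req/s")
    return violations

print(breaking_criteria_violations(Metrics(0.30, 12.0, 900)))
```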
Types of breaking points
Soft breaking point:
- Gradual degradation
- Some requests fail
- System partially functional
- Recovery possible without intervention
Hard breaking point:
- Catastrophic failure
- System unresponsive
- Requires manual intervention
- Possible data loss
Finding the Breaking Point
Progressive testing
Method:
1. Start at normal load
2. Increment gradually
3. Observe metrics continuously
4. Identify first sign of degradation
5. Continue until failure
6. Document each threshold
Example:
Step 1: 1000 req/s → OK
Step 2: 1500 req/s → OK
Step 3: 2000 req/s → Degradation (p95 rises)
Step 4: 2500 req/s → Error rate 5%
Step 5: 3000 req/s → Error rate 30%
Step 6: 3500 req/s → Widespread timeouts
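A minimal sketch of this stepping pattern is shown below. It assumes a hypothetical local endpoint and uses only the standard library; a real test would use a dedicated load tool (k6, Gatling, Locust, etc.), since a single Python process cannot reliably sustain thousands of requests per second.

```python
import statistics
import time
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor

TARGET = "http://localhost:8080/health"      # hypothetical endpoint
STEPS_RPS = [1000, 1500, 2000, 2500, 3000, 3500]
STEP_DURATION_S = 60

def one_request() -> tuple[bool, float]:
    """Issue one request; return (success, latency in seconds)."""
    start = time.perf_counter()
    try:
        urllib.request.urlopen(TARGET, timeout=5).close()
        ok = True
    except (urllib.error.URLError, TimeoutError):
        ok = False                           # 4xx/5xx raise HTTPError, a URLError subclass
    return ok, time.perf_counter() - start

def run_step(rate_rps: int) -> None:
    """Open-loop pacing: keep submitting at the target rate even if responses lag."""
    interval = 1.0 / rate_rps
    deadline = time.monotonic() + STEP_DURATION_S
    with ThreadPoolExecutor(max_workers=200) as pool:
        futures = []
        while time.monotonic() < deadline:
            futures.append(pool.submit(one_request))
            time.sleep(interval)
        results = [f.result() for f in futures]
    error_rate = sum(1 for ok, _ in results if not ok) / max(len(results), 1)
    p95 = statistics.quantiles([lat for _, lat in results], n=20)[18]
    print(f"{rate_rps} req/s -> error rate {error_rate:.1%}, p95 {p95 * 1000:.0f} ms")

for rate in STEPS_RPS:
    run_step(rate)
```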
Identifying components
For each step, identify:
- Which component saturated?
- Which metric degraded first?
- How did it affect other components?
Example:
At 2500 req/s:
- DB connection pool: 100%
- Requests queue up
- Timeout in service A
- Service B receives timeout from A
- Cascade of failures
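One way to capture this during the test is to poll each candidate component and record the first one to cross its saturation threshold. The pollers below are placeholders; in practice they would query your monitoring stack (Prometheus, CloudWatch, pg_stat_activity, etc.).

```python
import time

# Placeholder pollers returning utilization in [0, 1]; real versions would
# query the monitoring system.
def db_pool_utilization() -> float: return 0.0
def cpu_utilization() -> float: return 0.0
def queue_depth_ratio() -> float: return 0.0

SATURATION_THRESHOLDS = {
    "db_connection_pool": (db_pool_utilization, 0.95),
    "cpu":                (cpu_utilization, 0.90),
    "request_queue":      (queue_depth_ratio, 0.80),
}

def first_saturated_component(duration_s: float = 60.0,
                              poll_interval_s: float = 1.0) -> str | None:
    """Poll components during a load step; return the first to saturate, if any."""
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        for name, (poll, threshold) in SATURATION_THRESHOLDS.items():
            if poll() >= threshold:
                return name
        time.sleep(poll_interval_s)
    return None
```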
Failure Modes
1. Graceful Degradation
Behavior:
- Performance degrades gradually
- Non-critical features stop
- Core continues working
- Recovers automatically when load drops
Example:
"Under 3x load, search is slow (5s),
but checkout continues working (1s)"
Implementation:
- Request prioritization
- Shedding of non-critical load
- Circuit breakers
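A minimal sketch of load shedding driven by the measured breaking point: once load approaches the (assumed) soft breaking point, non-critical features are switched off while the core path keeps running. Feature names and thresholds are hypothetical.

```python
SOFT_BREAKING_POINT_RPS = 2500            # assumed, from a breaking-point analysis
DEGRADE_AT_RPS = 0.8 * SOFT_BREAKING_POINT_RPS

# Hypothetical features that can be shed without breaking the core flow.
NON_CRITICAL_FEATURES = {"recommendations", "search_suggestions", "analytics_beacon"}

def feature_enabled(feature: str, current_rps: float) -> bool:
    """Shed non-critical features as load approaches the soft breaking point."""
    if current_rps < DEGRADE_AT_RPS:
        return True
    return feature not in NON_CRITICAL_FEATURES

# At 2300 req/s, recommendations are shed but checkout keeps working.
print(feature_enabled("recommendations", 2300))   # False
print(feature_enabled("checkout", 2300))          # True
```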
2. Graceful Failure
Behavior:
- System recognizes overload
- Refuses new requests (503)
- Completes in-progress requests
- Doesn't corrupt data
Example:
"Above 5000 req/s, returns 503 for
new requests but completes current ones"
Implementation:
- Rate limiting
- Admission control
- Queue limits
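A minimal sketch of admission control: a fixed number of in-flight slots, sized from the measured capacity, with new requests rejected immediately (503 plus Retry-After) instead of queueing until they time out. The handler and limit are hypothetical.

```python
import threading
from http import HTTPStatus

MAX_IN_FLIGHT = 500                       # hypothetical, sized from measured capacity
_slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)

def handle(request) -> tuple[int, dict, str]:
    """Refuse new work when saturated; finish work that was already admitted."""
    if not _slots.acquire(blocking=False):
        return (HTTPStatus.SERVICE_UNAVAILABLE, {"Retry-After": "5"}, "overloaded")
    try:
        return (HTTPStatus.OK, {}, process(request))
    finally:
        _slots.release()

def process(request) -> str:              # placeholder for the real business logic
    return "ok"
```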
3. Cascading Failure
Behavior:
- One component fails
- Dependents become overloaded
- Failure propagates
- Entire system goes down
Example:
"DB gets slow → API timeout →
Clients retry → DB worse →
All services fail"
Prevention:
- Circuit breakers
- Aggressive timeouts
- Retry with backoff
- Failure isolation
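Retries are part of the problem in this mode: naive retry loops multiply load on a dependency that is already struggling. Below is a sketch of capped, jittered exponential backoff, which keeps retries from turning a slowdown into a storm.

```python
import random
import time

def call_with_backoff(fn, max_attempts: int = 4,
                      base_delay_s: float = 0.1, max_delay_s: float = 2.0):
    """Retry a flaky call with capped, jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise                                  # give up instead of retrying forever
            delay = min(max_delay_s, base_delay_s * 2 ** attempt)
            time.sleep(random.uniform(0, delay))       # "full jitter" spreads retries out

# Usage: call_with_backoff(lambda: db.query("..."))   # hypothetical call
```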
4. Byzantine Failure
Behavior:
- Inconsistent behavior
- Partially correct responses
- Hard to detect
- Possible data corruption
Example:
"Under stress, cache returns stale
data mixed with current"
Prevention:
- Data validation
- Deep health checks
- Strict timeouts
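One concrete defense for the stale-cache example: validate freshness on read and treat suspiciously old entries as misses rather than serving them. The staleness budget and loader below are assumptions.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable

MAX_STALENESS_S = 30.0                    # assumed freshness budget for this data

@dataclass
class CacheEntry:
    value: Any
    written_at: float                     # time.time() at write

def read_validated(cache: dict, key: str, load_from_source: Callable[[str], Any]) -> Any:
    """Serve cached data only while it is provably fresh; otherwise reload it."""
    entry = cache.get(key)
    now = time.time()
    if entry is not None and now - entry.written_at <= MAX_STALENESS_S:
        return entry.value
    value = load_from_source(key)         # fall back to the source of truth
    cache[key] = CacheEntry(value, now)
    return value
```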
Documenting Breaking Points
Documentation template
# Breaking Point Analysis - System XYZ
## Executive Summary
- Maximum sustainable capacity: 2000 req/s
- Soft breaking point: 2500 req/s
- Hard breaking point: 3500 req/s
- First bottleneck: DB connection pool
## Breakdown by Load
### 2000 req/s (Maximum Sustainable Capacity)
| Metric | Value | Status |
|--------|-------|--------|
| p95 Latency | 250ms | ✓ OK |
| Error Rate | 0.3% | ✓ OK |
| CPU | 75% | ✓ OK |
| DB Connections | 80% | ⚠ Warning |
**Observation**: Stable operation, limited headroom
### 2500 req/s (Soft Breaking Point)
| Metric | Value | Status |
|--------|-------|--------|
| p95 Latency | 800ms | ⚠ Degraded |
| Error Rate | 3% | ⚠ Degraded |
| CPU | 85% | ✓ OK |
| DB Connections | 100% | ❌ Saturated |
**Observation**: DB connection pool saturated.
Requests queue up. System still functional.
### 3500 req/s (Hard Breaking Point)
| Metric | Value | Status |
|--------|-------|--------|
| p95 Latency | >30s | ❌ Failed |
| Error Rate | 45% | ❌ Failed |
| CPU | 95% | ⚠ High |
| DB Connections | 100% | ❌ Saturated |
**Observation**: Timeout cascade.
System non-functional for most users.
## Failure Sequence
1. DB connection pool reaches 100%
2. Requests wait for connection (latency rises)
3. Timeouts start after 5s
4. Clients retry (load increases)
5. More timeouts, more retries
6. Cascade effect until widespread failure
## Critical Component
- **Component**: PostgreSQL connection pool
- **Current limit**: 100 connections
- **Suggestion**: Increase to 200 or implement PgBouncer
## Observed Failure Mode
- **Type**: Cascading failure
- **Recovery**: Manual (restart needed above 4000 req/s)
- **Recovery time**: ~5 minutes after load reduction
## Recommendations
1. Implement circuit breaker for DB
2. Increase connection pool or use pooler
3. Add rate limiting at 2200 req/s
4. Alert on DB connections > 70%
Using the Information
For capacity planning
Given:
- Soft breaking point: 2500 req/s
- Expected growth: 50% per year
- Current load: 1000 req/s
Calculation:
- Current headroom: 2.5x
- In 1 year: 1500 req/s (1.67x headroom)
- In 2 years: 2250 req/s (1.1x headroom) ⚠
Action:
Plan scaling or optimization within the next 18 months
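The arithmetic behind that conclusion, as a small sketch using the numbers above: at 50% annual growth, load reaches the soft breaking point in roughly 2.3 years, so acting around the 18-month mark leaves a safety margin.

```python
import math

current_rps = 1000
soft_breaking_rps = 2500
annual_growth = 0.50                      # 50% per year

# current * (1 + growth)^t = soft_breaking  =>  t = ln(ratio) / ln(1 + growth)
years_to_soft_break = math.log(soft_breaking_rps / current_rps) / math.log(1 + annual_growth)
print(f"~{years_to_soft_break:.1f} years (~{years_to_soft_break * 12:.0f} months) until the soft breaking point")

for year in (0, 1, 2):
    load = current_rps * (1 + annual_growth) ** year
    print(f"year {year}: {load:.0f} req/s, headroom {soft_breaking_rps / load:.2f}x")
```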
For alerts
Alerts based on breaking point:
Warning (≈70% of the soft breaking point):
- DB connections > 70%
- p95 > 200ms
- CPU > 60%
Critical (≈90% of the soft breaking point):
- DB connections > 90%
- p95 > 500ms
- Error rate > 1%
Emergency (hard breaking point imminent):
- DB connections = 100%
- p95 > 2s
- Error rate > 10%
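Load-based tiers like these can be derived directly from the measured breaking point rather than guessed. A minimal sketch, assuming the 2500 req/s soft breaking point from the example above:

```python
SOFT_BREAKING_RPS = 2500                  # from the breaking-point analysis above

ALERT_TIERS = {                           # ordered from lowest to highest threshold
    "warning":   0.70 * SOFT_BREAKING_RPS,   # 1750 req/s
    "critical":  0.90 * SOFT_BREAKING_RPS,   # 2250 req/s
    "emergency": 1.00 * SOFT_BREAKING_RPS,   # at or beyond the soft breaking point
}

def alert_tier(current_rps: float) -> str | None:
    """Return the highest alert tier the current load has crossed, if any."""
    tier = None
    for name, threshold in ALERT_TIERS.items():
        if current_rps >= threshold:
            tier = name
    return tier

print(alert_tier(1800))   # "warning"
print(alert_tier(2600))   # "emergency"
```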
For runbooks
## Runbook: System under stress
### Indicators
- Error rate above 2%
- Latency p95 above 1s
- DB connections above 90%
### Immediate Actions
1. Check traffic (real spike or attack?)
2. If legitimate spike:
- Activate rate limiting
- Scale horizontally if possible
3. If attack:
- Activate WAF rules
- Block suspicious IPs
### Mitigation
- Disable non-critical features
- Reduce batch sizes
- Increase timeout (carefully)
### Escalation
- If not resolved in 10 min → On-call SRE
- If degradation > 50% → Incident commander
Preventing Catastrophic Failures
Protection mechanisms
Rate Limiting:
- Limit requests per client
- Prioritize by request type
- Configure limits at ~80% of measured capacity (see the sketch below)
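A minimal token-bucket sketch for this, with the rate set at ~80% of the 2500 req/s soft breaking point from the example analysis; per-client keying and distributed state are left out.

```python
import threading
import time

class TokenBucket:
    """Thread-safe token bucket: allow() refills lazily and spends one token per request."""

    def __init__(self, rate_per_s: float, burst: int):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = float(burst)
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def allow(self) -> bool:
        with self.lock:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

limiter = TokenBucket(rate_per_s=2000, burst=200)   # ~80% of a 2500 req/s soft breaking point
print(limiter.allow())                              # True until the bucket drains
```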
Circuit Breakers:
- Per external dependency
- Fail fast when a dependency is degraded
- Auto-reset after a cooldown period
Load Shedding:
- Reject requests when saturated
- Prioritize critical requests
- Return 503 with Retry-After
Bulkheads:
- Isolate critical components
- Separate pools by function
- Prevent failure propagation
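A minimal bulkhead sketch: each function gets its own bounded worker pool, so exhausting one pool cannot starve the others. Pool names and sizes are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Separate, bounded pools per function (hypothetical sizing): a flood of slow
# report jobs cannot consume the workers reserved for checkout.
POOLS = {
    "checkout": ThreadPoolExecutor(max_workers=50, thread_name_prefix="checkout"),
    "search":   ThreadPoolExecutor(max_workers=30, thread_name_prefix="search"),
    "reports":  ThreadPoolExecutor(max_workers=5,  thread_name_prefix="reports"),
}

def submit(kind: str, fn, *args):
    """Route work to its own bulkhead instead of a single shared pool."""
    return POOLS[kind].submit(fn, *args)

print(submit("reports", sum, [1, 2, 3]).result())   # 6, without touching checkout's pool
```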
Conclusion
Knowing the breaking point allows you to:
- Plan capacity - when to scale before breaking
- Configure alerts - warn before degradation
- Prepare runbooks - know what to do
- Implement protection - fail gracefully
- Communicate risks - inform stakeholders
Every system has limits. The difference is in knowing them and being prepared.
It's not a matter of "if" it will break, but "when" and "how". Be prepared for both.
This article is part of the series on the OCTOPUS Performance Engineering methodology.