Every system has a limit. The question isn't "if" it will break, but "where" and "how". Knowing the breaking point lets you prepare the system to fail gracefully instead of catastrophically. This article shows you how to find, document, and use that information.
Systems don't fail randomly. They fail at specific points, in predictable ways.
What Is a Breaking Point?
Definition
Breaking point: the load or condition at which the system stops meeting its minimum functioning criteria.
Can be defined by:
- Error rate > X%
- Latency > X seconds
- Throughput drops below X
- Critical component fails
- Data corrupted
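As a rough illustration, these criteria can be expressed as an explicit check against measured metrics. The thresholds below are hypothetical placeholders, not values from the analysis later in this article.

```python
from dataclasses import dataclass

@dataclass
class Metrics:
    error_rate: float      # fraction of failed requests, e.g. 0.05 = 5%
    p95_latency_s: float   # 95th percentile latency in seconds
    throughput_rps: float  # successful requests per second

# Hypothetical minimum functioning criteria for an example service.
MAX_ERROR_RATE = 0.05
MAX_P95_LATENCY_S = 2.0
MIN_THROUGHPUT_RPS = 1500

def breaking_criteria_violations(m: Metrics) -> list[str]:
    """Return the criteria currently violated (empty list = still functioning)."""
    violations = []
    if m.error_rate > MAX_ERROR_RATE:
        violations.append(f"error rate {m.error_rate:.1%} > {MAX_ERROR_RATE:.0%}")
    if m.p95_latency_s > MAX_P95_LATENCY_S:
        violations.append(f"p95 latency {m.p95_latency_s:.1f}s > {MAX_P95_LATENCY_S:.1f}s")
    if m.throughput_rps < MIN_THROUGHPUT_RPS:
        violations.append(f"throughput {m.throughput_rps:.0f} < {MIN_THROUGHPUT_RPS} req/s")
    return violations

print(breaking_criteria_violations(Metrics(0.30, 12.0, 900)))
```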
Types of breaking points
Soft breaking point:
- Gradual degradation
- Some requests fail
- System partially functional
- Recovery possible without intervention
Hard breaking point:
- Catastrophic failure
- System unresponsive
- Requires manual intervention
- Possible data loss
Finding the Breaking Point
Progressive testing
Method:
1. Start at normal load
2. Increment gradually
3. Observe metrics continuously
4. Identify first sign of degradation
5. Continue until failure
6. Document each threshold
Example:
Step 1: 1000 req/s → OK
Step 2: 1500 req/s → OK
Step 3: 2000 req/s → Degradation (p95 rises)
Step 4: 2500 req/s → Error rate 5%
Step 5: 3000 req/s → Error rate 30%
Step 6: 3500 req/s → Widespread timeouts
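A minimal sketch of this stepping pattern is shown below. It assumes a hypothetical local endpoint and uses only the standard library; a real test would use a dedicated load tool (k6, Gatling, Locust, etc.), since a single Python process cannot reliably sustain thousands of requests per second.

```python
import statistics
import time
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor

TARGET = "http://localhost:8080/health"      # hypothetical endpoint
STEPS_RPS = [1000, 1500, 2000, 2500, 3000, 3500]
STEP_DURATION_S = 60

def one_request() -> tuple[bool, float]:
    """Issue one request; return (success, latency in seconds)."""
    start = time.perf_counter()
    try:
        urllib.request.urlopen(TARGET, timeout=5).close()
        ok = True
    except (urllib.error.URLError, TimeoutError):
        ok = False                           # 4xx/5xx raise HTTPError, a URLError subclass
    return ok, time.perf_counter() - start

def run_step(rate_rps: int) -> None:
    """Open-loop pacing: keep submitting at the target rate even if responses lag."""
    interval = 1.0 / rate_rps
    deadline = time.monotonic() + STEP_DURATION_S
    with ThreadPoolExecutor(max_workers=200) as pool:
        futures = []
        while time.monotonic() < deadline:
            futures.append(pool.submit(one_request))
            time.sleep(interval)
        results = [f.result() for f in futures]
    error_rate = sum(1 for ok, _ in results if not ok) / max(len(results), 1)
    p95 = statistics.quantiles([lat for _, lat in results], n=20)[18]
    print(f"{rate_rps} req/s -> error rate {error_rate:.1%}, p95 {p95 * 1000:.0f} ms")

for rate in STEPS_RPS:
    run_step(rate)
```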
Identifying components
For each step, identify:
- Which component saturated?
- Which metric degraded first?
- How did it affect other components?
Example:
At 2500 req/s:
- DB connection pool: 100%
- Requests queue up
- Timeout in service A
- Service B receives timeout from A
- Cascade of failures
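One way to capture this during the test is to poll each candidate component and record the first one to cross its saturation threshold. The pollers below are placeholders; in practice they would query your monitoring stack (Prometheus, CloudWatch, pg_stat_activity, etc.).

```python
import time

# Placeholder pollers returning utilization in [0, 1]; real versions would
# query the monitoring system.
def db_pool_utilization() -> float: return 0.0
def cpu_utilization() -> float: return 0.0
def queue_depth_ratio() -> float: return 0.0

SATURATION_THRESHOLDS = {
    "db_connection_pool": (db_pool_utilization, 0.95),
    "cpu":                (cpu_utilization, 0.90),
    "request_queue":      (queue_depth_ratio, 0.80),
}

def first_saturated_component(duration_s: float = 60.0,
                              poll_interval_s: float = 1.0) -> str | None:
    """Poll components during a load step; return the first to saturate, if any."""
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        for name, (poll, threshold) in SATURATION_THRESHOLDS.items():
            if poll() >= threshold:
                return name
        time.sleep(poll_interval_s)
    return None
```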
Failure Modes
1. Graceful Degradation
Behavior:
- Performance degrades gradually
- Non-critical features stop
- Core continues working
- Recovers automatically when load drops
Example:
"Under 3x load, search is slow (5s),
but checkout continues working (1s)"
Implementation:
- Request prioritization
- Shedding of non-critical load
- Circuit breakers
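A minimal sketch of load shedding driven by the measured breaking point: once load approaches the (assumed) soft breaking point, non-critical features are switched off while the core path keeps running. Feature names and thresholds are hypothetical.

```python
SOFT_BREAKING_POINT_RPS = 2500            # assumed, from a breaking-point analysis
DEGRADE_AT_RPS = 0.8 * SOFT_BREAKING_POINT_RPS

# Hypothetical features that can be shed without breaking the core flow.
NON_CRITICAL_FEATURES = {"recommendations", "search_suggestions", "analytics_beacon"}

def feature_enabled(feature: str, current_rps: float) -> bool:
    """Shed non-critical features as load approaches the soft breaking point."""
    if current_rps < DEGRADE_AT_RPS:
        return True
    return feature not in NON_CRITICAL_FEATURES

# At 2300 req/s, recommendations are shed but checkout keeps working.
print(feature_enabled("recommendations", 2300))   # False
print(feature_enabled("checkout", 2300))          # True
```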
2. Graceful Failure
Behavior:
- System recognizes overload
- Refuses new requests (503)
- Completes in-progress requests
- Doesn't corrupt data
Example:
"Above 5000 req/s, returns 503 for
new requests but completes current ones"
Implementation:
- Rate limiting
- Admission control
- Queue limits
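A minimal sketch of admission control: a fixed number of in-flight slots, sized from the measured capacity, with new requests rejected immediately (503 plus Retry-After) instead of queueing until they time out. The handler and limit are hypothetical.

```python
import threading
from http import HTTPStatus

MAX_IN_FLIGHT = 500                       # hypothetical, sized from measured capacity
_slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)

def handle(request) -> tuple[int, dict, str]:
    """Refuse new work when saturated; finish work that was already admitted."""
    if not _slots.acquire(blocking=False):
        return (HTTPStatus.SERVICE_UNAVAILABLE, {"Retry-After": "5"}, "overloaded")
    try:
        return (HTTPStatus.OK, {}, process(request))
    finally:
        _slots.release()

def process(request) -> str:              # placeholder for the real business logic
    return "ok"
```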
3. Cascading Failure
Behavior:
- One component fails
- Dependents become overloaded
- Failure propagates
- Entire system goes down
Example:
"DB gets slow → API timeout →
Clients retry → DB worse →
All services fail"
Prevention:
- Circuit breakers
- Aggressive timeouts
- Retry with backoff
- Failure isolation
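Retries are part of the problem in this mode: naive retry loops multiply load on a dependency that is already struggling. Below is a sketch of capped, jittered exponential backoff, which keeps retries from turning a slowdown into a storm.

```python
import random
import time

def call_with_backoff(fn, max_attempts: int = 4,
                      base_delay_s: float = 0.1, max_delay_s: float = 2.0):
    """Retry a flaky call with capped, jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise                                  # give up instead of retrying forever
            delay = min(max_delay_s, base_delay_s * 2 ** attempt)
            time.sleep(random.uniform(0, delay))       # "full jitter" spreads retries out

# Usage: call_with_backoff(lambda: db.query("..."))   # hypothetical call
```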
4. Byzantine Failure
Behavior:
- Inconsistent behavior
- Partially correct responses
- Hard to detect
- Possible data corruption
Example:
"Under stress, cache returns stale
data mixed with current"
Prevention:
- Data validation
- Deep health checks
- Strict timeouts
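One concrete defense for the stale-cache example: validate freshness on read and treat suspiciously old entries as misses rather than serving them. The staleness budget and loader below are assumptions.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable

MAX_STALENESS_S = 30.0                    # assumed freshness budget for this data

@dataclass
class CacheEntry:
    value: Any
    written_at: float                     # time.time() at write

def read_validated(cache: dict, key: str, load_from_source: Callable[[str], Any]) -> Any:
    """Serve cached data only while it is provably fresh; otherwise reload it."""
    entry = cache.get(key)
    now = time.time()
    if entry is not None and now - entry.written_at <= MAX_STALENESS_S:
        return entry.value
    value = load_from_source(key)         # fall back to the source of truth
    cache[key] = CacheEntry(value, now)
    return value
```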
Documenting Breaking Points
Documentation template
# Breaking Point Analysis - System XYZ
## Executive Summary
- Maximum sustainable capacity: 2000 req/s
- Soft breaking point: 2500 req/s
- Hard breaking point: 3500 req/s
- First bottleneck: DB connection pool
## Breakdown by Load
### 2000 req/s (Maximum Sustainable Capacity)
| Metric | Value | Status |
|--------|-------|--------|
| p95 Latency | 250ms | ✓ OK |
| Error Rate | 0.3% | ✓ OK |
| CPU | 75% | ✓ OK |
| DB Connections | 80% | ⚠ Warning |
**Observation**: Stable operation, limited headroom
### 2500 req/s (Soft Breaking Point)
| Metric | Value | Status |
|--------|-------|--------|
| p95 Latency | 800ms | ⚠ Degraded |
| Error Rate | 3% | ⚠ Degraded |
| CPU | 85% | ✓ OK |
| DB Connections | 100% | ❌ Saturated |
**Observation**: DB connection pool saturated.
Requests queue up. System still functional.
### 3500 req/s (Hard Breaking Point)
| Metric | Value | Status |
|--------|-------|--------|
| p95 Latency | >30s | ❌ Failed |
| Error Rate | 45% | ❌ Failed |
| CPU | 95% | ⚠ High |
| DB Connections | 100% | ❌ Saturated |
**Observation**: Timeout cascade.
System non-functional for most users.
## Failure Sequence
1. DB connection pool reaches 100%
2. Requests wait for connection (latency rises)
3. Timeouts start after 5s
4. Clients retry (load increases)
5. More timeouts, more retries
6. Cascade effect until widespread failure
## Critical Component
- **Component**: PostgreSQL connection pool
- **Current limit**: 100 connections
- **Suggestion**: Increase to 200 or implement PgBouncer
## Observed Failure Mode
- **Type**: Cascading failure
- **Recovery**: Manual (restart needed above 4000 req/s)
- **Recovery time**: ~5 minutes after load reduction
## Recommendations
1. Implement circuit breaker for DB
2. Increase connection pool or use pooler
3. Add rate limiting at 2200 req/s
4. Alert on DB connections > 70%
Using the Information
For capacity planning
Given:
- Soft breaking point: 2500 req/s
- Expected growth: 50% per year
- Current load: 1000 req/s
Calculation:
- Current headroom: 2.5x
- In 1 year: 1500 req/s (1.67x headroom)
- In 2 years: 2250 req/s (1.1x headroom) ⚠
Action:
Plan scaling or optimization within the next 18 months
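The arithmetic behind that conclusion, as a small sketch using the numbers above: at 50% annual growth, load reaches the soft breaking point in roughly 2.3 years, so acting around the 18-month mark leaves a safety margin.

```python
import math

current_rps = 1000
soft_breaking_rps = 2500
annual_growth = 0.50                      # 50% per year

# current * (1 + growth)^t = soft_breaking  =>  t = ln(ratio) / ln(1 + growth)
years_to_soft_break = math.log(soft_breaking_rps / current_rps) / math.log(1 + annual_growth)
print(f"~{years_to_soft_break:.1f} years (~{years_to_soft_break * 12:.0f} months) until the soft breaking point")

for year in (0, 1, 2):
    load = current_rps * (1 + annual_growth) ** year
    print(f"year {year}: {load:.0f} req/s, headroom {soft_breaking_rps / load:.2f}x")
```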
For alerts
Alerts based on breaking point:
Warning (≈70% of the soft breaking point):
- DB connections > 70%
- p95 > 200ms
- CPU > 60%
Critical (≈90% of the soft breaking point):
- DB connections > 90%
- p95 > 500ms
- Error rate > 1%
Emergency (hard breaking point imminent):
- DB connections = 100%
- p95 > 2s
- Error rate > 10%
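Load-based tiers like these can be derived directly from the measured breaking point rather than guessed. A minimal sketch, assuming the 2500 req/s soft breaking point from the example above:

```python
SOFT_BREAKING_RPS = 2500                  # from the breaking-point analysis above

ALERT_TIERS = {                           # ordered from lowest to highest threshold
    "warning":   0.70 * SOFT_BREAKING_RPS,   # 1750 req/s
    "critical":  0.90 * SOFT_BREAKING_RPS,   # 2250 req/s
    "emergency": 1.00 * SOFT_BREAKING_RPS,   # at or beyond the soft breaking point
}

def alert_tier(current_rps: float) -> str | None:
    """Return the highest alert tier the current load has crossed, if any."""
    tier = None
    for name, threshold in ALERT_TIERS.items():
        if current_rps >= threshold:
            tier = name
    return tier

print(alert_tier(1800))   # "warning"
print(alert_tier(2600))   # "emergency"
```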
For runbooks
## Runbook: System under stress
### Indicators
- Error rate above 2%
- Latency p95 above 1s
- DB connections above 90%
### Immediate Actions
1. Check traffic (real spike or attack?)
2. If legitimate spike:
- Activate rate limiting
- Scale horizontally if possible
3. If attack:
- Activate WAF rules
- Block suspicious IPs
### Mitigation
- Disable non-critical features
- Reduce batch sizes
- Increase timeout (carefully)
### Escalation
- If not resolved in 10 min → On-call SRE
- If degradation > 50% → Incident commander
Preventing Catastrophic Failures
Protection mechanisms
Rate Limiting:
- Limit requests per client
- Prioritize by request type
- Configure limits at ~80% of measured capacity (see the sketch below)
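A minimal token-bucket sketch for this, with the rate set at ~80% of the 2500 req/s soft breaking point from the example analysis; per-client keying and distributed state are left out.

```python
import threading
import time

class TokenBucket:
    """Thread-safe token bucket: allow() refills lazily and spends one token per request."""

    def __init__(self, rate_per_s: float, burst: int):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = float(burst)
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def allow(self) -> bool:
        with self.lock:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

limiter = TokenBucket(rate_per_s=2000, burst=200)   # ~80% of a 2500 req/s soft breaking point
print(limiter.allow())                              # True until the bucket drains
```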
Circuit Breakers:
- Per external dependency
- Fail fast when a dependency is degraded
- Auto-reset after a cooldown period
Load Shedding:
- Reject requests when saturated
- Prioritize critical requests
- Return 503 with Retry-After
Bulkheads:
- Isolate critical components
- Separate pools by function
- Prevent failure propagation
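A minimal bulkhead sketch: each function gets its own bounded worker pool, so exhausting one pool cannot starve the others. Pool names and sizes are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Separate, bounded pools per function (hypothetical sizing): a flood of slow
# report jobs cannot consume the workers reserved for checkout.
POOLS = {
    "checkout": ThreadPoolExecutor(max_workers=50, thread_name_prefix="checkout"),
    "search":   ThreadPoolExecutor(max_workers=30, thread_name_prefix="search"),
    "reports":  ThreadPoolExecutor(max_workers=5,  thread_name_prefix="reports"),
}

def submit(kind: str, fn, *args):
    """Route work to its own bulkhead instead of a single shared pool."""
    return POOLS[kind].submit(fn, *args)

print(submit("reports", sum, [1, 2, 3]).result())   # 6, without touching checkout's pool
```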
Conclusion
Knowing the breaking point allows you to:
- Plan capacity - when to scale before breaking
- Configure alerts - warn before degradation
- Prepare runbooks - know what to do
- Implement protection - fail gracefully
- Communicate risks - inform stakeholders
Every system has limits. The difference is in knowing them and being prepared.
It's not a matter of "if" it will break, but "when" and "how". Be prepared for both.
This article is part of the series on the OCTOPUS Performance Engineering methodology.