Performance engineering isn't something you do once before launch. It's a continuous discipline, integrated into the development cycle, constantly monitored, and iteratively improved.
Performance is a marathon, not a sprint. Sustainable gains come from consistent practices, not heroic one-time optimizations.
The Traditional Model (and its problems)
Problematic cycle
1. Develop features (months)
2. "Performance week" before release
3. Discover serious problems
4. Rush to fix
5. Launch with performance debt
6. Fight fires in production
7. Repeat
Why it fails
- Problems discovered too late
- Architecture already solidified
- Deadline pressure
- Fixes are patches, not solutions
- Team exhausted from firefighting
The Continuous Model
Healthy cycle
Each commit:
→ Automated performance tests
→ Comparison with baseline
→ Quality gate
Each deploy:
→ Canary with metrics
→ A/B comparison
→ Automatic rollback on degradation
Constantly:
→ SLO monitoring
→ Proactive alerts
→ Production profiling
→ Trend review
Implementing Continuous Performance
1. Performance in CI/CD
# .github/workflows/perf.yml
name: Performance Pipeline

on: [push, pull_request]

jobs:
  unit-perf:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Benchmark Tests
        run: |
          pytest tests/benchmarks/ \
            --benchmark-json=results.json

      - name: Compare with Baseline
        run: |
          python scripts/compare.py \
            --current results.json \
            --baseline baseline.json \
            --threshold 10

  load-test:
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - name: Deploy to Staging
        run: ./deploy-staging.sh

      - name: Load Test
        run: |
          k6 run tests/load.js \
            --out json=load-results.json

      - name: Validate SLOs
        run: |
          python scripts/validate-slos.py \
            --results load-results.json
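The workflow leans on scripts/compare.py, which isn't shown above. Here is a minimal sketch of what such a gate script could look like, assuming pytest-benchmark's JSON output format; the argument names mirror the workflow, everything else is illustrative.

#!/usr/bin/env python3
"""Sketch of a baseline-comparison gate for pytest-benchmark JSON results.
Fails the build when any benchmark's mean time regresses more than --threshold percent."""
import argparse
import json
import sys


def mean_times(path):
    with open(path) as f:
        data = json.load(f)
    # pytest-benchmark stores one entry per benchmark, with aggregate timing stats
    return {b["fullname"]: b["stats"]["mean"] for b in data["benchmarks"]}


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--current", required=True)
    parser.add_argument("--baseline", required=True)
    parser.add_argument("--threshold", type=float, default=10.0,
                        help="max allowed regression, in percent")
    args = parser.parse_args()

    current, baseline = mean_times(args.current), mean_times(args.baseline)
    failures = []
    for name, base_mean in baseline.items():
        cur_mean = current.get(name)
        if cur_mean is None:
            continue  # benchmark renamed or removed; decide your own policy here
        regression_pct = (cur_mean - base_mean) / base_mean * 100
        if regression_pct > args.threshold:
            failures.append(f"{name}: {regression_pct:.1f}% slower than baseline")

    if failures:
        print("Performance gate failed:")
        print("\n".join(failures))
        sys.exit(1)
    print("All benchmarks within threshold.")


if __name__ == "__main__":
    main()

The baseline file can live in the repository and be refreshed from main after each accepted change, so the gate always compares against the last known-good numbers.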
2. Canary Deployments
# Argo Rollouts
apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: {duration: 10m}
        - analysis:
            templates:
              - templateName: latency-check
        - setWeight: 20
        - pause: {duration: 10m}
        - analysis:
            templates:
              - templateName: error-rate-check
        - setWeight: 50
        - pause: {duration: 10m}
        - setWeight: 100
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: latency-check
spec:
  metrics:
    - name: latency-p99
      successCondition: result[0] < 0.2
      provider:
        prometheus:
          address: http://prometheus:9090  # Prometheus endpoint, required by the provider
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{canary="true"}[5m])) by (le)
            )
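For teams not yet on Argo Rollouts (or for ad-hoc checks during a rollout), the same canary-versus-stable comparison can be done by hand against the Prometheus HTTP API. A rough sketch; the Prometheus address and the canary label values are assumptions:

"""Compare canary vs. stable p99 latency via the Prometheus HTTP API."""
import requests

PROM_URL = "http://prometheus:9090"  # assumed Prometheus endpoint

P99_QUERY = (
    'histogram_quantile(0.99, '
    'sum(rate(http_request_duration_seconds_bucket{{canary="{canary}"}}[5m])) by (le))'
)


def p99_seconds(canary: str) -> float:
    resp = requests.get(
        f"{PROM_URL}/api/v1/query",
        params={"query": P99_QUERY.format(canary=canary)},
        timeout=10,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else float("nan")


if __name__ == "__main__":
    canary_p99, stable_p99 = p99_seconds("true"), p99_seconds("false")
    print(f"canary p99={canary_p99:.3f}s  stable p99={stable_p99:.3f}s")
    # Same gate the latency-check template encodes: canary p99 must stay under 200 ms
    if canary_p99 >= 0.2:
        raise SystemExit("canary latency above threshold; roll back")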
3. SLO Monitoring
# SLO definitions
slos:
  - name: API Availability
    target: 99.9%
    window: 30d
    indicator:
      good: sum(rate(http_requests_total{status=~"2.."}[5m]))
      total: sum(rate(http_requests_total[5m]))

  - name: API Latency
    target: 95%
    window: 30d
    indicator:
      good: |
        sum(rate(http_request_duration_seconds_bucket{le="0.2"}[5m]))
      total: sum(rate(http_requests_total[5m]))

# Error-budget alerts
alerts:
  - name: Error Budget Low
    expr: |
      (
        slo_error_budget_remaining /
        slo_error_budget_total
      ) < 0.5
    severity: warning

  - name: Error Budget Critical
    expr: |
      (
        slo_error_budget_remaining /
        slo_error_budget_total
      ) < 0.2
    severity: critical
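The alerts assume recording rules that expose the remaining and total error budget. The arithmetic behind them is simple enough to sanity-check by hand; a sketch with placeholder request counts:

"""Back-of-the-envelope error-budget math behind the alerts above.
Request counts are placeholders; in practice they come from range queries over the SLO window."""

def error_budget_remaining(good: int, total: int, target: float) -> float:
    """Fraction of the error budget still unspent over the SLO window."""
    allowed_bad = total * (1 - target)  # the whole budget, in requests
    actual_bad = total - good           # budget already consumed
    if allowed_bad == 0:
        return 0.0
    return max(0.0, (allowed_bad - actual_bad) / allowed_bad)


if __name__ == "__main__":
    # Example: 99.9% availability target, 10M requests in the 30-day window, 6,000 failures
    remaining = error_budget_remaining(good=9_994_000, total=10_000_000, target=0.999)
    print(f"error budget remaining: {remaining:.0%}")  # -> 40%
    if remaining < 0.2:
        print("critical: page someone")
    elif remaining < 0.5:
        print("warning: investigate before it becomes critical")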
4. Continuous Profiling
# Production profiling with the Pyroscope agent
import pyroscope

pyroscope.configure(
    application_name="my-app",
    server_address="http://pyroscope:4040",
    sample_rate=100,  # profiler sampling frequency in Hz (100 is the default)
)

# Tag each request's profile so hotspots can be sliced by endpoint and user type
def handle_with_profiling(request, user):
    with pyroscope.tag_wrapper({
        "endpoint": request.path,
        "user_type": user.type,
    }):
        return handle_request(request)
Daily Practices
Performance-focused Code Review
## PR Review Checklist
### Queries
- [ ] EXPLAIN run for new/modified queries
- [ ] No N+1 (check includes/prefetch)
- [ ] Appropriate indexes
### Algorithms
- [ ] O() complexity acceptable
- [ ] No I/O calls inside loops
### Resources
- [ ] Connections/file handles are released
- [ ] Buffers have limits
- [ ] Timeouts configured
### Tests
- [ ] Benchmark test included for critical paths (see the sketch below)
- [ ] Load test updated if needed
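A benchmark test for a critical path can be as small as this pytest-benchmark sketch (the calculate_tax import and its input are illustrative); its JSON output is what the CI gate above compares against the baseline.

"""Minimal benchmark test for a critical code path, using pytest-benchmark."""
from myapp.orders import calculate_tax  # hypothetical module under test


def test_calculate_tax_benchmark(benchmark):
    order = {"items": [{"price": 19.99, "quantity": 3}], "region": "CA"}
    # pytest-benchmark runs the callable repeatedly and records timing statistics
    result = benchmark(calculate_tax, order)
    assert result > 0  # still assert correctness, not just speed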
Performance Budget
# Budget per endpoint
endpoints:
  /api/products:
    latency_p99: 100ms
    throughput_min: 1000rps
  /api/checkout:
    latency_p99: 200ms
    throughput_min: 500rps
  /api/search:
    latency_p99: 150ms
    throughput_min: 2000rps

# Automated validation
on_budget_exceed:
  - block_merge: true
  - notify: "#perf-alerts"
  - require_approval: "perf-team"
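The enforcement half is only declared in the config above. One possible shape for it is a small script that loads the budget file and compares it against the load-test results, failing the CI job and therefore blocking the merge. The file name, keys, and measured values here are assumptions:

"""Sketch of automated budget validation against measured load-test numbers."""
import sys
import yaml  # PyYAML

# In CI these would be parsed from the load-test output; hard-coded for illustration
measured = {
    "/api/products": {"latency_p99_ms": 92, "throughput_rps": 1250},
    "/api/checkout": {"latency_p99_ms": 240, "throughput_rps": 610},
}


def violations(budget_path="perf-budget.yml"):
    with open(budget_path) as f:
        budget = yaml.safe_load(f)["endpoints"]
    problems = []
    for endpoint, limits in budget.items():
        actual = measured.get(endpoint)
        if actual is None:
            continue  # no measurement for this endpoint in this run
        p99_limit = float(limits["latency_p99"].rstrip("ms"))
        rps_floor = float(limits["throughput_min"].rstrip("rps"))
        if actual["latency_p99_ms"] > p99_limit:
            problems.append(f"{endpoint}: p99 {actual['latency_p99_ms']}ms exceeds budget {p99_limit:g}ms")
        if actual["throughput_rps"] < rps_floor:
            problems.append(f"{endpoint}: {actual['throughput_rps']}rps below budget {rps_floor:g}rps")
    return problems


if __name__ == "__main__":
    found = violations()
    if found:
        print("\n".join(found))
        sys.exit(1)  # a failed job is what actually blocks the merge
    print("All endpoints within budget.")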
Weekly Performance Review
## Performance Review - Week 42
### Metrics vs SLOs
| SLO | Target | Actual | Status |
|-----|--------|--------|--------|
| Availability | 99.9% | 99.95% | ✅ |
| Latency p99 | 200ms | 185ms | ✅ |
| Error Rate | 0.1% | 0.08% | ✅ |
### Trends
- p99 latency increased 5% vs last week
- Throughput stable
- Memory usage trending up (investigate)
### Top Hotspots (via continuous profiling)
1. OrderService.calculateTax - 15% CPU
2. ProductRepository.search - 12% CPU
3. JSON serialization - 8% CPU
### Actions
- [ ] Investigate memory trend
- [ ] Optimize calculateTax
- [ ] Index review for search
### Incidents
- No performance incidents
Ecosystem Tools
Pipeline
CI/CD:
- GitHub Actions / GitLab CI
- Argo CD / Flux
Load Testing:
- k6
- Gatling
- Locust
Canary:
- Argo Rollouts
- Flagger
- Spinnaker
Observability
Metrics:
- Prometheus
- Datadog
- New Relic
Tracing:
- Jaeger
- Tempo
- Zipkin
Profiling:
- Pyroscope
- Datadog Continuous Profiler
- Parca
Dashboards:
- Grafana
- Datadog
Analysis
SLO Management:
- Nobl9
- Datadog SLOs
- Google SLO Generator
Alerting:
- Prometheus Alertmanager
- PagerDuty
- Opsgenie
Program Success Metrics
Program Maturity:
Level 1 - Ad-hoc:
- No performance tests
- Problems discovered in production
- Reactive
Level 2 - Basic:
- Manual tests before releases
- Basic monitoring
- Alerts exist
Level 3 - Integrated:
- Automated tests in CI
- SLOs defined
- Canary deployments
Level 4 - Proactive:
- Continuous profiling
- Performance budgets
- Trend analysis
Level 5 - Optimized:
- Performance culture across entire team
- Self-service process
- Measurable continuous improvement
Program KPIs
Process:
- % of PRs with perf test: > 90%
- Lead time for perf fix: < 1 day
- Regressions detected before prod: > 95%
Quality:
- SLO compliance: > 99%
- Perf incidents/month: < 1
- Error budget used: < 50%
Efficiency:
- Cost per request: downward trend
- Throughput per $: upward trend
- Time spent on perf work: < 20%
Conclusion
Continuous performance engineering transforms performance from:
Problem → Discipline
Reactive → Proactive
Event → Process
Individual → Entire team
To get started:
- Week 1: Add basic metrics
- Week 2: Define initial SLOs
- Month 1: Integrate tests in CI
- Month 2: Implement canary
- Month 3: Continuous profiling
- Ongoing: Refine and improve
The goal isn't immediate perfection, but measurable continuous improvement.
Continuous performance engineering is like hygiene: it's not a special event, it's a daily practice.