"The data shows we need to act." But do what, exactly? Performance data is abundant, but transforming it into decisions is a rare skill. This article teaches how to use data to guide choices, not just generate reports.
Data informs decisions. It doesn't make them.
The Gap Between Data and Decision
Data that doesn't become action
Common scenario:
- Dashboard: "p99 latency = 2s"
- Meeting: "Interesting."
- Action: None
Why:
- Data not connected to business impact
- No threshold defined
- No clear owner
- No obvious next step
Data that becomes action
Effective scenario:
- Dashboard: "p99 = 2s (SLO: 1s) - affects 5% of checkouts"
- Meeting: "We're 2x above SLO"
- Action: "DB optimization sprint"
Difference:
- Data connected to SLO
- Impact quantified
- Owner identified
- Clear action
Decision Framework
1. Connect data to objective
Raw data:
"CPU at 85%"
With context:
"CPU at 85% (target: <70% for headroom)
Risk: Next spike may cause degradation"
With decision:
"Options:
A) Scale now ($X/month)
B) Optimize code (Y sprints)
C) Accept risk (probability Z%)"
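To make the progression concrete, here is a minimal Python sketch of a metric record that only becomes decision-ready once it carries a target, a risk statement, and costed options. The class and field names are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass, field

@dataclass
class DecisionReadyMetric:
    name: str                    # raw data
    value: str
    target: str                  # context: objective and headroom
    risk: str                    # context: what happens if we do nothing
    options: list[str] = field(default_factory=list)  # the actual decision

cpu = DecisionReadyMetric(
    name="CPU utilization",
    value="85%",
    target="<70% for headroom",
    risk="next traffic spike may cause degradation",
    options=["A) scale now", "B) optimize code", "C) accept the risk"],
)
```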
2. Define action thresholds
For each critical metric, define action bands (a code sketch follows the tables):
Checkout p95 Latency:
- Green: < 500ms → No action
- Yellow: 500-800ms → Investigate within 48h
- Red: > 800ms → Immediate action
- Critical: > 2s → Declare an incident
Error Rate:
- Green: < 0.5% → No action
- Yellow: 0.5-1% → Investigate
- Red: > 1% → Immediate action
- Critical: > 5% → Declare an incident
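These bands are most useful when they live in one place that alerting, dashboards, and runbooks all read. A minimal Python sketch, assuming the band values from the tables above; the metric keys and action strings are illustrative, not a real schema:

```python
# Threshold bands as data: each entry is (upper bound, severity, action).
THRESHOLDS = {
    "checkout_p95_ms": [
        (500, "green", "no action"),
        (800, "yellow", "investigate within 48h"),
        (2000, "red", "immediate action"),
        (float("inf"), "critical", "declare an incident"),
    ],
    "error_rate_pct": [
        (0.5, "green", "no action"),
        (1.0, "yellow", "investigate"),
        (5.0, "red", "immediate action"),
        (float("inf"), "critical", "declare an incident"),
    ],
}

def classify(metric: str, value: float) -> tuple[str, str]:
    """Return (severity, action) for a metric reading."""
    for upper, severity, action in THRESHOLDS[metric]:
        if value < upper:
            return severity, action
    raise ValueError("unreachable: the last band is unbounded")

print(classify("checkout_p95_ms", 650))  # ('yellow', 'investigate within 48h')
```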
3. Map standard decisions
If [condition], then [action], owner [who]
Examples (encoded as a rule table in the sketch below):
If p95 > SLO for 2 days:
→ Create investigation ticket
→ Owner: Tech Lead
→ Deadline: 5 business days
If CPU > 80% for 1 hour:
→ Alert to on-call
→ Evaluate auto-scaling
→ Owner: SRE
If error rate > 1% for 15 min:
→ Automatic incident
→ Owner: On-call
→ Communication: Slack #incidents
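Writing these rules as data keeps them reviewable and testable like any other code. A minimal sketch, assuming a flat metrics dictionary; the condition lambdas, field names, and metric keys are hypothetical:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class DecisionRule:
    condition: Callable[[dict], bool]  # when does this rule fire?
    action: str                        # what to do
    owner: str                         # who is responsible
    deadline: str                      # by when

RULES = [
    DecisionRule(lambda m: m["p95_over_slo_days"] >= 2,
                 "create investigation ticket", "Tech Lead", "5 business days"),
    DecisionRule(lambda m: m["cpu_pct"] > 80 and m["cpu_high_minutes"] >= 60,
                 "alert on-call; evaluate auto-scaling", "SRE", "immediate"),
    DecisionRule(lambda m: m["error_rate_pct"] > 1 and m["error_high_minutes"] >= 15,
                 "open incident; post to #incidents", "On-call", "immediate"),
]

def evaluate(metrics: dict) -> list[DecisionRule]:
    """Return every rule whose condition holds for the current metrics."""
    return [rule for rule in RULES if rule.condition(metrics)]
```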
Types of Decisions
Operational decisions (minutes)
Trigger: Threshold alert
Data: Real-time metrics
Decider: On-call / automation
Examples:
- Auto-scale
- Activate circuit breaker
- Redirect traffic
- Roll back a deploy
Framework:
- If X, then Y automatically
- If not resolved in Z minutes, escalate (see the sketch below)
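A minimal sketch of that loop, assuming hypothetical callables for the metric source, the automatic remediation, and paging (get_p99_ms, scale_out, and page_oncall are placeholders, not a real API):

```python
import time

THRESHOLD_MS = 800      # "X": the trigger condition
ESCALATE_AFTER_S = 600  # "Z": escalate after 10 minutes unresolved

def handle_latency_alert(get_p99_ms, scale_out, page_oncall):
    scale_out()                          # "Y": the automatic first response
    deadline = time.monotonic() + ESCALATE_AFTER_S
    while time.monotonic() < deadline:
        if get_p99_ms() < THRESHOLD_MS:
            return "resolved automatically"
        time.sleep(30)                   # re-check every 30 seconds
    page_oncall("p99 still above threshold after auto-scaling")
    return "escalated to on-call"
```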
Tactical decisions (days/weeks)
Trigger: Degradation trend / SLO gap
Data: Aggregated metrics, root cause analysis
Decider: Tech Lead / Engineering Manager
Examples:
- Prioritize optimization in sprint
- Add cache for endpoint
- Refactor problematic query
- Increase infra capacity
Framework:
- Cost-benefit analysis (sketched below)
- Backlog prioritization
- Implementation timeline
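The cost-benefit step often reduces to an annualized comparison. A sketch with entirely hypothetical numbers (infrastructure at $2K/month, a sprint costed at $8K of engineering time):

```python
# Annualized comparison of two tactical options; all figures are made up.
options = {
    "scale infrastructure": {"cost": 12 * 2_000, "benefit": 60_000},  # $/year
    "optimize the query":   {"cost": 2 * 8_000,  "benefit": 55_000},  # $/year
}

for name, o in options.items():
    net = o["benefit"] - o["cost"]
    roi = net / o["cost"]
    print(f"{name}: net ${net:,}/year, ROI {roi:.0%}")
# scale infrastructure: net $36,000/year, ROI 150%
# optimize the query: net $39,000/year, ROI 244%
```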
Strategic decisions (months)
Trigger: Capacity planning / Roadmap
Data: Long-term trends, projections
Decider: VP Eng / CTO
Examples:
- Migrate to different architecture
- Invest in observability platform
- Hire SRE team
- Change cloud provider
Framework:
- Business case with ROI
- Analysis of alternatives
- Implementation roadmap
Translating Data for Stakeholders
For engineering
Detailed technical data:
- Latency percentiles by endpoint
- Time breakdown by component
- Resource utilization
- Identified correlations
Clear decision:
"Query X accounts for 40% of latency.
Adding index fixes in 2 hours.
Prioritize?"
For product
Experience data:
- Load time by feature
- Error rate by flow
- Impact on conversion funnel
Clear decision:
"Slow checkout causing 5% abandonment.
Investing 1 sprint improves conversion by ~1%.
Value: $X/month. Prioritize?"
For executives
Impact data:
- Revenue at risk
- Cost of inaction
- Investment ROI
Clear decision:
"Current capacity won't support Black Friday.
Option A: $50K to guarantee.
Option B: Risk $200K in lost sales.
Decision needed by [date]."
Documenting Decisions
Decision template
# Decision: [Title]
## Context
- Date: [when]
- Trigger: [what motivated it]
- Data: [relevant metrics]
## Problem
[Clear description of the problem]
## Options Considered
### Option A: [Name]
- Description: [what to do]
- Cost: [time, money, effort]
- Benefit: [expected result]
- Risk: [what can go wrong]
### Option B: [Name]
[same structure]
## Decision
- Chosen: [which option]
- Reason: [why]
- Owner: [who executes]
- Deadline: [when]
## Success Metrics
- [How we'll know if it worked]
## Review
- Date: [when to reevaluate]
Decision record (ADR)
Maintain a history of:
- Decisions made
- Context at the time
- Result obtained
- Lessons learned
Benefits:
- Avoid repeating mistakes
- Document reasoning
- Facilitate onboarding
- Create knowledge base
Decision-Making Pitfalls
1. Analysis paralysis
❌ "We need more data before deciding"
(while the problem persists)
✅ "Current data supports decision X with 80% confidence.
Risk of waiting: Y. Deciding now."
2. Decision without data
❌ "Let's add cache because it always helps"
✅ "Data shows cache hit rate of 30%.
Improving to 80% would reduce latency by 40%.
Cost: 2 days. Deciding to implement."
3. Confirmation bias
❌ Seeking data that supports already-made decision
✅ Analyze the data objectively, including
data that contradicts your hypothesis
4. Ignoring uncertainty
❌ "Data shows A is better"
(Without confidence interval)
✅ "Data suggests A is better (95% CI: 5-15% improvement).
Probability B is better: 20%."
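For conversion-style metrics, quantifying that uncertainty takes only a few lines. A sketch using the normal approximation for the difference of two proportions; the sample sizes and counts are made up:

```python
import math

def diff_ci(successes_a, n_a, successes_b, n_b, z=1.96):
    """95% CI for the difference in conversion rates (A - B),
    via the normal approximation (reasonable for large samples)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_a - p_b
    return diff - z * se, diff + z * se

# Hypothetical experiment: A converts 540/10,000; B converts 480/10,000.
low, high = diff_ci(540, 10_000, 480, 10_000)
print(f"A - B: 95% CI [{low:+.2%}, {high:+.2%}]")
# If the interval excludes 0, the data supports picking A;
# if it straddles 0, "A is better" is not yet supported.
```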
Automating Decisions
When to automate
Automate if:
- Decision is frequent
- Criteria are clear
- Risk of error is low
- Speed is important
Examples:
- Autoscaling based on CPU (sketched below)
- Circuit breaker based on error rate
- Automatic rollback by metric
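As a sketch of the first example: a proportional scaling decision, similar in spirit to the Kubernetes HPA formula (desired = ceil(current * observed / target)). The target utilization and replica bounds here are illustrative:

```python
import math

def desired_replicas(current: int, cpu_pct: float,
                     target_pct: float = 60.0,
                     min_r: int = 2, max_r: int = 20) -> int:
    """Scale replicas proportionally to observed vs. target utilization,
    clamped to safe bounds."""
    want = math.ceil(current * cpu_pct / target_pct)
    return max(min_r, min(max_r, want))

print(desired_replicas(current=4, cpu_pct=85.0))  # -> 6: scale out by 2
```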
When to keep human
Keep human decision if:
- Context is complex
- Trade-offs are unclear
- Impact is high and irreversible
- Non-technical factors matter
Examples:
- Architecture change
- Infrastructure investment
- Roadmap prioritization
Conclusion
Data-driven decisions require:
- Connecting data to objectives, not isolated numbers
- Defining thresholds: when to act
- Mapping actions: what to do when
- Translating for each audience: appropriate language
- Documenting decisions: creating knowledge
Data is the beginning, not the end. The value is in the action it informs.
Data without decision is cost. Data with decision is investment.
This article is part of the series on the OCTOPUS Performance Engineering methodology.