SLOs and error budgets
Defining service level objectives and tracking attainment, error budget, and burn rate from existing uptime checks and request traces.
Next: SLOs and error budgets, in a new Quality section.
"Is the app healthy?" is a vibe. An SLO turns it into a number you can argue with: 99.9% of requests succeed, measured over 30 days. Below that, you've spent your error budget and it's time to slow down and stabilize. Above it, you have budget to spend on shipping.
What I built
- ServiceLevelObjective: define an objective with an SLI (uptime availability, HTTP availability, or HTTP latency under a threshold), a target percentage, and a window.
- SloAttainmentEvaluator: computes attainment, error budget remaining, and burn rate from data already on hand: synthetic uptime checks and inbound request traces. No new collection: it reuses the uptime monitors and the http.server spans.
- Burn-rate alerting: when you're burning budget fast enough to blow the window, it fires through the alert engine as a new slo.burn_rate event. Notice the engine needed zero changes to support it.
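To make the three numbers concrete, here is a minimal sketch of how attainment, budget remaining, and burn rate can be derived from good/total event counts. It is not the app's actual code: `Slo`, `SloStatus`, `evaluate`, `should_alert`, and `days_until_budget_spent` are hypothetical names, the alert threshold of 1.0 is an assumption, and the counts are assumed to cover the SLO window so far.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Slo:
    """Hypothetical stand-in for ServiceLevelObjective (request-based SLI)."""
    target_pct: float  # e.g. 99.9
    window_days: int   # e.g. 30

    def __post_init__(self) -> None:
        # A 100% target leaves no budget and would divide by zero below.
        if not 0 < self.target_pct < 100:
            raise ValueError("target_pct must be between 0 and 100, exclusive")


@dataclass(frozen=True)
class SloStatus:
    attainment_pct: float    # share of good events so far
    budget_remaining: float  # 1.0 = untouched, 0.0 = spent, below 0 = overspent
    burn_rate: float         # 1.0 = on pace to spend exactly the budget


def evaluate(slo: Slo, good: int, total: int) -> SloStatus:
    """Compute attainment, budget remaining, and burn rate from event counts."""
    if total == 0:
        # Assumption: no traffic means nothing failed, so the budget is untouched.
        return SloStatus(100.0, 1.0, 0.0)
    allowed_error_rate = 1 - slo.target_pct / 100
    observed_error_rate = (total - good) / total
    burn_rate = observed_error_rate / allowed_error_rate
    return SloStatus(
        attainment_pct=100 * good / total,
        # Measured against the traffic seen so far in the window.
        budget_remaining=1 - burn_rate,
        burn_rate=burn_rate,
    )


def should_alert(status: SloStatus, threshold: float = 1.0) -> bool:
    """A sustained burn rate above 1.0 spends the budget before the window ends."""
    return status.burn_rate > threshold


def days_until_budget_spent(slo: Slo, status: SloStatus) -> Optional[float]:
    """Days a fresh budget would last if the current burn rate held steady."""
    if status.burn_rate == 0:
        return None
    return slo.window_days / status.burn_rate


slo = Slo(target_pct=99.9, window_days=30)
status = evaluate(slo, good=998_000, total=1_000_000)
print(status, should_alert(status), days_until_budget_spent(slo, status))
```

With 2,000 failures in a million requests against a 99.9% target, the burn rate comes out at roughly 2, which means a 30-day budget would last about 15 days at that pace. That is the condition a burn-rate alert is meant to catch.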
Error budget is the useful part
Attainment is a scoreboard. The error budget is a decision tool. "We're at 99.95% against a 99.9% target with 60% of the budget left" tells a team they can ship. "Budget's gone and we're four days into the window" tells them to freeze and fix. It reframes reliability from a binary (up/down) into a resource you manage — which is the whole point of SRE.
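Treating the budget as a resource is easier once it has a size. The sketch below works out what a 99.9% target over 30 days allows, both in failed requests and in minutes of downtime. The function names, the one-million-request volume, and the 400 failures spent are made up for illustration.

```python
def budget_in_requests(target_pct: float, expected_requests: int) -> float:
    """Failed requests the SLO tolerates over the whole window."""
    return (1 - target_pct / 100) * expected_requests


def budget_in_minutes(target_pct: float, window_days: int) -> float:
    """Downtime the SLO tolerates when the SLI is time-based (uptime checks)."""
    return (1 - target_pct / 100) * window_days * 24 * 60


def budget_left(allowed: float, spent: float) -> float:
    """Fraction of the budget still unspent; negative means overspent."""
    return 1 - spent / allowed


allowed = budget_in_requests(99.9, 1_000_000)
print(round(allowed))                         # 1000 failed requests
print(round(budget_in_minutes(99.9, 30), 1))  # 43.2 minutes of downtime
print(round(budget_left(allowed, 400), 2))    # 0.6, i.e. 60% of the budget left
```

Seen this way, every failed request or minute of downtime is a withdrawal from a fixed allowance, which is what makes the ship-or-freeze call a matter of arithmetic rather than opinion.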
Two more pieces rounded out this stretch: PII redaction and retention, plus anomaly detection.