Most status pages report uptime as a single proud number: 99.95% last month. It looks like a grade. It isn't one — it's a budget statement, and reading it as a grade is why so many teams either over-invest in reliability they don't need or get blindsided by the outage they didn't see coming.
The fix is small and changes how a whole team makes decisions: stop reporting the uptime you achieved, and start spending the error budget you're allowed.
An error budget is just the gap to 100%
Pick a target — a Service Level Objective. Say 99.9% of requests should succeed over 30 days. The error budget is everything left over:
error budget = 1 − SLO = 0.1% ≈ 43 minutes of downtime per 30 days
That 0.1% is not a failure waiting to happen. It's a resource you're meant to spend: on risky deploys, migrations, a chaos experiment, a Friday release. A month where you used none of your error budget doesn't mean you're winning; it often means you're shipping too slowly, or that your target is set looser than the reliability you actually deliver.
Translate your own target into real minutes — and a request budget — with the SLO & Error Budget Calculator. "Five nines" stops sounding aspirational the moment you see it's 26 seconds a month.
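The arithmetic behind that translation is short enough to sketch by hand. The following is a minimal illustration, not the calculator's actual implementation; the function name `error_budget` and the `expected_requests` parameter are hypothetical, and the request count of ten million is an invented example figure.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def error_budget(slo: float, window_days: float = 30.0, expected_requests: int = 0):
    """Return (allowed downtime in seconds, allowed failed requests) for an SLO.

    `slo` is a fraction, e.g. 0.999 for a 99.9% target.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    budget_fraction = 1.0 - slo  # error budget = 1 - SLO
    downtime_seconds = budget_fraction * window_days * SECONDS_PER_DAY
    # The small epsilon keeps float error from flooring a whole number one too low.
    failed_requests = math.floor(budget_fraction * expected_requests + 1e-9)
    return downtime_seconds, failed_requests


# Assumed example: a 99.9% target and 10 million requests expected in the window.
downtime_s, failures = error_budget(0.999, expected_requests=10_000_000)
print(f"99.9%:   {downtime_s / 60:.1f} minutes, {failures} failed requests allowed")

five_nines_s, _ = error_budget(0.99999)
print(f"99.999%: {five_nines_s:.0f} seconds")
```

Expressing the budget as a request count as well as minutes is useful because most services degrade partially rather than going fully dark.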
The burn rate tells you when to stop
The budget by itself is static. What makes it operational is the burn rate: how fast you're spending it relative to the window, or equivalently, your observed error rate divided by the error rate the SLO allows.
- Burning at 1× means you'll spend exactly the month's budget over the month. Fine. That's the budget doing its job.
- Burning at 14.4× means you'll exhaust a 30-day budget in about two days (30 / 14.4 ≈ 2.1). That's the classic threshold for a fast-burn page: at that rate a single hour eats 2% of the month's budget, so something is actively on fire.
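Both bullets fall out of one division. Here is a rough sketch, assuming the burn rate is defined as the observed error rate over the allowed error rate; the function names `burn_rate` and `days_to_exhaustion` are made up for illustration.

```python
def burn_rate(observed_error_rate: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    allowed_error_rate = 1.0 - slo
    return observed_error_rate / allowed_error_rate


def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Days until a full budget is gone if the current burn rate holds."""
    if rate <= 0:
        return float("inf")  # no errors, so the budget is never spent
    return window_days / rate


# With a 99.9% SLO, a 0.1% error rate burns at 1x; a 1.44% error rate burns at 14.4x.
steady = burn_rate(0.001, slo=0.999)
fast = burn_rate(0.0144, slo=0.999)
print(f"{steady:.1f}x lasts {days_to_exhaustion(steady):.1f} days")
print(f"{fast:.1f}x lasts {days_to_exhaustion(fast):.1f} days")
```

The error rates fed in here are invented inputs; in practice they would come from whatever measures your SLO over a recent lookback window.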
This gives you a policy that writes itself, with no debate in the moment:
- Budget remaining? Ship. Take the risk. That's what it's for.
- Budget spent? Freeze feature work and spend the next cycle on reliability until you're back in the black.
The argument about "should we slow down and fix things" stops being a matter of opinion or seniority. The budget already answered it.
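That ship-or-freeze rule is simple enough to encode directly. This is a sketch under stated assumptions: the budget is counted in failed requests over the current window, and the names `budget_remaining` and `decide` are hypothetical, as are the traffic figures.

```python
def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent; negative once overspent."""
    allowed_failures = (1.0 - slo) * total_requests
    if allowed_failures <= 0:
        return 0.0  # no traffic or no budget: nothing left to spend
    return 1.0 - failed_requests / allowed_failures


def decide(remaining: float) -> str:
    """Budget left means ship; budget gone means freeze feature work."""
    return "ship" if remaining > 0 else "freeze"


# Assumed example: 99.9% SLO and 1,000,000 requests in the window so far.
healthy = budget_remaining(0.999, 1_000_000, failed_requests=400)
overspent = budget_remaining(0.999, 1_000_000, failed_requests=1_200)
print(f"{healthy:.0%} left: {decide(healthy)}")
print(f"{overspent:.0%} left: {decide(overspent)}")
```

A real policy would likely add a warning band before zero, but the binary version is the one that ends the argument.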
Why this beats an uptime report
A reliability conversation built on uptime percentages drifts toward vibes — "feels stable lately," "we had a rough week." A conversation built on an error budget is concrete and forward-looking:
- It sets an explicit, agreed target instead of an implicit "as close to 100% as possible," which is both impossible and ruinously expensive.
- It makes the cost of unreliability visible before the incident, not after.
- It gives product and engineering a shared number to plan against, so "move fast" and "keep it up" stop being in permanent tension.
And it scales down cleanly. You don't need a platform team or a tracing stack to start — you need one SLO that maps to real user pain, a way to measure it, and the discipline to act on the budget when it runs low.
Where teams get it wrong
- Targeting 100%. There's no budget at 100%, so there's no room to ship and no signal when you're in trouble. Pick a number you can actually defend.
- An SLO nobody acts on. A budget you never enforce is just a dashboard. The value is entirely in the policy — freeze when it's gone.
- Measuring the wrong thing. "The server was up" isn't the same as "users got a fast, correct response." Tie the SLO to the request outcome users feel.
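To make the last point concrete, here is one way to score requests by the outcome users feel rather than by server liveness. It is only a sketch: the `Request` record, the 300 ms latency threshold, and the choice to treat any non-5xx status as "correct" are all assumptions, not a standard definition.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Request:
    status: int        # HTTP status code returned to the user
    latency_ms: float  # time the user waited for the response


def is_good(req: Request, max_latency_ms: float = 300.0) -> bool:
    """A request is good only if it was both served without a server error and fast enough."""
    return req.status < 500 and req.latency_ms <= max_latency_ms


def sli(requests: Iterable[Request], max_latency_ms: float = 300.0) -> float:
    """Fraction of requests that were good; this is the number the SLO targets."""
    reqs = list(requests)
    if not reqs:
        return 1.0  # assumption: no traffic means nothing failed
    good = sum(is_good(r, max_latency_ms) for r in reqs)
    return good / len(reqs)


# Invented sample: the server was "up" for all four, but only two users were well served.
sample = [
    Request(200, 120.0),  # good
    Request(200, 900.0),  # succeeded, but too slow
    Request(503, 40.0),   # fast, but failed
    Request(404, 80.0),   # client error, counted as served correctly here
]
print(f"SLI: {sli(sample):.2%}")
```

Whether a 404 or a slow success counts against the budget is a product decision; the point is to make that decision explicitly in the definition of "good".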
Reliability isn't a number you brag about at the end of the month. It's a budget you spend deliberately all month long — and the teams that treat it that way ship faster and break less, because they finally agree on what "enough" means.
Once you've set a target, put a price on missing it: the Downtime Cost Calculator turns the minutes you're allowed into the dollars at stake — which is usually what gets the reliability work funded.