The Incidents page: spikes, releases, and correlation
Detecting error spikes and correlating them with release markers and affected traces, so an incident reads as a story instead of a list.
Up next in the Errors section: an Incidents page.
A raw error list answers "what's broken." An incident view answers the better question: "what happened?" Those are different. The first is a feed; the second is a narrative with a beginning (a spike), a probable cause (a release), and a blast radius (the traces and queries it touched).
What it does
- Spike detection: flag the moments when error volume jumps well above the steady background rate. A spike is the start of a story.
- Release correlation: line up spikes against release markers. When a spike begins minutes after 2.4.0 ships, the page puts those two facts next to each other and lets you draw the obvious conclusion.
- Affected traces & slow queries: from a spike, jump straight to the requests and queries caught in it. Detection to diagnosis in one hop.
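To make the first bullet concrete, here is a minimal sketch of spike detection in Python. It is not the page's actual implementation, which this post doesn't describe. Everything specific is an assumption for illustration: per-minute error counts as input, a baseline of the last 10 quiet minutes, a threshold of mean plus three standard deviations, and the hypothetical names find_spike_starts, window, k, and min_count.

```python
from collections import deque
from statistics import mean, pstdev


def find_spike_starts(counts, window=10, k=3.0, min_count=5):
    """Return the indices of the buckets where a spike begins.

    counts: errors per time bucket (assumed: one bucket per minute).
    A bucket is "spiking" when it has at least min_count errors and
    exceeds the baseline mean by more than k standard deviations.
    """
    baseline = deque(maxlen=window)  # recent non-spiking buckets only
    starts = []
    in_spike = False
    for i, count in enumerate(counts):
        if len(baseline) < window:
            baseline.append(count)  # warm-up: not enough history yet
            continue
        # Floor the deviation at 1.0 so a very flat baseline does not
        # turn a tiny bump into a "spike".
        threshold = mean(baseline) + k * max(pstdev(baseline), 1.0)
        spiking = count >= min_count and count > threshold
        if spiking and not in_spike:
            starts.append(i)  # first bucket of a new spike
        if not spiking:
            baseline.append(count)  # spikes never pollute the baseline
        in_spike = spiking
    return starts


# Made-up data: a quiet background hum, then a three-minute burst.
counts = [2, 3, 2, 1, 2, 3, 2, 2, 1, 2, 2, 40, 45, 38, 3, 2]
print(find_spike_starts(counts))  # [11]
```

Keeping spiking buckets out of the baseline is a design choice in this sketch: it means a long incident is reported once, at its start, instead of raising the threshold and re-triggering halfway through.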
Why correlation beats detection
Anyone can detect a spike — it's a count crossing a line. The value is in the correlation: "errors spiked at 09:12, and 2.4.0 deployed at 09:07." That single juxtaposition turns a frantic "everything's on fire, where do I even look" into "roll back 2.4.0 and breathe." The page exists to manufacture that juxtaposition automatically.
It only works because the spine was already in place — releases are first-class, and traces are linkable. Each feature this sprint makes the next one cheaper.
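The juxtaposition itself can be sketched in a few lines of Python. Again, this is an illustration under stated assumptions, not the page's real logic: releases are taken to be (version, deployed_at) pairs, a release counts as a suspect if it shipped no more than 30 minutes before the spike, and release_before and max_gap are hypothetical names. The calendar date and the 2.3.9 release are made up; 2.4.0, 09:07, and 09:12 come from the example above.

```python
from datetime import datetime, timedelta


def release_before(spike_start, releases, max_gap=timedelta(minutes=30)):
    """Return (version, gap) for the most recent release deployed at or
    before spike_start and within max_gap of it, or None if there is none.

    releases: iterable of (version, deployed_at) pairs.
    """
    candidates = [
        (version, deployed_at)
        for version, deployed_at in releases
        if timedelta(0) <= spike_start - deployed_at <= max_gap
    ]
    if not candidates:
        return None
    # The latest deploy before the spike is the most likely suspect.
    version, deployed_at = max(candidates, key=lambda c: c[1])
    return version, spike_start - deployed_at


# Hypothetical release markers; only 2.4.0 at 09:07 is from the text.
releases = [
    ("2.3.9", datetime(2024, 1, 15, 8, 0)),
    ("2.4.0", datetime(2024, 1, 15, 9, 7)),
]
spike = datetime(2024, 1, 15, 9, 12)

match = release_before(spike, releases)
if match:
    version, gap = match
    minutes = int(gap.total_seconds() // 60)
    print(f"errors spiked {minutes} min after {version} deployed")
```

A match is a suspect, not a verdict: the time window only says the release is worth checking first.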
Next: putting numbers on reliability with SLOs and error budgets.