Anomaly detection with z-scores
Catching problems no threshold was set for — z-score anomaly detection over error rate, latency, and throughput, wired into the alert engine.
The last piece of this stretch: anomaly detection.
The gap thresholds leave
Static thresholds are great for the problems you anticipated: "Alert if error rate > 5%." But the nastiest incidents are the ones you didn't think to set a threshold for: a metric that's unusual for this app even though it sits under every limit you configured. You can't write a rule for a problem you haven't imagined.
That's what anomaly detection is for: not "above 5%" but "way outside what's normal for you."
z-scores, kept simple
I built an AnomalyDetector that computes a z-score (how many standard deviations the current value sits from its recent baseline) for a handful of metrics: error rate, p95 latency, throughput, and error volume. When a metric strays far enough from its own normal, it's recorded in detected_anomalies and surfaced in the Quality section.
No machine-learning rabbit hole, no training pipeline. A z-score is boring, explainable statistics, and "boring and explainable" is exactly what you want for something that pages you at 3am. If it fires, I can say why in one sentence: "throughput is four standard deviations below normal."
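To make the arithmetic concrete, here is a minimal sketch of a rolling z-score check. It is not the actual AnomalyDetector: the RollingZScore class, the explain helper, the window size, the threshold of 3 standard deviations, and the warm-up count are all assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean, pstdev


class RollingZScore:
    """Hypothetical sketch: z-score of one metric against a rolling baseline."""

    def __init__(self, window=60, threshold=3.0, min_samples=10):
        self.baseline = deque(maxlen=window)  # keeps only the most recent samples
        self.threshold = threshold            # assumed cutoff, in standard deviations
        self.min_samples = min_samples        # stay quiet until there is some history

    def observe(self, value):
        """Score value against the baseline, then fold it in.

        Returns the z-score if the value is anomalous, otherwise None.
        """
        z = None
        if len(self.baseline) >= self.min_samples:
            mu = mean(self.baseline)
            sigma = pstdev(self.baseline)
            if sigma > 0:  # a perfectly flat baseline has no defined z-score
                z = (value - mu) / sigma
        self.baseline.append(value)
        if z is not None and abs(z) >= self.threshold:
            return z
        return None


def explain(metric, z):
    """Build the one-sentence reason that goes into the alert."""
    direction = "above" if z > 0 else "below"
    return f"{metric} is {abs(z):.1f} standard deviations {direction} normal"


detector = RollingZScore()
for sample in [100, 102, 98, 101, 99, 100, 103, 97, 100, 100]:
    detector.observe(sample)  # warm-up: builds the baseline, flags nothing

z = detector.observe(60)  # throughput suddenly drops
if z is not None:
    print(explain("throughput", z))
```

One design choice worth noting: this sketch folds every sample into the baseline, including anomalous ones, so a long incident gradually becomes the new normal. Whether that is desirable depends on the metric; excluding flagged samples from the baseline is a reasonable alternative.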
Riding the engine again
Like SLO burn rate before it, anomaly detection raises an anomaly.detected event and the alert engine routes it — no new delivery code. That's three detectors now (security signals, SLO burn, anomalies) all reusing the same pluggable seam I built earlier in the sprint. The investment is paying off exactly as hoped.
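The shape of that seam can be sketched roughly as follows. Only the anomaly.detected event name comes from this post; the AlertEngine class, its route and raise_event methods, and the payload fields are hypothetical stand-ins for whatever the real engine exposes.

```python
from collections import defaultdict


class AlertEngine:
    """Hypothetical sketch of a pluggable event router."""

    def __init__(self):
        self._routes = defaultdict(list)  # event type -> delivery handlers

    def route(self, event_type, handler):
        """Register a delivery handler for one event type."""
        self._routes[event_type].append(handler)

    def raise_event(self, event_type, payload):
        """Hand the payload to every handler for this event type.

        Returns how many handlers received it.
        """
        handlers = self._routes.get(event_type, [])
        for handler in handlers:
            handler(payload)
        return len(handlers)


engine = AlertEngine()
delivered = []  # stands in for an existing delivery channel
engine.route("anomaly.detected", delivered.append)

# The detector only raises the event; delivery is already wired up.
count = engine.raise_event(
    "anomaly.detected",
    {"metric": "throughput", "z": -4.0},  # assumed payload fields
)
```

The point of the sketch is that a new detector needs one raise_event call and nothing else: routing and delivery stay in the engine.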
Five dashboard sections, three new alert sources, watchers across three SDKs, privacy controls. Next, the capstone.