June 19, 2026 2 min read Tom Shafer

Anomaly detection with z-scores

Catching problems no threshold was set for — z-score anomaly detection over error rate, latency, and throughput, wired into the alert engine.

The last piece of this stretch: anomaly detection.

The gap thresholds leave

Static thresholds are great for the problems you anticipated: "alert if error rate > 5%." But the nastiest incidents are the ones you never thought to set a threshold for: a metric that's unusual for this app even though it sits under every limit you configured. You can't write a rule for a problem you haven't imagined.

That's what anomaly detection is for: not "above 5%" but "way outside what's normal for you."

z-scores, kept simple

I built an AnomalyDetector that computes a z-score (how many standard deviations the current value sits from its recent baseline) over a handful of metrics: error rate, p95 latency, throughput, and error volume. When a metric strays far enough from its own normal, it's recorded in detected_anomalies and surfaced in the Quality section.
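
To make the arithmetic concrete, here is a minimal sketch of the idea, not the actual AnomalyDetector. It computes z = (current - mean) / standard deviation against a list of recent values. The function names, the dictionary shapes, the 3.0 cutoff, and the choice of population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import Dict, List, Optional, Sequence


def z_score(current: float, baseline: Sequence[float]) -> Optional[float]:
    """Return how many standard deviations `current` sits from the baseline mean.

    Returns None when the baseline has fewer than two points or no spread,
    because a z-score is undefined in those cases.
    """
    if len(baseline) < 2:
        return None
    sigma = pstdev(baseline)  # population std dev; a sample std dev would also work
    if sigma == 0:
        return None
    return (current - mean(baseline)) / sigma


def detect_anomalies(
    current: Dict[str, float],
    history: Dict[str, Sequence[float]],
    threshold: float = 3.0,  # assumed cutoff, not taken from the real detector
) -> List[dict]:
    """Flag every metric whose |z| meets or exceeds the threshold."""
    flagged = []
    for metric, value in current.items():
        z = z_score(value, history.get(metric, []))
        if z is not None and abs(z) >= threshold:
            flagged.append({"metric": metric, "value": value, "z": z})
    return flagged


# Example baseline with mean 5 and standard deviation 2.
recent = [2, 4, 4, 4, 5, 5, 7, 9]
anomalies = detect_anomalies(
    {"throughput": 13, "error_rate": 5},
    {"throughput": recent, "error_rate": recent},
)
```

In this example only throughput is flagged: 13 is four standard deviations above a baseline mean of 5, while an error rate of 5 sits exactly on its mean. Returning None for a flat or too-short baseline avoids a division by zero and keeps a brand-new metric from paging anyone.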

No machine-learning rabbit hole, no training pipeline. A z-score is boring, explainable statistics, and "boring and explainable" is exactly what you want for something that pages you at 3am. If it fires, I can say why in one sentence: "throughput is four standard deviations below normal."

Riding the engine again

Like SLO burn rate before it, anomaly detection raises an anomaly.detected event and the alert engine routes it — no new delivery code. That's three detectors now (security signals, SLO burn, anomalies) all reusing the same pluggable seam I built earlier in the sprint. The investment is paying off exactly as hoped.
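
The seam can be pictured roughly like this. It is a sketch under assumptions, not the real engine: the AlertEngine class, its route and raise_event methods, and the payload fields are hypothetical, and only the anomaly.detected event name comes from the post.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Event = Dict[str, object]
Handler = Callable[[Event], None]


class AlertEngine:
    """Hypothetical stand-in for the alert engine: routes events to handlers by type."""

    def __init__(self) -> None:
        self._routes: Dict[str, List[Handler]] = defaultdict(list)

    def route(self, event_type: str, handler: Handler) -> None:
        """Register a delivery handler for one event type."""
        self._routes[event_type].append(handler)

    def raise_event(self, event_type: str, payload: Event) -> int:
        """Deliver the event to every registered handler; return how many received it."""
        handlers = self._routes.get(event_type, [])
        for handler in handlers:
            handler({"type": event_type, **payload})
        return len(handlers)


# Delivery is configured once; a plain list stands in for email, chat, or paging.
delivered: List[Event] = []
engine = AlertEngine()
engine.route("anomaly.detected", delivered.append)

# A detector only has to raise its event; it never touches delivery code.
count = engine.raise_event("anomaly.detected", {"metric": "throughput", "z": -4.0})
```

The point of the shape is that adding a new detector means adding a new event type and a route, while the delivery handlers stay untouched.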

That closes out this stretch: five dashboard sections, three new alert sources, watchers across three SDKs, and privacy controls. Next, the capstone.

build-in-public anomaly-detection statistics alerting