How to metric. This article will teach you how my… | by Dave Smith | Medium
Stop using averages for metrics - they're lying to you. This engineering director explains why percentiles, adaptive thresholds, and automated reviews are the only way to actually understand what's happening in your systems.
TLDR
• Averages mislead because outliers skew them and no user actually experiences the average - use P50, P90, and P99 percentiles instead, where every value is a real data point
• Every alarm needs grace periods (time-based or sample-based) to avoid spurious alerts from temporary spikes that self-recover
• Build automated tools to review and propose threshold changes weekly - as your software evolves, your thresholds must adapt or they become useless
• The absence of a metric is as critical as a threshold breach - alarm when metrics stop reporting
• Comprehensive list of what to actually measure for web apps: latency percentiles per URL, error rates by response code, cache performance, host metrics like load average, service I/O operations
In Detail
The fundamental problem with software metrics is that most teams use averages, which systematically mislead. When you average load times of 62ms, 920ms, 37ms, 20ms, 850ms, and 45ms, you get 322ms - but no user actually experienced that. Some had blazing fast loads under 70ms, others suffered through 850ms+. Percentiles solve this: P50 gives you the median (62ms), P90 shows the worst 10% experience (920ms), and every data point represents an actual user. This applies beyond latency - for memory across 50 servers, P0/P50/P100 shows min/median/max usage and reveals how consistently your app behaves.
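The numbers above can be reproduced with a few lines of Python. Percentile definitions vary between libraries; this is a minimal nearest-rank sketch (floor-rank variant) chosen because it matches the article's figures - the result is always a value some real request actually experienced, never an interpolated one:

```python
def percentile(samples, p):
    """Nearest-rank style percentile (floor-rank variant): returns an
    actual observed value, unlike an average or an interpolated quantile."""
    ranked = sorted(samples)
    rank = min(len(ranked), int(p / 100 * len(ranked)) + 1)  # 1-based rank, clamped
    return ranked[rank - 1]

load_times_ms = [62, 920, 37, 20, 850, 45]

print(sum(load_times_ms) / len(load_times_ms))  # ~322ms average: no user saw this
print(percentile(load_times_ms, 50))            # 62ms  - the median experience
print(percentile(load_times_ms, 90))            # 920ms - the worst-case tail
print(percentile(load_times_ms, 0),
      percentile(load_times_ms, 100))           # 20 920 - min and max (P0/P100)
```

The same function covers the memory example: feed it one sample per server and P0/P50/P100 give min, median, and max usage across the fleet.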
Alarms are useless without grace periods. If your CPU spikes to 90% during cron jobs but recovers in seconds, a flat 50% threshold will page you at 3am for nothing. Time-based grace periods ensure metrics stay in alarm territory before firing. Better systems let you set different grace periods for entering and exiting alarm states. Adaptive thresholds go further - using historical data to vary thresholds by time of day or day of week, catching unusual nighttime traffic spikes that flat thresholds would miss. The author's team built tools that analyze metrics weekly and propose threshold adjustments automatically, preventing thresholds from becoming stale as software evolves.
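The article doesn't show the author's implementation, but a time-based grace period with separate enter/exit windows can be sketched in a few dozen lines. All names and parameters here are illustrative assumptions, not from the source:

```python
class GracePeriodAlarm:
    """Fires only after the metric stays past `threshold` for `enter_grace`
    seconds, and clears only after it stays back under it for `exit_grace`
    seconds - so a brief cron-job spike that self-recovers never pages anyone."""

    def __init__(self, threshold, enter_grace, exit_grace):
        self.threshold = threshold
        self.enter_grace = enter_grace
        self.exit_grace = exit_grace
        self.firing = False
        self._breach_start = None   # when the current sustained breach began
        self._recover_start = None  # when the current sustained recovery began

    def observe(self, value, now):
        """Feed one sample (value, unix-seconds); returns whether we're firing."""
        if not self.firing:
            if value > self.threshold:
                if self._breach_start is None:
                    self._breach_start = now
                if now - self._breach_start >= self.enter_grace:
                    self.firing = True
                    self._recover_start = None
            else:
                self._breach_start = None  # spike recovered: reset the clock
        else:
            if value <= self.threshold:
                if self._recover_start is None:
                    self._recover_start = now
                if now - self._recover_start >= self.exit_grace:
                    self.firing = False
                    self._breach_start = None
            else:
                self._recover_start = None  # still unhealthy: reset recovery clock
        return self.firing

alarm = GracePeriodAlarm(threshold=50, enter_grace=60, exit_grace=120)
alarm.observe(90, now=0)    # cron job pushes CPU to 90%...
alarm.observe(20, now=45)   # ...but it recovers in 45s: no page
alarm.observe(90, now=100)
print(alarm.observe(90, now=170))  # True - sustained 70s breach fires the alarm
```

An adaptive version would replace the flat `threshold` with a lookup keyed by hour-of-day and weekday, derived from historical percentiles - which is also the natural input for a weekly job that proposes threshold changes.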
For web applications specifically, measure:
• P50/P90/P99 latency per URL pattern
• Response codes (2xx/3xx/4xx/5xx) as both counts and rates
• Latency breakdown: app vs. database vs. downstream services
• Request types (GET/POST/DELETE)
• Application-specific load times using the performance API
• Payload sizes and gzip compression ratios
• Load balancer spillovers
• Cache hit/miss rates
• Critical host metrics, including load average - which most teams ignore, but which reveals how well the hardware keeps up with software demands
• Service I/O operations - AWS services like DynamoDB throttle you, causing mysterious slowdowns
The key insight: outsource your metrics infrastructure entirely to tools like statsd, Graphite, or Librato rather than building your own.
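Outsourcing to statsd means your app only has to emit tiny UDP packets in the plain statsd wire format (`<name>:<value>|<type>`); statsd aggregates timers into percentiles server-side before forwarding to Graphite. A minimal stdlib-only sketch (metric names are illustrative, not from the article):

```python
import socket

def statsd_line(name, value, metric_type):
    """Build one metric in the statsd wire format: <name>:<value>|<type>.
    'ms' marks a timer (aggregated into percentiles server-side),
    'c' marks a counter."""
    return f"{name}:{value}|{metric_type}"

def emit(line, host="127.0.0.1", port=8125):
    # statsd listens on UDP: fire-and-forget, so emitting a metric
    # never adds meaningful latency to the request being measured
    socket.socket(socket.AF_INET, socket.SOCK_DGRAM).sendto(
        line.encode(), (host, port))

# Per-URL latency timer plus a response-code counter for one request:
print(statsd_line("web.checkout.latency", 184, "ms"))    # web.checkout.latency:184|ms
print(statsd_line("web.checkout.response.5xx", 1, "c"))  # web.checkout.response.5xx:1|c
```

In practice you would use an existing statsd client library rather than raw sockets; the point is how little instrumentation code the app itself needs once aggregation is outsourced.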