I'm trying to build a daily alert system to let me know when something unusual has happened in my analytics data that might require further investigation.
Right now I'm taking 7 weeks' worth of data and computing an average for each day of the week (e.g., the average conversion rate for Mondays). I then compare how far this Monday has moved from the typical Monday average, in standard-deviation units. If it's beyond +/-2 standard deviations, it warrants investigation.
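For concreteness, here's a minimal sketch of what I'm doing now (function and variable names, and the sample numbers, are just illustrative):

```python
import numpy as np

def check_today(history, today, n_sigmas=2.0):
    """Flag `today` if it sits more than `n_sigmas` sample standard
    deviations away from the mean of `history` (same weekday, past weeks)."""
    history = np.asarray(history, dtype=float)
    mean, sd = history.mean(), history.std(ddof=1)  # sample SD
    z = (today - mean) / sd
    return abs(z) > n_sigmas, z

# e.g. conversion rates for the last 7 Mondays, then this Monday:
past_mondays = [0.031, 0.029, 0.033, 0.030, 0.028, 0.032, 0.031]
alert, z = check_today(past_mondays, today=0.022)
print(alert, round(z, 2))  # True, roughly -5 SDs below the Monday mean
```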
A few concerns I have:
How many weeks should I take into account? If I take too many, an emerging trend might skew the alerts; if I take too few, the variance will be high and the system might fire alerts needlessly.
Would it be better to take a rolling average of, say, the last 7 Mondays and compare this Monday to that, rather than to the overall Monday mean?
What would be the best way to deal with seasonality, e.g., bank holiday Mondays, which happen only occasionally but would skew the averages and standard deviations?
Some websites might have little traffic per day. How should I deal with them? Would I have to take a longer time frame to calculate averages? How would I decide how many to take?
Best Answer
You seem a bit confused about the difference between type I and type II errors.
First off, a critical value of +/-2 standard deviations means that, for normal data, you will send a false positive alert roughly 1 time in 20. Isn't that going to be annoying, considering that over a year you'd expect an average of 2.6 incorrect warnings per weekday series (52 weeks x 0.05) even when the data are perfectly regular? I'd consider specifying a more conservative false positive rate.
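To make that arithmetic explicit (the 2.6 figure uses the rounded 1-in-20 rate; the exact +/-2 SD tail gives about 2.4), a quick check assuming scipy is available:

```python
from scipy.stats import norm

alpha = 2 * norm.sf(2.0)           # two-sided normal tail beyond +/-2 SD
print(round(alpha, 4))             # 0.0455, i.e. roughly 1 in 20
print(round(52 * alpha, 1))        # ~2.4 false alarms/year per weekday series
print(round(2 * norm.sf(3.0), 4))  # tightening to +/-3 SD: 0.0027 (~1 in 370)
```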
When you take too few observations, your estimate of the variance is not "too high", it's just unstable (unless you mean that you're using the standard error of the mean rather than the standard deviation). "Unstable" means the SD estimate could come out too high or it could come out too low. And note the direction of the consequences: if the estimated standard deviation (or standard error) were too large, you would fire alerts too rarely (conservative), not too often (anticonservative).
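A quick simulation illustrates "unstable": with only 7 observations, the sample SD scatters widely around the true value, in both directions (numbers below are approximate and depend on the seed):

```python
import numpy as np

rng = np.random.default_rng(0)
# Sample SDs from 10,000 samples of size 7 drawn from N(0, 1), true SD = 1
sds = rng.normal(0.0, 1.0, size=(10_000, 7)).std(axis=1, ddof=1)
print(round(np.quantile(sds, 0.05), 2),   # ~0.52
      round(np.quantile(sds, 0.95), 2))   # ~1.45
# The SD estimate is routinely 50% off in either direction with n = 7.
```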
The real question here is whether "unusual" in this circumstance means "inconsistent with what the data usually are" or "inconsistent with some pre-identified target value". The latter warrants a control chart, an effective graphical tool for displaying temporal variability and trends in a controlled data process. For the former, you can fit an ARMA (autoregressive moving average) model to roll forward the previously observed data and test whether the present value is consistent with the historical values. The asymptotic theory of these models extends easily to a decision rule for sending alerts, based on the approximately normal distribution of the model's parameter estimates.
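One possible translation of the ARMA suggestion, sketched with statsmodels (the `(1, 0, 1)` order and `alpha` are placeholder choices, and you'd want considerably more than 7 weeks of history to fit such a model reliably):

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def arma_alert(history, today, order=(1, 0, 1), alpha=0.01):
    """Fit an ARMA model to the history and alert when `today` falls
    outside the (1 - alpha) one-step-ahead prediction interval."""
    res = ARIMA(np.asarray(history, dtype=float), order=order).fit()
    lo, hi = res.get_forecast(steps=1).conf_int(alpha=alpha)[0]
    return not (lo <= today <= hi), (lo, hi)
```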
If certain days are known to be heterogeneous (bank holidays, for instance), it makes sense simply to exclude them: leave them out of the baseline and suppress alerts on those days, since an alert triggered by a known, expected irregularity tells you nothing new.
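For instance, with a hypothetical `holidays` set (the dates and names below are illustrative), you can both drop those dates from the baseline and skip alerting on them:

```python
import datetime as dt

# Hypothetical set of known-irregular dates (e.g. UK bank holidays)
holidays = {dt.date(2024, 5, 6), dt.date(2024, 5, 27)}

def baseline_values(observations, holidays):
    """Drop known-irregular days so they don't distort the mean/SD baseline.
    `observations` is an iterable of (date, value) pairs."""
    return [value for day, value in observations if day not in holidays]

def should_check(today, holidays):
    """Suppress alerting entirely on days expected to be irregular."""
    return today not in holidays
```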
With small sample sizes, you should consider parametric probability models for website hits. Instead of estimating the mean and variance independently, consider counting-process models such as the Poisson or negative binomial, in which the mean determines (or constrains) the variance. That way you'd need fewer observations to infer whether a specific day's hit rate was abnormally high or low. These models, of course, embed stronger assumptions.
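A minimal sketch of that count-model idea, using a two-sided Poisson tail test (a negative binomial would be the natural swap-in when the counts are overdispersed; the function name and numbers are illustrative):

```python
from scipy.stats import poisson

def poisson_alert(past_counts, today_count, alpha=0.01):
    """Alert if today's count would be surprising under a Poisson
    distribution with the historical mean rate (two-sided tail test)."""
    lam = sum(past_counts) / len(past_counts)
    p_low = poisson.cdf(today_count, lam)      # P(X <= today)
    p_high = poisson.sf(today_count - 1, lam)  # P(X >= today)
    return 2 * min(p_low, p_high) < alpha

# e.g. 7 past Mondays with sparse hits, then an unusually quiet Monday:
print(poisson_alert([12, 9, 15, 11, 8, 13, 10], today_count=2))  # True
```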