I am building a component that computes the average and variance of a metric attached to events arriving over time, using only a small, fixed amount of internal memory.
Imagine that the events are visitors entering a shop and the metric is their age.
Over time, my component receives one event per visitor, carrying that visitor's age.
I don't want my component to store the full history of ages.
Ideally, I would like a lightweight component that stores only the average A, the variance V and the number of events N.
After each event with age E, I want to update those three values (N on the right-hand side is the count before the update):
N <= N+1
A <= (A*N+E)/(N+1)
V <= ???
What about V? I am thinking of something like:
V <= (V*N+(E-A)^2)/(N+1)
I know this is not exact, because my previous V was computed with the old A, which is no longer the average.
Q1 – Is there an exact formula?
Q2 – If not, is my proposal a good estimate? Is it biased? Will it converge correctly as N increases?
Q3 – Is there a better formula?
Best Answer
A nice and simple algorithm for computing the variance in an online manner was described by Welford (1962). Below is a C++/Rcpp implementation that works offline (it takes the whole vector at once), but it can easily be adapted to the online scenario:
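    #include <Rcpp.h>
    using namespace Rcpp;

    // [[Rcpp::export]]
    List welford_cpp(NumericVector x) {
      int n = x.length();
      if (n == 0) {
        // No observations: neither statistic is defined.
        return List::create(Named("mean") = NAN, Named("variance") = NAN);
      }
      double delta;
      double msq = 0;      // running sum of squared deviations from the mean
      double mean = x[0];
      for (int i = 1; i < n; i++) {
        delta = x[i] - mean;           // deviation from the old mean
        mean += delta / (i + 1);       // mean of the first i + 1 values
        msq += delta * (x[i] - mean);  // old deviation times new deviation
      }
      // The sample variance is undefined for a single observation.
      double variance = (n > 1) ? msq / (n - 1) : NAN;
      return List::create(Named("mean") = mean, Named("variance") = variance);
    }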
As you can see, it only needs to store four variables: n, delta, msq and mean (delta is just a temporary within each step), and it computes the mean and the variance simultaneously, as you wanted.
Welford, B. P. (1962). Note on a method for calculating corrected sums of squares and products. Technometrics 4(3): 419-420.