Solved – Online estimation of variance with limited memory

I am creating a component that aims to calculate the average and variance of a metric associated with events happening during time but with a limited internal memory.

Imagine that the events are visitors entering in a shop and the metric is their age.

During time, my component receives events with the age of each visitor.
I don't want my component to memorize the history of each ages.
Ideally, I would like a light component storing only:
the average A, the variance V and the number of events N.

After each event with age E, I want to update those three values :

N<=N+1 A<=(A*N+E)/(N+1) V<=??? 

What for V? I am thinking of something like :


I know it is not exact as my previous V is using the old A which is no more the average.

Q1 – Is there an exact formula?
Q2 – If not, is my proposal a good estimate? Is it biased? Will it converge correctly when N increases?
Q3 – Is there a better formula?

Nice and simple algorithm for computing variance in online manner was described by Welford (1962). Below you can see C++/Rcpp implementation of it that works offline, but can be easily adapted to online scenario:

List welford_cpp(NumericVector x) {    int n = x.length();   double delta;   double msq = 0;   double mean = x[0];    if (n > 1) {     for (int i = 1; i < n; i++) {        delta = x[i] - mean;       mean += delta / (i+1);       msq += delta * (x[i] - mean);     }     return Rcpp::List::create(Rcpp::Named("mean") = mean,                               Rcpp::Named("variance") = msq / (n-1));   }    return Rcpp::List::create(Rcpp::Named("mean") = mean,                             Rcpp::Named("variance") = NAN); } 

As you can see, it needs to store only four variables: n, delta, msq and mean and computes mean and variance simultaneously as you wanted.

Welford, B. P. (1962). Note on a method for calculating corrected sums of squares and products. Technometrics 4(3): 419-420.

Similar Posts:

Rate this post

Leave a Comment