I am creating a component that aims to calculate the average and variance of a metric associated with events happening during time but with a limited internal memory.

Imagine that the events are visitors entering in a shop and the metric is their age.

During time, my component receives events with the age of each visitor.

I don't want my component to memorize the history of each ages.

Ideally, I would like a light component storing only:

the average `A`

, the variance V and the number of events `N`

.

After each event with age `E`

, I want to update those three values :

`N<=N+1 A<=(A*N+E)/(N+1) V<=??? `

What for `V`

? I am thinking of something like :

`V<=(V*N+(E-A)^2)/(N+1) `

I know it is not exact as my previous `V`

is using the old `A`

which is no more the average.

Q1 – Is there an exact formula?

Q2 – If not, is my proposal a good estimate? Is it biased? Will it converge correctly when `N`

increases?

Q3 – Is there a better formula?

**Contents**hide

#### Best Answer

Nice and simple algorithm for computing variance in online manner was described by Welford (1962). Below you can see C++/Rcpp implementation of it that works offline, but can be easily adapted to online scenario:

`List welford_cpp(NumericVector x) { int n = x.length(); double delta; double msq = 0; double mean = x[0]; if (n > 1) { for (int i = 1; i < n; i++) { delta = x[i] - mean; mean += delta / (i+1); msq += delta * (x[i] - mean); } return Rcpp::List::create(Rcpp::Named("mean") = mean, Rcpp::Named("variance") = msq / (n-1)); } return Rcpp::List::create(Rcpp::Named("mean") = mean, Rcpp::Named("variance") = NAN); } `

As you can see, it needs to store only four variables: `n`

, `delta`

, `msq`

and `mean`

and computes mean and variance simultaneously as you wanted.

Welford, B. P. (1962). Note on a method for calculating corrected sums of squares and products. Technometrics 4(3): 419-420.