Let $X = [94, 10, 100, 100, 16, 14, 100, 100, 70, 88, 100, 100, 12, 100, 100, 58, 32, 100, 32, 36, 98, 0, 100, 100, 100]$
where $X$ holds students' scores (between 0 and 100); note the many full marks!
The question is which statistics best describe the data (note that the data are non-Gaussian).
Option 1
If I fit a Gaussian using maximum likelihood, I get
a sample mean of 70.4 and an SD of 37.96, so mean ± 1 SD gives an interval from 32.43 to 108.36.
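As a quick check of these figures, here is a minimal Python sketch using only the standard library. One point worth flagging: the strict maximum-likelihood estimate of a Gaussian's standard deviation divides by N, which gives roughly 37.19 for this data, whereas the 37.96 quoted above matches the N - 1 (sample) version. The variable names are illustrative and not part of the original question.

```python
import statistics

X = [94, 10, 100, 100, 16, 14, 100, 100, 70, 88, 100, 100, 12,
     100, 100, 58, 32, 100, 32, 36, 98, 0, 100, 100, 100]

mean = statistics.fmean(X)

# Maximum-likelihood SD: divides the sum of squared deviations by N.
sd_ml = statistics.pstdev(X)

# Sample SD: divides by N - 1 (the figure that matches 37.96).
sd_sample = statistics.stdev(X)

print(f"mean         = {mean:.2f}")
print(f"SD (ML, /N)  = {sd_ml:.2f}")
print(f"SD (/(N-1))  = {sd_sample:.2f}")
print(f"mean -/+ SD  = [{mean - sd_sample:.2f}, {mean + sd_sample:.2f}]")
```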
Finally, if I fit a Gaussian to the data $X$ using normfit in MATLAB and obtain 95% confidence bounds on the mean and standard deviation, I get
$$
\begin{aligned}
\mu &= 70.4 ; &CI_{95\%} = [54.73, 86.06] \\
\sigma &= 37.96 ; &CI_{95\%} = [29.64, 52.80]
\end{aligned}
$$
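These bounds can be approximately reproduced outside MATLAB. The sketch below assumes the usual textbook intervals for a Gaussian sample: a t-based interval for the mean and a chi-square-based interval for the standard deviation. The three quantiles for 24 degrees of freedom are hard-coded from standard statistical tables (an assumption made to keep the example in the standard library), so the results should agree with the values above only up to rounding.

```python
import math
import statistics

X = [94, 10, 100, 100, 16, 14, 100, 100, 70, 88, 100, 100, 12,
     100, 100, 58, 32, 100, 32, 36, 98, 0, 100, 100, 100]

n = len(X)                      # 25 scores, so 24 degrees of freedom
mean = statistics.fmean(X)
sd = statistics.stdev(X)        # sample SD (divides by n - 1)

# Tabulated quantiles for 24 degrees of freedom (assumed, taken from tables).
T_975 = 2.0639                  # t distribution, 97.5th percentile
CHI2_025 = 12.4012              # chi-square, 2.5th percentile
CHI2_975 = 39.3641              # chi-square, 97.5th percentile

# 95% interval for the mean: mean -/+ t * sd / sqrt(n)
half_width = T_975 * sd / math.sqrt(n)
mean_ci = (mean - half_width, mean + half_width)

# 95% interval for the SD: sqrt((n - 1) * sd^2 / chi-square quantile)
sum_sq = (n - 1) * sd ** 2
sd_ci = (math.sqrt(sum_sq / CHI2_975), math.sqrt(sum_sq / CHI2_025))

print(f"mean CI: [{mean_ci[0]:.2f}, {mean_ci[1]:.2f}]")
print(f"SD CI:   [{sd_ci[0]:.2f}, {sd_ci[1]:.2f}]")
```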
Option 2
On the other hand, what if I use left and right SDs instead? That is, report two SD values, $SD_{left}$ and $SD_{right}$, where:
$$
\begin{aligned}
SD_{left} &= \sqrt{\frac{1}{N_{left}} \sum_i I(X_i < \mu)\,(X_i - \mu)^2} = 49.94 \\
SD_{right} &= \sqrt{\frac{1}{N_{right}} \sum_i I(X_i \ge \mu)\,(X_i - \mu)^2} = 29.45
\end{aligned}
$$
where $\mu = 70.4$ is the mean, $N_{left} = \sum_i I(X_i < \mu) - 1 = 9$ is the number of samples below the mean (minus 1 to remove bias), $N_{right} = \sum_i I(X_i \ge \mu) - 1 = 14$ is the corresponding count for samples at or above the mean, and $I$ is the indicator function, which gives 1 if its argument is true and 0 otherwise.
In this case the interval around the mean is [20.46, 99.85], instead of the previous result, [32.43, 108.36].
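The left and right SDs can be checked with a short Python sketch. It reads the definition as: split the scores at the mean, then take the root of the summed squared deviations from the overall mean divided by the group size minus one. The helper name `one_sided_sd` is made up for this illustration.

```python
import math
import statistics

X = [94, 10, 100, 100, 16, 14, 100, 100, 70, 88, 100, 100, 12,
     100, 100, 58, 32, 100, 32, 36, 98, 0, 100, 100, 100]

mu = statistics.fmean(X)

# Split the scores at the overall mean.
left = [x for x in X if x < mu]
right = [x for x in X if x >= mu]


def one_sided_sd(values, center):
    """Root of summed squared deviations from `center`, divided by len(values) - 1."""
    return math.sqrt(sum((v - center) ** 2 for v in values) / (len(values) - 1))


sd_left = one_sided_sd(left, mu)
sd_right = one_sided_sd(right, mu)
interval = (mu - sd_left, mu + sd_right)

print(f"SD_left  = {sd_left:.2f} (from {len(left)} scores)")
print(f"SD_right = {sd_right:.2f} (from {len(right)} scores)")
print(f"interval = [{interval[0]:.2f}, {interval[1]:.2f}]")
```

Note that the upper end of this interval stays below 100, unlike the symmetric interval, but both halves still summarize a two-cluster sample with a single center.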
Which one shall I go for, 1 or 2?
Best Answer
I would say not to use any standard deviation here. Your data isn't just skewed, it's bimodal. This indicates something about the students: either they "get it" or they don't. This fact is lost in any mean or standard deviation.
If you are just trying to describe the data, give percentiles. E.g. you could give 25th, 50th (median) and 75th; or perhaps quintiles. Better would be to do a density plot.
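As a rough illustration of this suggestion, the sketch below computes quartiles with Python's standard library and prints a crude text histogram in place of a density plot. The 20-point bands are an arbitrary choice for this example, and `statistics.quantiles` uses its default "exclusive" method, so other tools may report slightly different quartiles.

```python
import statistics
from collections import Counter

X = [94, 10, 100, 100, 16, 14, 100, 100, 70, 88, 100, 100, 12,
     100, 100, 58, 32, 100, 32, 36, 98, 0, 100, 100, 100]

# 25th, 50th (median) and 75th percentiles.
q1, median, q3 = statistics.quantiles(X, n=4)
print(f"Q1 = {q1}, median = {median}, Q3 = {q3}")

# Count scores in 20-point bands; a score of 100 is folded into the top band.
bins = Counter(min(x // 20, 4) for x in X)

for b in range(5):
    lo = b * 20
    hi = 100 if b == 4 else lo + 19
    print(f"{lo:3d}-{hi:3d} | {'#' * bins[b]}")
```

The two tall bars at the ends of the score range, with almost nothing in between, are the bimodality that a single mean and SD hide.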
If you are trying to model the data, find a distribution that is bimodal.