In my university we learned a lot about normal data, but handling non-normal data wasn't really covered. For some benchmarking of an application, I have data with a very high frequency around one value.
In addition to the histogram, my standard checks for normality are qqplot() and qqline().
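Concretely, something like this (assuming the measurements are in a numeric vector foo, as in the dput output below; for a one-sample normality check, base R's qqnorm() draws the plot that qqline() annotates):
# Normal Q-Q plot of the benchmark data, with reference line
qqnorm(foo)
qqline(foo)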
It is clear that this dataset deviates from a normal distribution. Now the question remains: how to present this data in an acceptable and understandable manner. My first idea was to just display the median, but I also want to account for the error in measurement.
The two options I'm considering are boxplots, or presenting a confidence interval obtained with bootstrapping. With boxplots I have some reservations about readability in combination with non-normal data.
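For reference, the boxplot option would just be a one-liner (again with foo as the data vector):
# Boxplot of the raw measurements
boxplot(foo, horizontal=TRUE)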
Therefore I've explored bootstrapping. However, using the mean (red line) seems wrong here, so I would use the 20% trimmed mean (purple line) instead.
Using the R library simpleboot, I could bootstrap the trimmed mean and get the bias-corrected and accelerated (BCa) bootstrap values, as discussed in Hesterberg's Bootstrap Methods and Permutation Tests, for the error ranges.
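Concretely, something like the following is what I have in mind (again assuming the data vector foo; as far as I can tell, one.boot() from simpleboot passes trim through to mean() and returns a boot-compatible object that boot.ci() accepts):
library(simpleboot)  # also loads the boot package
# Bootstrap the 20% trimmed mean of the measurements
b <- one.boot(foo, mean, R=2000, trim=0.2)
# BCa confidence interval from the bootstrap replicates
boot.ci(b, conf=0.95, type="bca")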
Summary:
I have non-normal data, which I'm not experienced with. I'm wondering whether using bootstrapping to derive the BCa interval is the (or a) correct way to present this type of data. I'm also worried that I'm overcomplicating this.
EDIT: as requested, the output of dput():
c(100272, 101960, 101972, 101988, 101988, 101988, 101988, 101988,
101988, 101988, 101408, 101428, 101432, 101432, 101432, 101432,
101432, 101432, 101432, 101432, 101432, 101432, 101432, 101432,
101432, 101432, 101432, 101432, 101428, 101400, 101420, 101424,
101428, 101428, 101428, 101428, 101428, 101428, 101428, 101428,
101428, 101428, 101428, 101428, 101428, 101428, 101428, 101428,
101428, 101420, 101424, 101428, 101428, 101428, 101428, 101428,
101428, 101428, 101428, 101432, 101432, 101436, 101436, 101436,
101436, 101436, 101436, 101436, 101412, 101432, 101436, 101436,
101436, 101436, 101436, 101436, 101436, 101436, 101436, 101436,
101436, 101436, 101436, 101436, 101436, 101436, 101436, 101408,
101428, 101432, 101436, 101436, 101436, 101436, 101436, 101436,
101436, 101436, 101436, 101436, 101436)
Best Answer
First off, there is not "the" correct way of presenting your data. Depending on what is important or relevant to you, there may be different possibilities.
You only have 101 data points. In such a situation, it often makes sense to plot the raw data instead of summary graphics like histograms or smoothed density plots.
The raw data can be shown using q-q-plots, as you do, or using the ECDF, as Frank Harrell suggests. However, I don't think a rug plot will be very enlightening, because of the sheer concentration of 83% of your data points in the interval $[101,428; 101,436]$.
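For instance, the ECDF is a one-liner in base R (using the vector foo defined in the code below):
# Empirical CDF; the near-vertical rise shows the concentration of values
plot(ecdf(foo), main="ECDF")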
I personally like the beanplot and the beeswarm, and since your data is essentially discrete (it all comes in multiples of 4), the sunflowerplot:
library(beanplot)
library(beeswarm)
foo <- c(100272, 101960, 101972, 101988, 101988, 101988, 101988, 101988,
         101988, 101988, 101408, 101428, 101432, 101432, 101432, 101432,
         101432, 101432, 101432, 101432, 101432, 101432, 101432, 101432,
         101432, 101432, 101432, 101432, 101428, 101400, 101420, 101424,
         101428, 101428, 101428, 101428, 101428, 101428, 101428, 101428,
         101428, 101428, 101428, 101428, 101428, 101428, 101428, 101428,
         101428, 101420, 101424, 101428, 101428, 101428, 101428, 101428,
         101428, 101428, 101428, 101432, 101432, 101436, 101436, 101436,
         101436, 101436, 101436, 101436, 101412, 101432, 101436, 101436,
         101436, 101436, 101436, 101436, 101436, 101436, 101436, 101436,
         101436, 101436, 101436, 101436, 101436, 101436, 101436, 101408,
         101428, 101432, 101436, 101436, 101436, 101436, 101436, 101436,
         101436, 101436, 101436, 101436, 101436)
opar <- par(mfrow=c(1,3))
beanplot(foo, col="grey", yaxt="n", main="Bean plot\nViolin plot", border="black")
beeswarm(foo, pch=19, main="Beeswarm\nplot")
sunflowerplot(x=rep(1,length(foo)), y=foo, main="Sunflower\nplot", xlab="", xaxt="n", ylab="")
par(opar)
All of these show the high concentration around specific values already mentioned, even if the beanplot is not overly pretty and the beeswarm cuts off a number of data points.
Here are a few more possibilities.
Edit: in each case, you can add the trimmed mean as a horizontal line using abline (see the sketch after the code below). I played around a bit with bootstrapping it; however, once you trim off 20%, the trimmed mean barely varies any more, so even a 99% confidence interval has a width of only 2.9 and is essentially invisible on these scales. In any case, your distribution is so enormously discrete that the mean, whether trimmed or not, does not really convey a lot of information. Given this utter discreteness, you may even want to plot a simple table of the values, which is the logical end point of making histogram bins smaller and smaller:
# Frequency of each distinct value, drawn as vertical lines
plot(as.numeric(names(table(foo))), table(foo), type="h", xlab="", ylab="")
axis(2)
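And for the abline suggestion above, a minimal sketch, to be run after one of the plots that put the values on the y axis (hence h=; the purple colour matches your trimmed-mean line):
# Add the 20% trimmed mean as a horizontal reference line
abline(h=mean(foo, trim=0.2), col="purple", lwd=2)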