Solved – The correct way to display non-normal data

In my university we learned a lot about normal data, but handling non-normal data wasn't really covered. For some benchmarking of an application, I have data with a very high frequency around one value:

[Figure: histogram of the benchmark data]

My standard checks for normality, in addition to the histogram, are qqplot() and qqline():

[Figure: Q-Q plot of the benchmark data]

It is clear that this dataset deviates from a normal distribution. Now the question remains: how to present this data in an acceptable and understandable manner. My first idea was to just display the median, but I also want to account for the error in measurement.

The two options I'm considering are boxplots or presenting a confidence interval (CI) obtained by bootstrapping. With boxplots, I have some reservations about readability in combination with non-normal data.

Therefore I've explored bootstrapping. However, bootstrapping the mean (red line) seems wrong, so I would use the 20% trimmed mean (purple line) instead:

[Figure: the data with the mean (red line) and the 20% trimmed mean (purple line) marked]

Using the R library simpleboot, I could bootstrap the trimmed mean and use the bias-corrected and accelerated (BCa) bootstrap interval, as discussed in Bootstrap Methods and Permutation Tests (Hesterberg), for the error ranges.
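As a rough sketch of that idea, the following uses the boot package rather than simpleboot (a swapped-in choice, since boot ships with standard R installations). The names vals, counts and trimmed_mean, the seed, and the number of resamples are illustrative assumptions. The data vector is rebuilt from the value frequencies of the dput output further down, which is equivalent here because the order of the observations does not matter for the bootstrap.

```r
library(boot)  # recommended package, assumed to be installed

# Benchmark values from the question, rebuilt from their frequencies
# (the order of observations is irrelevant for the bootstrap).
vals   <- c(100272, 101400, 101408, 101412, 101420, 101424,
            101428, 101432, 101436, 101960, 101972, 101988)
counts <- c(1, 1, 2, 1, 2, 2, 28, 20, 35, 1, 1, 7)
foo    <- rep(vals, times = counts)

# Statistic in the form boot() expects: the data plus a vector of
# resampled indices. Here: the 20% trimmed mean.
trimmed_mean <- function(d, idx) mean(d[idx], trim = 0.2)

set.seed(1)  # arbitrary seed, only for reproducibility
b  <- boot(foo, statistic = trimmed_mean, R = 5000)
ci <- boot.ci(b, conf = 0.95, type = "bca")

ci$t0   # trimmed mean of the original sample
ci$bca  # columns 4 and 5 hold the lower and upper BCa limits
```

Whether BCa is worth the extra machinery over a plain percentile interval is a judgment call for data this discrete.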

Summary:

I have non-normal data, which I'm not experienced with. I'm wondering whether using bootstrapping to derive the BCa interval is the/a correct way to present this type of data. I'm also worried that I'm overcomplicating this.

EDIT:
As requested, here is the output of dput:


100272, 101960, 101972, 101988, 101988, 101988, 101988, 101988,
101988, 101988, 101408, 101428, 101432, 101432, 101432, 101432,
101432, 101432, 101432, 101432, 101432, 101432, 101432, 101432,
101432, 101432, 101432, 101432, 101428, 101400, 101420, 101424,
101428, 101428, 101428, 101428, 101428, 101428, 101428, 101428,
101428, 101428, 101428, 101428, 101428, 101428, 101428, 101428,
101428, 101420, 101424, 101428, 101428, 101428, 101428, 101428,
101428, 101428, 101428, 101432, 101432, 101436, 101436, 101436,
101436, 101436, 101436, 101436, 101412, 101432, 101436, 101436,
101436, 101436, 101436, 101436, 101436, 101436, 101436, 101436,
101436, 101436, 101436, 101436, 101436, 101436, 101436, 101408,
101428, 101432, 101436, 101436, 101436, 101436, 101436, 101436,
101436, 101436, 101436, 101436, 101436

Best Answer

First off, there is no single "correct" way of presenting your data. Depending on what is important or relevant to you, there may be different possibilities.

You only have 101 data points. In such a situation, it often makes sense to plot the raw data instead of summary graphics like histograms or smoothed density plots.

The raw data can be shown using q-q-plots, as you do, or using the ECDF, as Frank Harrell suggests. However, I don't think a rug plot will be very enlightening, because of the sheer concentration of your data: 83 of the 101 data points (about 82%) lie in the interval $[101428, 101436]$.
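A minimal sketch of the ECDF option, using only base R. The names vals, counts and in_bulk and the zoom limits are illustrative assumptions. The vector is rebuilt from the value frequencies of the posted data, which gives the same ECDF as the original ordering.

```r
# Benchmark values from the question, rebuilt from their frequencies.
vals   <- c(100272, 101400, 101408, 101412, 101420, 101424,
            101428, 101432, 101436, 101960, 101972, 101988)
counts <- c(1, 1, 2, 1, 2, 2, 28, 20, 35, 1, 1, 7)
foo    <- rep(vals, times = counts)

# Number of observations in the dense interval [101428, 101436].
in_bulk <- sum(foo >= 101428 & foo <= 101436)
in_bulk

# ECDF over the full range, then zoomed in on the bulk of the data,
# because the single low value at 100272 stretches the x-axis.
opar <- par(mfrow = c(1, 2))
plot(ecdf(foo), main = "ECDF (full range)", xlab = "value")
plot(ecdf(foo), main = "ECDF (zoomed)", xlab = "value",
     xlim = c(101390, 101450))
par(opar)
```

The zoomed panel makes the three large steps at 101428, 101432 and 101436 visible, which the full-range panel compresses into what looks like a single jump.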

I personally like the beanplot and the beeswarm, and since your data is essentially discrete (it all comes in multiples of 4), the sunflowerplot:

library(beanplot)
library(beeswarm)

foo <- c(100272, 101960, 101972, 101988, 101988, 101988, 101988, 101988,
  101988, 101988, 101408, 101428, 101432, 101432, 101432, 101432,
  101432, 101432, 101432, 101432, 101432, 101432, 101432, 101432,
  101432, 101432, 101432, 101432, 101428, 101400, 101420, 101424,
  101428, 101428, 101428, 101428, 101428, 101428, 101428, 101428,
  101428, 101428, 101428, 101428, 101428, 101428, 101428, 101428,
  101428, 101420, 101424, 101428, 101428, 101428, 101428, 101428,
  101428, 101428, 101428, 101432, 101432, 101436, 101436, 101436,
  101436, 101436, 101436, 101436, 101412, 101432, 101436, 101436,
  101436, 101436, 101436, 101436, 101436, 101436, 101436, 101436,
  101436, 101436, 101436, 101436, 101436, 101436, 101436, 101408,
  101428, 101432, 101436, 101436, 101436, 101436, 101436, 101436,
  101436, 101436, 101436, 101436, 101436)

# Three panels side by side; restore the graphics settings afterwards
opar <- par(mfrow=c(1,3))
  beanplot(foo,col="grey",yaxt="n",main="Bean plot/\nViolin plot",border="black")
  beeswarm(foo,pch=19,main="Beeswarm\nplot")
  sunflowerplot(x=rep(1,length(foo)),y=foo,main="Sunflower\nplot",xlab="",xaxt="n",ylab="")
par(opar)

[Figure: bean plot, beeswarm plot, and sunflower plot of the data]

All of these show the high concentration around specific values already mentioned, even if the beanplot is not overly pretty and the beeswarm cuts off a number of data points.

Here are a few more possibilities.

Edit: in each case, you can add the trimmed mean as a horizontal line using abline. I played around a bit with bootstrapping it; however, once you trim off 20%, the trimmed mean almost does not vary at all any more, so even a 99% confidence interval has a width of only 2.9 and is essentially invisible on these scales. In any case, your distribution is so enormously discrete that the mean, whether trimmed or not, does not really convey a lot of information. Given this utter discreteness, you may even want to plot a simple table of the values, which is the logical end point of making histogram bins smaller and smaller:

plot(as.numeric(names(table(foo))),table(foo),type="h",xlab="",ylab="")
axis(2)

[Figure: frequency of each distinct value, drawn as vertical lines]
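The trimmed-mean bootstrap mentioned in the edit above could look roughly like the following base R sketch. It uses a simple percentile interval, which is an assumption: the exact method behind the quoted interval width is not stated. The names tm, boot_tm and ci99, the seed, and the number of resamples are likewise illustrative. The data vector is rebuilt from the value frequencies, which does not affect the result.

```r
# Benchmark values from the question, rebuilt from their frequencies.
vals   <- c(100272, 101400, 101408, 101412, 101420, 101424,
            101428, 101432, 101436, 101960, 101972, 101988)
counts <- c(1, 1, 2, 1, 2, 2, 28, 20, 35, 1, 1, 7)
foo    <- rep(vals, times = counts)

# 20% trimmed mean of the observed sample.
tm <- mean(foo, trim = 0.2)

# Nonparametric bootstrap: resample with replacement, recompute the statistic.
set.seed(1)  # arbitrary seed, only for reproducibility
boot_tm <- replicate(5000, mean(sample(foo, replace = TRUE), trim = 0.2))

# 99% percentile interval and its width.
ci99 <- quantile(boot_tm, c(0.005, 0.995))
diff(ci99)

# Add the trimmed mean (solid) and the interval limits (dashed) as
# horizontal lines to a plot that has the values on the y-axis.
sunflowerplot(x = rep(1, length(foo)), y = foo, xlab = "", xaxt = "n", ylab = "")
abline(h = tm, col = "purple")
abline(h = ci99, col = "purple", lty = 2)
```

On the scale of the full data range, the three lines will practically coincide, which illustrates the point about the interval being invisible.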
