# Solved – Is bootstrapping appropriate for this continuous data

I'm a complete newbie 🙂

I'm doing a study with a sample size of 10,000 from a population of about 745,000. Each observation is a "percentage similarity". The great majority of the values are around 97%–98%, but a few are between 60% and 90%; that is, the distribution is heavily negatively skewed. Around 0.6% of the results are 0%, but these will be treated separately from the sample.

The mean of all 10,000 observations is 97.7%, and just in Excel, the StdDev is 3.20 percentage points. I understand that the StdDev is not really applicable here because the results are not normally distributed (and because mean + 3.20 would put you above 100%!).

My questions are:

1. Is bootstrapping (a new concept for me) appropriate?
2. Am I bootstrapping correctly? 🙂
3. What is a sufficient sample size?

What I am doing is resampling (with replacement) my 10,000 results and calculating a new mean. I do this a few thousand times and store each mean in an array. I then calculate the "mean of the means" and this is my statistical result. To work out the 99% CI, I choose the 0.5%-th value and the 99.5%-th value, and this produces a very tight range: 97.4% – 98.0%. Is this a valid result or am I doing something wrong?
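For concreteness, here is roughly what I'm doing, sketched in R (the vector `x` below is simulated stand-in data with a shape like my sample's, since I can't paste my real values):

```r
set.seed(42)
# Stand-in for my 10,000 "percentage similarity" values (heavily left-skewed).
x <- rbeta(10^4, 20, 0.5) * 100

B <- 5000                                 # number of bootstrap resamples
boot.means <- replicate(B, mean(sample(x, length(x), replace = TRUE)))

mean(boot.means)                          # "mean of the means"
quantile(boot.means, c(0.005, 0.995))     # 99% percentile interval
```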

As for sample size, I am sampling only about 1.3% of the population – I have no idea if this is "enough". How do I know if my sample is representative of the population? Ideally, I'd like to be 99% confident of a mean that is within ±0.50 percentage points (i.e. 97.2% – 98.2%).

Thanks in advance for any tips!


The standard deviation is as applicable here as anywhere else: it gives useful information about the dispersion of the data. In particular, the sd divided by the square root of the sample size is one standard error: it estimates the dispersion of the sampling distribution of the mean. Let's calculate:

$$3.2\% / \sqrt{10000} = 0.032\% = 0.00032.$$

That's tiny, far smaller than the $\pm 0.50\%$ precision you seek.

Although the data are not Normally distributed, the sample mean is extremely close to Normally distributed because the sample size is so large. Here, for instance, is a histogram of a sample with the same characteristics as yours and, at its right, the histogram of the means of a thousand additional samples from the same population. It looks very close to Normal, doesn't it?

Thus, although it appears you are bootstrapping correctly, bootstrapping is not needed: a symmetric $100 - \alpha\%$ confidence interval for the mean is obtained, as usual, by multiplying the standard error by an appropriate percentile of the standard Normal distribution (to wit, $Z_{1-\alpha/200}$) and moving that distance to either side of the mean. In your case, $Z_{1-\alpha/200} = 2.5758$, so the $99\%$ confidence interval is

$$\left(0.977 - 2.5758(0.032)/\sqrt{10000},\ 0.977 + 2.5758(0.032)/\sqrt{10000}\right) = \left(97.62\%,\ 97.78\%\right).$$
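This arithmetic is quick to check in R (0.977 and 0.032 are the summary statistics quoted above):

```r
xbar <- 0.977                   # sample mean
s    <- 0.032                   # sample SD
n    <- 10^4                    # sample size
z    <- qnorm(1 - 0.01/2)       # 2.5758, the 99.5th Normal percentile
xbar + c(-1, 1) * z * s / sqrt(n)
# about (0.9762, 0.9778), i.e. 97.62% to 97.78%
```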

A sufficient sample size can be found by inverting this relationship. Here it tells us that you need a sample size of around

$$\left(3.2\% \,/\, (0.5\% / Z_{1-\alpha/200})\right)^2 \approx 272.$$
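The same inversion can be verified directly (again using the quoted SD of 3.2% and the desired half-width of 0.5 percentage points):

```r
s      <- 0.032                 # sample SD (3.2%)
margin <- 0.005                 # desired half-width (0.5 percentage points)
z      <- qnorm(1 - 0.01/2)     # 2.5758 for 99% confidence
ceiling((s / (margin / z))^2)   # 272
```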

This is small enough that we might want to re-check the conclusion that the sampling distribution of the mean is Normal. I drew a sample of $272$ from my population and bootstrapped its mean ($9999$ iterations). Sure enough, the distribution of bootstrapped means looks Normal. In fact, the bootstrapped confidence interval of $(97.16\%, 98.21\%)$ is almost identical to the Normal-theory CI of $(97.19\%, 98.24\%)$.

As these examples show, the absolute sample size determines the accuracy of estimates rather than the proportion of the population size. (An extreme but intuitive example is that a single drop of seawater can provide an accurate estimate of the concentration of salt in the ocean, even though that drop is such a tiny fraction of all the seawater.) For your stated purposes, obtaining a sample of $10000$ (which requires more than $36$ times as much work as a sample of $272$) is overkill.

`R` code to perform these analyses and plot these graphics follows. It samples from a population having a Beta distribution with a mean of $0.977$ and SD of $0.032$.

```r
set.seed(17)
#
# Study a sample of 10,000.
#
Sample <- rbeta(10^4, 20.4626, 0.4817)
hist(Sample)
hist(replicate(10^3, mean(rbeta(10^4, 20.4626, 0.4817))),
     xlab="%", main="1000 Sample Means")
#
# Analyze a sample designed to achieve a CI of width 1%.
#
(n.sample <- ceiling((0.032 / (0.005 / qnorm(1-0.005)))^2))
Sample <- rbeta(n.sample, 20.4626, 0.4817)
cat(round(mean(Sample), 3), round(sd(Sample), 3))  # Sample statistics
se.mean <- sd(Sample) / sqrt(length(Sample))       # Standard error of the mean
cat("CL: ", round(mean(Sample) + qnorm(0.005)*c(1,-1)*se.mean, 5))  # Normal CI
#
# Compare the bootstrapped CI of this sample.
#
Bootstrapped.means <- replicate(9999,
  mean(sample(Sample, length(Sample), replace=TRUE)))
hist(Bootstrapped.means)
cat("Bootstrap CL:", round(quantile(Bootstrapped.means, c(0.005, 1-0.005)), 5))
```
