I have three questions on the title subject.

**Why is it necessary to do a normality test?** Is it to check whether the data are imbalanced?

Are the four methods below for checking whether data follow a normal distribution **applicable to both numerical and categorical variables?**

I am trying to check whether the data follow a normal distribution using these four methods:

- Checking the distribution
- Drawing a box plot
- Drawing a Q-Q plot
- Using skewness and kurtosis criteria

Skewness for a normal distribution is 0 and kurtosis is 3. Is there a **certain bound that I can use** to guarantee that the data are normally distributed (such as 0 +/- 1 or 3 +/- 1)?


#### Best Answer

1) Some statistical tests are exact only if data are a random sample from a normal population. So it can be important to check whether samples are consistent with having come from a normal population. Some frequently used tests, such as t tests, are tolerant of certain departures from normality, especially when sample sizes are large.
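The tolerance of t tests to nonnormality can be checked by simulation. The sketch below is an illustration under assumptions of my own choosing (exponential data, sample sizes 10 and 200, 10,000 replicates, and the helper name `t_reject_rate`); none of these come from the discussion above. It estimates how often a nominal 5% one-sample t test rejects a true null hypothesis.

```r
# Sketch: actual rejection rate of a nominal 5% one-sample t test when the
# null hypothesis is true but the population is exponential (mean 1).
set.seed(2021)

t_reject_rate <- function(n, m = 10000) {
  # Each replicate draws n exponential values and tests H0: mu = 1 (true).
  pv <- replicate(m, t.test(rexp(n), mu = 1)$p.value)
  mean(pv < 0.05)
}

rate_small <- t_reject_rate(10)    # small sample from a skewed population
rate_large <- t_reject_rate(200)   # large sample from the same population
c(n10 = rate_small, n200 = rate_large)
```

For strongly skewed data such as these, the rejection rate at $n = 10$ is typically well above the nominal 5%, while at $n = 200$ it should be much closer to it.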

Various tests of normality ($H_0:$ normal vs $H_a:$ not normal) are in use. We illustrate Kolmogorov-Smirnov and Shapiro-Wilk tests below. They are often useful, but not perfect:

- If sample sizes are small, these tests tend not to reject samples from populations that are nearly symmetrical and lack long tails.
- If sample sizes are very large, these tests may detect departures from normality that are unimportant for practical purposes.

[I don't know what you mean by 'imbalanced'.]
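Both sample-size caveats can be seen in a small simulation. This is a sketch with populations chosen by me for illustration: uniform samples of size 10 (clearly nonnormal, but symmetrical with no long tails) and samples of size 5000 from a t distribution with 10 degrees of freedom (close to normal for many purposes). The replicate count of 1000 is also an arbitrary choice.

```r
# Sketch: Shapiro-Wilk rejection rates at the 5% level for two scenarios.
set.seed(2022)
m <- 1000

# Small samples from a uniform population: nonnormal, yet rarely rejected.
power_small <- mean(replicate(m, shapiro.test(runif(10))$p.value) < 0.05)

# Large samples from t(10): nearly normal, yet rejected most of the time.
# (shapiro.test accepts at most 5000 observations.)
power_large <- mean(replicate(m, shapiro.test(rt(5000, df = 10))$p.value) < 0.05)

c(uniform_n10 = power_small, t10_n5000 = power_large)
```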

2) For normal data, **Q-Q plots** tend to plot data points in almost a straight line. Some sample points with smallest and largest values may stray farther from the line than points between the lower and upper quartiles. Fit to a straight line is usually better for larger samples. Usually, one uses Q-Q plots (also called 'normal probability plots') to judge normality by eye—perhaps without doing a formal test.
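To make the construction concrete, here is a minimal sketch of how the points of a normal Q-Q plot can be computed by hand in base R. The sample `y` is a fresh simulated sample (an assumption for illustration, not one of the datasets used elsewhere in this answer).

```r
# Sketch: coordinates of a normal Q-Q plot, computed without qqnorm().
set.seed(424)
y <- rnorm(75)
n <- length(y)

# Theoretical standard normal quantiles at the plotting positions ppoints(n).
theo <- qnorm(ppoints(n))

# Sorted data paired with theoretical quantiles are the plotted points;
# qqnorm() returns the same theoretical quantiles in its x component.
qq <- qqnorm(y, plot.it = FALSE)
same_points <- isTRUE(all.equal(sort(qq$x), theo))

# Correlation of the plotted points: near 1 when the plot is nearly linear.
r <- cor(theo, sort(y))
```

Calling `plot(theo, sort(y))` draws the same points that `qqnorm(y)` would, with data values on the vertical axis.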

*Examples:* Here are Q-Q plots from R statistical software of a small standard uniform sample, a moderate sized standard normal sample, and a large standard exponential sample. Only the normal sample shows a convincing fit to the red line. (The uniform sample does not have enough points to judge goodness-of-fit.)

```r
set.seed(424)
u = runif(10);  z = rnorm(75);  x = rexp(1000)
par(mfrow=c(1,3))
 qqnorm(u); qqline(u, col="red")
 qqnorm(z); qqline(z, col="red")
 qqnorm(x); qqline(x, col="red")
par(mfrow=c(1,1))
```

[In R, the default is to put data values on the vertical axis (with the option to switch axes); many textbooks and some statistical software put data values on the horizontal axis.]

The null hypothesis for a **Kolmogorov-Smirnov test** is that data come from a *specific* normal distribution, with known values for $\mu$ and $\sigma.$

*Examples:* The first test shows that sample `z` from above is consistent with sampling from $\mathsf{Norm}(0, 1).$ The second illustrates that the KS-test can be used with distributions other than normal. Appropriately, neither test rejects.

```r
ks.test(z, pnorm, 0, 1)
##         One-sample Kolmogorov-Smirnov test
##
## data:  z
## D = 0.041243, p-value = 0.999
## alternative hypothesis: two-sided

ks.test(x, pexp, 1)
##         One-sample Kolmogorov-Smirnov test
##
## data:  x
## D = 0.024249, p-value = 0.5989
## alternative hypothesis: two-sided
```

The null hypothesis for a **Shapiro-Wilk** test is that data come from *some* normal distribution, for which $\mu$ and $\sigma$ may be unknown. Other good tests for the same general hypothesis are in frequent use.

*Examples:* The first Shapiro-Wilk test shows that sample `z` is consistent with sampling from some normal distribution. The second test shows good fit for a larger sample from a different normal distribution.

```r
shapiro.test(z)
##         Shapiro-Wilk normality test
##
## data:  z
## W = 0.99086, p-value = 0.8715

shapiro.test(rnorm(200, 100, 15))
##         Shapiro-Wilk normality test
##
## data:  rnorm(200, 100, 15)
## W = 0.99427, p-value = 0.6409
```
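The difference between the two null hypotheses (a *specific* normal distribution for K-S, *some* normal distribution for Shapiro-Wilk) can be illustrated with one normal sample tested three ways. This is a sketch; the sample `w` and the seed are my own choices for illustration.

```r
# Sketch: one normal sample, three tests.
set.seed(426)
w <- rnorm(200, 100, 15)

# K-S against the wrong specific normal distribution: decisively rejected,
# even though the data are normal.
ks_wrong <- ks.test(w, pnorm, 0, 1)$p.value

# K-S against the correct specific normal distribution.
ks_right <- ks.test(w, pnorm, 100, 15)$p.value

# Shapiro-Wilk needs no values for the mean and SD.
sw <- shapiro.test(w)$p.value

c(ks_wrong = ks_wrong, ks_right = ks_right, sw = sw)
```

The first P-value is essentially zero because the hypothesized parameters are wrong, not because the data are nonnormal. Under their true null hypotheses, the other two P-values fall below 0.05 only about 5% of the time.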

*Addendum on the relatively low power of the Kolmogorov-Smirnov test,* prompted by @NickCox's comment. We took $m = 10^5$ simulated datasets of size $n = 25$ from each of three populations: standard uniform, 'bathtub-shaped' $\mathsf{Beta}(.5, .5),$ and standard exponential. The null hypothesis in each case is that data are normal with population mean and SD matching the distribution simulated (e.g., $\mathsf{Norm}(\mu=1/2, \sigma=\sqrt{1/8})$ for the beta data).

Power (rejection probability) of the K-S test (5% level) was $0.111$ for uniform, $0.213$ for beta, and $0.241$ for exponential. By contrast, power for the Shapiro-Wilk test, testing the null hypothesis that the population has some normal distribution (level 5%), was $0.286, 0.864, 0.922,$ respectively.

The R code for the exponential datasets is shown below. All power values for both tests and each distribution are likely accurate to within about $\pm 0.002$ or $\pm 0.003.$

```r
set.seed(425);  m = 10^5;  n=25
pv = replicate(m, shapiro.test(rexp(n))$p.val)
mean(pv < .05);  2*sd(pv < .05)/sqrt(m)
## [1] 0.9216
## [1] 0.001700049

set.seed(425)
pv = replicate(m, ks.test(rexp(25), pnorm, 1, 1)$p.val)
mean(pv < .05);  2*sd(pv < .05)/sqrt(m)
## [1] 0.24061
## [1] 0.002703469
```

Neither test is very useful for distinguishing a uniform sample of size $n=25$ from normal. Using the S-W test, samples of this size from populations with more distinctively nonnormal shapes are detected as nonnormal with reasonable power.
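The same simulation can be wrapped in a function so that all three populations are handled in one pass. This is a sketch rather than the code used for the figures reported above: the helper `power_sim` is hypothetical, and it uses only 2000 replicates per case for speed, so its results should agree with those figures only to within a few hundredths.

```r
# Sketch: power of Shapiro-Wilk and K-S tests (5% level) for samples of size n.
# rgen draws a sample; null_mean and null_sd define the K-S null distribution.
power_sim <- function(rgen, null_mean, null_sd, n = 25, m = 2000) {
  sw <- replicate(m, shapiro.test(rgen(n))$p.value)
  ks <- replicate(m, ks.test(rgen(n), pnorm, null_mean, null_sd)$p.value)
  c(SW = mean(sw < 0.05), KS = mean(ks < 0.05))
}

set.seed(425)
pw_unif <- power_sim(runif, 1/2, sqrt(1/12))                          # uniform
pw_beta <- power_sim(function(n) rbeta(n, 0.5, 0.5), 1/2, sqrt(1/8))  # beta
pw_exp  <- power_sim(rexp, 1, 1)                                      # exponential

rbind(uniform = pw_unif, beta = pw_beta, exponential = pw_exp)
```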

A **boxplot** is not really intended as a way to check for normality. However, boxplots do show outliers. Normal distributions extend in theory to $\pm\infty,$ even though values beyond $\mu \pm k\sigma$ for $k = 3$ and especially $k = 4$ are quite rare. Consequently, *very many extreme outliers* in a boxplot may indicate nonnormality, especially if most of the outliers are in the same tail.
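How rare such values are can be computed directly from the normal CDF. The sketch below also computes the chance that a single normal observation falls outside boxplot fences, assuming the usual conventions of 1.5 and 3 interquartile ranges beyond the quartiles and using population rather than sample quartiles.

```r
# Sketch: tail probabilities for a normal population.
k <- c(3, 4)
p_beyond <- 2 * pnorm(-k)         # P(|X - mu| > k*sigma): about 0.0027, 0.000063

# Population quartiles and IQR of the standard normal distribution.
q3  <- qnorm(0.75)
iqr <- 2 * q3

# Fences at 1.5 IQR (outliers) and 3 IQR (extreme outliers) beyond the quartiles.
fence_near <- q3 + 1.5 * iqr      # about 2.70 SDs from the mean
fence_far  <- q3 + 3.0 * iqr      # about 4.72 SDs from the mean

p_outlier <- 2 * pnorm(-fence_near)   # about 0.007 per observation
p_extreme <- 2 * pnorm(-fence_far)    # on the order of a few per million
```

So an occasional point beyond the 1.5 IQR fences is expected in moderately large normal samples, while points beyond the 3 IQR fences are very unlikely.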

*Examples:* The boxplot at left displays the normal sample `z`. It shows a symmetrical distribution and there happens to be one near outlier. The plot at right displays dataset `x`; it is characteristic of exponential samples of this size to show many high outliers, some of them extreme.

```r
par(mfrow=c(1,2))
 boxplot(z, col="skyblue2")
 boxplot(x, col="skyblue2")
par(mfrow=c(1,1))
```

The 20 boxplots below illustrate that normal samples of size 100 often have a few boxplot outliers. So seeing a few near outliers in a boxplot is not to be taken as a warning that data may not be normal.

```r
set.seed(1234)
x = rnorm(20*100, 100, 15)
g = rep(1:20, each=100)
boxplot(x ~ g, col="skyblue2", pch=20)
```

More specifically, the simulation below shows that, among normal samples of size $n = 100,$ about half show at least one boxplot outlier and the average number of outliers is about $0.9.$

```r
set.seed(2020)
nr.out = replicate(10^5, length(boxplot.stats(rnorm(100))$out))
mean(nr.out)
## [1] 0.9232
mean(nr.out > 0)
## [1] 0.52331
```

Sample **skewness** far from $0$ or sample **kurtosis** far from $3$ (or excess kurtosis far from $0$) can indicate nonnormal data. (See Comment by @NickCox.) The question is how far is too far. Personally, I have not found sample skewness and kurtosis to be more useful than other methods discussed above. I will let people who favor using these descriptive measures as normality tests explain how and with what success they have done so.
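One way to get a feel for "how far is too far" is to simulate how much these statistics vary for samples that really are normal. The sketch below uses simple moment-based definitions of skewness and kurtosis (the helpers `skew`, `kurt`, and `normal_range` are hypothetical names written for this illustration; packages may use slightly different estimators) and reports the central 95% range for two sample sizes.

```r
# Sketch: sampling variability of skewness and kurtosis for normal data.
skew <- function(x) { d <- x - mean(x); mean(d^3) / mean(d^2)^1.5 }
kurt <- function(x) { d <- x - mean(x); mean(d^4) / mean(d^2)^2 }

normal_range <- function(n, m = 10000) {
  # Rows of s are the two statistics; columns are simulated normal samples.
  s <- replicate(m, { x <- rnorm(n); c(skew = skew(x), kurt = kurt(x)) })
  # Central 95% range of each statistic across the m samples.
  apply(s, 1, quantile, probs = c(0.025, 0.975))
}

set.seed(2023)
q25  <- normal_range(25)
q400 <- normal_range(400)
q25
q400
```

The ranges are much wider for $n = 25$ than for $n = 400,$ so a fixed bound such as $0 \pm 1$ or $3 \pm 1$ cannot serve equally well at every sample size, and no such bound can guarantee normality.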
