In this question, I would like to ask two things:
- outlier detection
- normality test
Details are as follows:
I need to detect and remove outliers in my data. Before doing that, I want to test if my data is normally distributed or not. I have two variables, X (independent) and Y (dependent), with 951 records each.
I want to know: when testing for normality, do I need to consider both variables simultaneously, or each variable one at a time? (I have read somewhere that only the dependent variable is tested for normality.)
The attached figures show the results of the normality test (Analyse >> Descriptive >> Explore) for the dependent variable. If the normality test is done only on the dependent variable, it shows that the data is highly skewed. In that case, how can I remove the outliers?
The Shapiro-Wilk and Kolmogorov-Smirnov tests both report a significance (p-value) of 0.00. The skewness statistic is 22.909 with a standard error of 0.079.
Best Answer
You write:
"I need to detect and remove outliers in my data"
Why do you need to do this? Detecting outliers is a good thing, but automatically removing them is not (and that seems to be what you want to do). Since your question suggests you have some sort of regression problem, you should consider keeping all the data and switching to a regression method that is less sensitive to outliers, e.g. quantile regression or robust regression.
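To illustrate the idea of keeping the outliers and changing the estimator instead, here is a minimal sketch in plain Python. It uses the Theil-Sen estimator (the median of all pairwise slopes), which is just one simple robust regression method and not necessarily the one you would pick in practice. The data are simulated, and the function names `ols_fit` and `theil_sen_fit` are made up for this example.

```python
import random
import statistics
from itertools import combinations


def ols_fit(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def theil_sen_fit(x, y):
    """Theil-Sen fit: slope is the median of all pairwise slopes."""
    slopes = [
        (y[j] - y[i]) / (x[j] - x[i])
        for i, j in combinations(range(len(x)), 2)
        if x[j] != x[i]
    ]
    slope = statistics.median(slopes)
    # Intercept taken as the median of y - slope * x.
    intercept = statistics.median([yi - slope * xi for xi, yi in zip(x, y)])
    return intercept, slope


# Simulated (hypothetical) data: true line y = 1 + 2x plus a few gross outliers.
rng = random.Random(0)
x = [rng.uniform(0, 10) for _ in range(200)]
y = [1 + 2 * xi + rng.gauss(0, 1) for xi in x]
for k in range(10):
    y[k] += 200  # contaminate 5% of the records, but keep them in the data

ols_intercept, ols_slope = ols_fit(x, y)
ts_intercept, ts_slope = theil_sen_fit(x, y)
print(f"OLS:       intercept={ols_intercept:.2f}, slope={ols_slope:.2f}")
print(f"Theil-Sen: intercept={ts_intercept:.2f}, slope={ts_slope:.2f}")
```

No record is deleted here. The contaminated points are still in the data, but they have far less influence on the Theil-Sen fit than on the OLS fit.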
You should also be aware that even OLS regression makes no assumption about the distribution of the variables themselves (except that the DV is continuous or nearly so). Its normality assumption is about the errors.
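A small simulation may make this concrete. The sketch below assumes a lognormal X and normal errors (both choices are mine, purely for illustration) and uses 951 simulated records only to mirror the size of your data set. The DV comes out strongly skewed even though the errors are normal, so looking at Y alone says little about the assumption that matters.

```python
import random
import statistics


def skewness(values):
    """Moment-based sample skewness (g1), no small-sample correction."""
    m = statistics.fmean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


rng = random.Random(42)
n = 951  # same count as in the question; the values themselves are simulated
x = [rng.lognormvariate(0, 1) for _ in range(n)]   # skewed predictor
y = [3 + 2 * xi + rng.gauss(0, 1) for xi in x]     # normal errors

# Fit OLS by hand and compute the residuals.
mx, my = statistics.fmean(x), statistics.fmean(y)
slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
intercept = my - slope * mx
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

print(f"skewness of Y:         {skewness(y):.2f}")
print(f"skewness of residuals: {skewness(residuals):.2f}")
```

If you want to check anything for normality, check the residuals of the fitted model, not X or Y on their own.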
You then write:
"Before doing that, I want to test if my data is normally distributed or not"
Again, why? But if you do want to assess normality, then, as @brad said in his answer, graphical methods are best. I like both density plots (as Brad suggested) and quantile plots (as Nick suggested). However, the latter take a bit of experience to read well. You could also try box plots.
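If you want to build a normal quantile plot by hand, the points are easy to compute. This is a rough sketch using only the Python standard library. The helper name `normal_qq_points` is made up, the sample is simulated, and the plotting positions (i - 0.5) / n are only one of several common conventions.

```python
import random
from statistics import NormalDist


def normal_qq_points(sample):
    """Return (theoretical, observed) pairs for a normal quantile plot."""
    ordered = sorted(sample)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Plotting positions (i - 0.5) / n: one common convention among several.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


# Simulated right-skewed sample (hypothetical data, not the asker's).
rng = random.Random(1)
skewed = [rng.lognormvariate(0, 1) for _ in range(200)]
points = normal_qq_points(skewed)

# Print the three smallest and three largest points of the plot.
for theo, obs in points[:3] + points[-3:]:
    print(f"{theo:+.2f}  {obs:.2f}")
```

Plot the pairs with whatever tool you like, theoretical quantiles on one axis and observed values on the other. Roughly normal data fall close to a straight line, while a right-skewed sample bends upward at the high end.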
Then you write:
"I want to know: when testing for normality, do I need to consider both variables simultaneously, or each variable one at a time? (I have read somewhere that only the dependent variable is tested for normality.)"
This makes me strongly suspect you are doing regression. As I noted above, neither variable needs to be normally distributed (I don't doubt that you read what you read, but it's incorrect).
Finally, you show a histogram of your DV. The histogram is not a very useful plot (as William S. Cleveland notes in his books).