Solved – R: How to interpret the QQplot’s outlier numbers

How should I interpret the numeric labels on the outlying points when I draw the following Q-Q plot in R?

    set.seed(1)
    y <- rnorm(100)
    x <- rnorm(100)
    plot(lm(y ~ x), which=2)   # which = 2 gives the plot

It gives a number 61 on the top. What is it?

I figured it might be the index of the outlying (x, y) pair. The labelled point appears to sit at around y = 3 and x = 3, but when I look up row 61 of the data I get different values:

    > cbind(y,x)[61,]
            y         x
    2.4016178 0.4251004

How should I read these numbers in R's Q-Q plot?

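The number in the plot is the index (row number) of the observation, so it identifies both the standardized residual and the corresponding row of the original data. The axes of the Q-Q plot are theoretical quantiles and standardized residuals, not x and y, which is why the position of the labelled point does not match the raw values in row 61. By default, R labels the three most extreme residuals, even if they barely deviate from the Q-Q line, so a label does not by itself mean that the fit is bad. The number of labelled points can be changed with the id.n argument. Let me illustrate this with your example: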

    set.seed(1)
    y <- rnorm(100)
    x <- rnorm(100)
    lm.mod <- lm(y ~ x)                # linear regression model
    plot(lm.mod, which=2)              # Q-Q plot
    lm.resid <- residuals(lm.mod)      # save the residuals

    # sort the absolute values of the residuals (first three shown)
    sort(abs(lm.resid), decreasing=TRUE)

            14         61         24
    2.32415869 2.29316200 2.09837122

The three most extreme residuals belong to observations 14, 61 and 24. These are the numbers shown in the plot, and they are the row indices of the original data. So data points 14, 24 and 61 are the ones that produce the most extreme residuals. We can also mark them in a scatterplot (the blue points). Note that because you generated y and x independently, the fitted slope is close to zero and the regression line lies near the mean of y:

    # The original data points corresponding to the 3 most extreme residuals
    cbind(x,y)[c(14, 24, 61), ]

                  x         y
    [1,] -0.6506964 -2.214700
    [2,] -0.1795565 -1.989352
    [3,]  0.4251004  2.401618

    # Make a scatterplot of the original data and mark the three points
    # and add the residuals
    par(bg="white", cex=1.6)
    plot(y~x, pch=16, las=1)
    abline(lm.mod, lwd=2)   # add regression line
    pre <- predict(lm.mod)

    # Add the residual lines
    segments(x[c(14, 24, 61)], y[c(14, 24, 61)], x[c(14, 24, 61)],
             pre[c(14, 24, 61)], col="red", lwd=2)

    # Add the points
    points(x[c(14, 24, 61)], y[c(14, 24, 61)], pch=16, cex=1.1, col="steelblue", las=1)
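
Since the Q-Q plot is drawn from standardized residuals rather than raw ones, the labels can also be cross-checked against rstandard(). The following is only a minimal sketch on the simulated data from the question; the names rs and top3 are made up for illustration and do not appear in the original answer.

```r
set.seed(1)
y <- rnorm(100)
x <- rnorm(100)
lm.mod <- lm(y ~ x)

# Standardized residuals: the quantity on the vertical axis of the Q-Q plot
rs <- rstandard(lm.mod)

# Indices of the three largest absolute standardized residuals
top3 <- order(abs(rs), decreasing = TRUE)[1:3]
top3

# The signed values show which labelled points sit at the top (positive)
# and which at the bottom (negative) of the Q-Q plot
rs[top3]
```

If the reasoning above holds, top3 should contain the same three indices that R prints next to the points in the plot, and the sign of each value tells you at which end of the plot to look for it.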

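The id.n argument mentioned above controls how many points plot.lm labels. As a minimal sketch on the same simulated data (the values 5 and 0 are arbitrary choices for illustration):

```r
set.seed(1)
y <- rnorm(100)
x <- rnorm(100)
lm.mod <- lm(y ~ x)

# Label the five most extreme points instead of the default three
plot(lm.mod, which = 2, id.n = 5)

# Suppress the labels altogether
plot(lm.mod, which = 2, id.n = 0)
```

Either way, the labels only rank the points by how extreme their residuals are; they are not the result of any outlier test.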
