What's the added value of the SD line over the regression line when examining the association between two variables?

I'm trying to build up a set of practices to use when exploring a new data set, in particular for examining the association between two variables.

Example steps (not necessarily in this order):

  1. plot a y-by-x scatter plot of the raw data to see the relationship visually
  2. compute summary statistics for each variable (mean and SD)
  3. compute the correlation coefficient r
  4. draw the OLS regression line and compute its slope and intercept
  5. etc.
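
Steps 2 to 4 can be sketched in a few lines of R. This is only an illustration, assuming the built-in `mtcars` data and the `wt`/`mpg` pair used in the example further down; the variable names are my own.

```r
# Illustrative sketch of steps 2-4 (assumes the built-in mtcars data)
x <- mtcars$wt
y <- mtcars$mpg

# step 2: summary statistics for each variable
stats <- c(mean_x = mean(x), sd_x = sd(x), mean_y = mean(y), sd_y = sd(y))

# step 3: Pearson correlation coefficient r
r <- cor(x, y)

# step 4: OLS regression of y on x; coef() returns intercept and slope
fit <- lm(y ~ x)
ols <- coef(fit)

print(round(stats, 3))
print(round(r, 3))
print(round(ols, 3))
```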

I've come across the "SD line" in Freedman's Statistics book, which is defined as:

"the line that goes through the point of averages and climbs at the rate of one vertical SD for each horizontal SD" Freedman, D., Pisani, R., & Purves, R. (2007). Statistics (4th edn).

Since this book ("Statistics") is a canonical textbook, I take its decision to discuss the SD line as an indication of the line's importance. However, a simple Google search for the term "SD line" doesn't yield many independent results; most of them come directly from Freedman's book. This suggests to me that it's not a central concept in bivariate analysis in general.

When comparing the SD line with the OLS regression line, the regression line seems more informative for predicting y from x. Therefore, I'm wondering whether plotting the SD line has any benefit or added value that I would not already get from plotting the regression line.

Example using the mtcars dataset, focusing on the association between weight (wt) and fuel economy (mpg):

    data(mtcars)

    ## calculate means
    mean_wt <- mean(mtcars$wt)
    mean_mpg <- mean(mtcars$mpg)

    ## calculate standard deviations
    sd_wt <- sd(mtcars$wt)
    sd_mpg <- sd(mtcars$mpg)

    ## scatter plot
    plot(x = mtcars$wt, y = mtcars$mpg)

    ## add the "point of averages"
    points(mean_wt, mean_mpg, col = "red", cex = 1.5, pch = 16)

    ## calculate the slope of the sd line
    ## (negative here because wt and mpg are negatively correlated)
    slope <- -1 * sd_mpg / sd_wt

    ## plot the sd line
    curve(expr = x * slope + (mean_mpg - slope * mean_wt),
          add = TRUE, col = "blue", lwd = 2, type = "l", lty = 2)

    ## plot the regression line
    model <- lm(mpg ~ wt, data = mtcars)
    abline(model, col = "orange", lwd = 2)

    ## legend
    legend("topright",
           legend = c("Regression line", "SD line"),
           col = c("orange", "blue"),
           lty = c(1, 2),
           lwd = c(2, 2))

Thus, my question: how can the SD line improve one's understanding of the relationship between two variables, in a way that adds to or complements what the regression line already tells us?

Best Answer

The SD line is a didactic and visual aid that helps you see how the slope of the ordinary regression line comes about.

$$\text{slope regression} = r_{xy} \, \frac{\sigma_y}{\sigma_x} = r_{xy} \, \text{slope SD line}$$
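
This identity is easy to check numerically. A small sketch, assuming the `mtcars` example from the question (`wt` as x, `mpg` as y). Because the correlation there is negative, the SD line slope carries the sign of r, so the factor linking the two slopes is |r|.

```r
x <- mtcars$wt
y <- mtcars$mpg

r <- cor(x, y)
slope_reg <- unname(coef(lm(y ~ x))[2])   # OLS slope
slope_sd  <- sign(r) * sd(y) / sd(x)      # SD line slope, sign follows r

# the regression slope equals r * sd_y / sd_x ...
check_formula <- all.equal(slope_reg, r * sd(y) / sd(x))
# ... which is |r| times the SD line slope
check_ratio <- all.equal(slope_reg, abs(r) * slope_sd)

print(c(slope_reg = slope_reg, slope_sd = slope_sd, r = r))
print(c(check_formula, check_ratio))
```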

The SD line shows how x and y vary relative to each other. It can be steeper or flatter depending on the ratio $\frac{\sigma_y}{\sigma_x}$.

The regression line is never steeper than the SD line, and it is strictly flatter whenever $|r_{xy}| < 1$ (you might relate this to regression to the mean). How much flatter depends on the correlation. Drawing the SD line helps you see and interpret the regression line in this way.
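
To spell out the link to regression to the mean, here is a short derivation in standard units. It uses the slope formula above and the fact that the OLS line passes through the point of averages $(\bar{x}, \bar{y})$; $\hat{y}$ denotes the predicted value.

$$\hat{y} - \bar{y} = r_{xy} \, \frac{\sigma_y}{\sigma_x} \, (x - \bar{x}) \quad \Longrightarrow \quad \frac{\hat{y} - \bar{y}}{\sigma_y} = r_{xy} \, \frac{x - \bar{x}}{\sigma_x}$$

So a point that is one SD above average in x is predicted to be only $r_{xy}$ SDs above average in y, whereas the SD line would put it a full SD above average (for positive correlation).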

The higher $R^2$, the more of the variance in the data the model explains, and the closer the regression line lies to the SD line.
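
A quick numerical sketch of this point, assuming simple linear regression with a single predictor (where $R^2$ equals the squared correlation). The simulated data and the helper name `slope_ratio` are made up for illustration.

```r
set.seed(42)   # arbitrary seed, for reproducibility only
n <- 200
x <- rnorm(n)
e <- rnorm(n)

# hypothetical helper: ratio of OLS slope to SD line slope for a target rho
slope_ratio <- function(rho) {
  y   <- rho * x + sqrt(1 - rho^2) * e   # population correlation is rho
  fit <- lm(y ~ x)
  r   <- cor(x, y)
  c(rho       = rho,
    r         = r,
    r_squared = summary(fit)$r.squared,
    ratio     = unname(coef(fit)[2]) / (sign(r) * sd(y) / sd(x)))
}

res <- t(sapply(c(0.1, 0.3, 0.5, 0.7, 0.9), slope_ratio))
print(round(res, 3))
```

In each row `ratio` equals `|r|`, i.e. the square root of `r_squared`, so the two lines coincide only when $R^2 = 1$.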

The plots produced by the code below illustrate how the SD line helps. For data with $\sigma_x = \sigma_y = 1$ but different correlations, both the SD line and the regression line are drawn. Note that the regression line is closer to the SD line for larger correlations (but still always has a smaller slope).


    # random data
    set.seed(1)
    x <- rnorm(100, 0, 1)
    y <- rnorm(100, 0, 1)

    # normalizing
    x <- (x - mean(x)) / sd(x)
    y <- (y - mean(y)) / sd(y)

    # making x and y uncorrelated
    x <- x - cor(x, y) * y
    cor(x, y)
    x <- x / sd(x)

    # plotting cases with sd_x = sd_y = 1 and different correlations
    for (rho in c(0.1, 0.3, 0.5, 0.7)) {
      b <- sqrt(1 / (1 - rho^2) - 1)
      z <- (y + b * x) / sqrt(1 + b^2)
      plot(x, z,
           xlim = c(-5, 5), ylim = c(-5, 5),
           pch = 21, col = 1, bg = 1, cex = 0.7)
      title(bquote(rho == .(rho)), line = 1)
      lines(c(-10, 10), c(-10, 10), lty = 2)
      lines(c(-10, 10), c(-10, 10) * rho)
      if (rho == 0.1) {
        legend(-5, 5, c("sd line", "regression line"), lty = c(2, 1), cex = 0.9)
      }
    }
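
For reference, here is why that choice of `b` produces the desired correlation. This is my own derivation, assuming `x` and `y` are exactly uncorrelated in the sample with mean 0 and SD 1, as the code arranges.

$$z = \frac{y + b x}{\sqrt{1 + b^2}}, \qquad \operatorname{Var}(z) = \frac{1 + b^2}{1 + b^2} = 1, \qquad \operatorname{cor}(x, z) = \frac{b}{\sqrt{1 + b^2}}$$

$$b^2 = \frac{1}{1 - \rho^2} - 1 = \frac{\rho^2}{1 - \rho^2} \quad \Longrightarrow \quad 1 + b^2 = \frac{1}{1 - \rho^2} \quad \Longrightarrow \quad \frac{b}{\sqrt{1 + b^2}} = \rho$$

Hence the SD line is $z = x$ and the regression line is $z = \rho x$, which are the two lines drawn in each panel.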
