Solved – Why is the denominator of the correlation coefficient the SD of X multiplied by the SD of Y?

I don't quite understand what is going on in the correlation coefficient formula. In the numerator we have the covariance, and in the denominator we have the standard deviation of variable x multiplied by the standard deviation of variable y.

So ultimately it is a ratio of covariance to the product of the two standard deviations.

What does dividing by the product of the two standard deviations do to help us determine the correlation?

I have tried to draw it out to help me visually understand it as I find this helps, but am stuck as to what I should be looking for.

The covariance and the correlation coefficient measure essentially the same effect: how 'linked'[1,2] two variables are, i.e. if $X$ increases, how much will $Y$ increase on average?

The problem with covariance is that its value depends on the scales of the two variables: the value of the covariance doesn't tell you much if you don't also know over what range $X$ and $Y$ vary. Therefore, you don't know whether $X$ is more 'linked' with $Y$ or with $Z$ if you only know that, say, $cov(X,Y)=1000$ and $cov(X,Z)=0.1$. Maybe $Y$ is income in dollars (with big spreads), and $Z$ is percentage of time spent brushing teeth (very small spreads). In that case $X$ (amount spent on toothbrushes) may be more linked to $Z$, even though that covariance value is much lower [3].

To account for that, the correlation coefficient normalizes the covariance: we divide the covariance by the spreads (measured as standard deviations) of $X$ and $Y$, i.e. $\rho_{X,Y} = cov(X,Y) / (\sigma_X \sigma_Y)$. If you do the math (or run some simulations), you'll see that the correlation coefficient ranges from -1 (complete negative dependence) through 0 (no [linear] dependence) to 1 (complete positive dependence). Because this scale is the same for every pair of variables, it's possible to compare the degrees of 'linkage' between different pairs. And after you've used correlation coefficients for a while, you get a feeling for how much 'linkage' a certain value represents.
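If you'd rather run a simulation than do the math, here is a minimal sketch in Python with NumPy. The variables `x`, `y` and `z`, their scales, and the noise levels are made-up assumptions for illustration only, not real toothbrush or income data. The point is just that `y` has a huge spread and a loose relationship with `x`, while `z` has a tiny spread and a tight one.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, an arbitrary choice for reproducibility
n = 10_000

# Hypothetical data: all three variables are invented for this illustration.
x = rng.normal(size=n)
y = 1000 * (x + rng.normal(size=n))          # large spread, loosely tied to x
z = 0.001 * (x + 0.1 * rng.normal(size=n))   # tiny spread, tightly tied to x


def covariance(a, b):
    """Population covariance: mean product of deviations from the means."""
    return np.mean((a - a.mean()) * (b - b.mean()))


def correlation(a, b):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(a, b) / (a.std() * b.std())


cov_xy, cov_xz = covariance(x, y), covariance(x, z)
corr_xy, corr_xz = correlation(x, y), correlation(x, z)

print("cov(x, y) =", cov_xy, " cov(x, z) =", cov_xz)
print("corr(x, y) =", corr_xy, " corr(x, z) =", corr_xz)
```

In this setup the covariance with `y` comes out far larger than the covariance with `z`, yet the correlation ranks them the other way round, and both correlations stay within [-1, 1]. The hand-rolled `correlation` should also agree with NumPy's built-in `np.corrcoef(x, y)[0, 1]`.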


[1] I'd normally use 'correlated' instead of 'linked', but that might be confused with the correlation coefficient in this answer.

[2] To be more precise: 'linearly linked'. There are lots of examples where there is a clear relationship between $X$ and $Y$, but their correlation (and thus covariance) is zero, e.g. if the scatter plot of the two variables looks like a circle or a cross.

[3] And the covariance values will change if you change the units of the variables, e.g. if you express $Y$ as a fraction of average income or $Z$ in milliseconds.
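Footnotes [2] and [3] can both be checked with a short simulation. This is only a sketch under assumed data: the circle is sampled uniformly by angle, and the `seconds` variable and its relationship to `x` are hypothetical choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed, an arbitrary choice

# Footnote [2]: points on a circle are clearly related, but not linearly.
theta = rng.uniform(0, 2 * np.pi, size=100_000)
circle_x, circle_y = np.cos(theta), np.sin(theta)
circle_corr = np.corrcoef(circle_x, circle_y)[0, 1]

# Footnote [3]: changing units rescales the covariance, not the correlation.
# 'seconds' is a hypothetical duration; 'x' is loosely tied to it by assumption.
seconds = rng.normal(loc=120, scale=30, size=10_000)
x = 0.05 * seconds + rng.normal(size=10_000)
milliseconds = 1000 * seconds  # same quantity, different unit

cov_s = np.cov(x, seconds)[0, 1]
cov_ms = np.cov(x, milliseconds)[0, 1]
corr_s = np.corrcoef(x, seconds)[0, 1]
corr_ms = np.corrcoef(x, milliseconds)[0, 1]

print("circle correlation:", circle_corr)
print("covariance in s vs ms:", cov_s, cov_ms)
print("correlation in s vs ms:", corr_s, corr_ms)
```

The circle correlation should land very close to zero despite the obvious relationship, and the covariance should grow by the unit factor of 1000 while the correlation stays the same.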
