When does taking the log transformation of a univariate distribution not remove skew?

So I was playing with some data today, and I plotted a histogram of it. I obtained the following distribution:

Incredibly skewed! To fix this skewness, it makes sense to take the natural logarithm of the distribution:

Okay, now the distribution still doesn't look normal. Taking the log didn't remove the skew. When does taking the logarithm not remove skew?

The data come from here and are evidently people's ages. The precise age range is from 19 to 75 years. There are 1000 values.

The same data also appear in earlier threads started by @zero:

If the mean and median underestimate the true central tendency, why use them?

Using MLE to determine parameters for QQ plot

I am always eager to see graphs first, but some numerical results are relevant to the question too. Besides the plain logarithmic transformation, the lower limit well above 0 implies that we should also consider transformations of the form log(age $- k$) for age $> k$. Indeed @zero in the second thread (see above) used a three-parameter lognormal, which itself implies that such a transformation is more appropriate.

For concreteness, I tried $k = 18$.

Here are the conventional moment-based measures as calculated in Stata. The formulas used are documented here and also discussed here. That detail is likely to be unimportant, except that some other software subtracts 3 when reporting kurtosis, and other software may use slightly different formulas. Most crucially, with the definition used here, a normal or Gaussian distribution has kurtosis 3.

The column headed (*) is (mean $-$ median)/SD, which has the convenient virtue of being bounded by $-1$ and 1. More obviously, it is 0 if and only if mean = median. Sample skewness and kurtosis are bounded by functions of sample size, but those limits don't bite here: for example, with 1000 values skewness can't exceed 31.606 (3 d.p.).
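For readers wondering where the bound on (mean $-$ median)/SD comes from, here is a sketch of the standard argument. The symbols are introduced only for this sketch: $X$ is a random variable with mean $\mu$, median $m$ and standard deviation $\sigma$.

$$
|\mu - m| = |E(X - m)| \le E|X - m| \le E|X - \mu| \le \sqrt{E(X - \mu)^2} = \sigma
$$

The second inequality holds because the median minimises the expected absolute deviation, and the third is Jensen's inequality. The same chain applies to a sample when the SD uses divisor $n$. Using divisor $n - 1$ only makes the SD larger, so the ratio still lies between $-1$ and 1.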

```
 n = 1000       |       mean          SD    skewness    kurtosis     (*)
----------------+--------------------------------------------------------
            age |     35.542      11.353       1.023       3.611    0.224
        log age |      3.524       0.299       0.414       2.453    0.093
 log (age - 18) |      2.641       0.706      -0.442       2.970   -0.095
-------------------------------------------------------------------------
```
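For anyone without Stata, the same kind of summary can be sketched in plain Python. This is a minimal sketch, not the code behind the table. It assumes skewness and kurtosis are ratios of central moments computed with divisor n, and that the SD uses divisor n - 1. The helper name `moment_summary` is hypothetical. The `ages` list is synthetic stand-in data, not the 1000 ages from the thread, so the printed numbers will not match the table.

```python
import math
import random
import statistics


def moment_summary(values):
    """Mean, SD, moment-based skewness and kurtosis, and (mean - median)/SD."""
    n = len(values)
    mean = sum(values) / n
    # Central moments with divisor n (assumed convention: a normal has kurtosis 3).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    sd = math.sqrt(m2 * n / (n - 1))  # sample SD with divisor n - 1
    return {
        "mean": mean,
        "sd": sd,
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2,  # 3 is NOT subtracted
        "star": (mean - statistics.median(values)) / sd,
    }


# Synthetic stand-in data, NOT the ages from the thread.
random.seed(1)
k = 18
ages = [k + 1 + random.lognormvariate(2.5, 0.7) for _ in range(1000)]

variants = {
    "age": ages,
    "log age": [math.log(a) for a in ages],
    f"log (age - {k})": [math.log(a - k) for a in ages],
}
for label, values in variants.items():
    s = moment_summary(values)
    print(f"{label:>15} | " + "  ".join(f"{s[key]:9.3f}" for key in s))
```

To try this on real data, replace `ages` with your own values and check that every value exceeds `k` before taking log(age - k).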

What wording you want to apply is a matter of taste as well as experience, but I would suggest that the raw data are moderately skewed (I wouldn't use @zero's wording "incredibly" at all), logarithmic transformation helps, and the detail of whether you subtract a constant first is important.

In terms of the original question:

1. The logarithmic transformation is not as ineffective as implied, especially when you generalise it.

2. It remains true that there is no guarantee that logarithmic transformation will symmetrize, let alone render normal, any distribution, even if we restrict discussion, as we should, to variables that are all positive before we transform.
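Both points can be illustrated with a small simulation. This is a sketch on synthetic data only; the parameter values are arbitrary choices for illustration, not fitted to the ages in the thread. It compares moment-based skewness before and after taking logs for three positive variables: a two-parameter lognormal, the same lognormal shifted up by 18, and a uniform variable.

```python
import math
import random


def skewness(values):
    """Moment-based skewness m3 / m2**1.5, with divisor n in both moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


random.seed(42)
n = 5000

# Three synthetic positive variables (illustrative parameters only).
samples = {
    "lognormal": [random.lognormvariate(2.6, 0.7) for _ in range(n)],
    "18 + lognormal": [18 + random.lognormvariate(2.6, 0.7) for _ in range(n)],
    "uniform on (1, 100)": [random.uniform(1, 100) for _ in range(n)],
}

results = {}
for label, values in samples.items():
    logged = [math.log(v) for v in values]
    results[label] = (skewness(values), skewness(logged))
    raw_skew, log_skew = results[label]
    print(f"{label:>20}: raw {raw_skew:7.3f}, logged {log_skew:7.3f}")
```

One would expect the log to remove the skew of the plain lognormal almost entirely, to leave clear positive skew in the shifted version (where log(x - 18) would be the natural choice), and to turn the roughly symmetric uniform variable into a negatively skewed one.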
