Solved – Transformation of specific data

I need help with data transformation. In the picture below the upper left picture shows the histogram of the variable V6. Because it is so right-skewed I tried 3 forms of transformation but none of them seem to make the data more symmetrical. Is there maybe another solution for this? Maybe change of breaks or something else?

The data is:

head of data
CanopyCover = V6
91.30
61.50
91.40
92.00
93.20

Histogram of data

EDIT: Here is the data:

Id       SqCones  Ntrees  DBH   TreeHeight  CanopyCover
Abern1   61       32      0.23  20.42       91.30
Abern2   4        4       0.27  15.20       61.50
Abern3   15       34      0.17  15.97       91.40
Abern4   9        22      0.23  22.42       92.00
Abern5   42       22      0.18  19.45       93.20
Abern6   4        21      0.23  23.07       93.50
Abern7   12       19      0.22  21.06       88.50
Abern8   27       15      0.26  18.82       88.00
Abern9   0        12      0.23  19.16       89.80
Abern10  4        9       0.12  6.38        73.30
Abern11  91       5       0.79  25.50       94.80
Abern12  20       12      0.20  12.02       94.20
Abern13  5        15      0.19  9.06        76.80
Abern14  14       42      0.15  8.82        77.20
Abern15  35       74      0.15  17.91       91.30
Abern16  11       23      0.15  15.93       92.20
Abern17  47       67      0.14  13.79       91.80
Abern18  17       33      0.17  14.60       88.60
Abern19  16       12      0.34  13.99       92.40
Abern20  0        7       0.40  16.16       85.20
Abern21  44       14      0.37  20.88       92.90
Abern22  18       23      0.23  15.54       91.50
Abern23  9        13      0.27  16.98       90.70
Abern24  16       7       0.32  19.20       89.00
Abern25  60       11      0.26  20.03       93.50
Abern26  3        7       0.29  15.87       91.90
Abern27  5        10      0.35  20.87       90.70
Abern28  5        11      0.31  21.55       90.40
Abern29  2        3       0.42  20.37       69.90
Abern30  32       11      0.33  18.27       92.60
Abern31  55       15      0.32  24.50       91.40
Abern32  3        11      0.34  19.12       89.20
QEFP33   18       14      0.35  22.98       87.60
QEFP34   0        13      0.27  16.11       54.40
QEFP35   11       7       0.35  22.26       93.10
QEFP36   0        22      0.23  15.55       90.20
QEFP37   6        18      0.33  20.98       93.60
QEFP38   4        18      0.27  19.21       93.10
QEFP39   0        9       0.35  24.12       84.40
QEFP40   48       11      0.37  22.68       86.50
QEFP41   7        16      0.26  21.27       91.10
QEFP42   2        11      0.35  21.70       80.70
QEFP43   3        12      0.35  21.48       83.30
QEFP44   2        8       0.35  21.87       77.30
QEFP45   21       9       0.33  21.65       80.00
QEFP46   22       9       0.32  23.32       88.00
QEFP47   4        12      0.36  22.77       81.10
QEFP48   1        28      0.22  18.53       93.80
QEFP49   3        30      0.19  16.19       84.80
QEFP50   25       30      0.18  19.47       87.20
QEFP51   57       36      0.17  17.08       89.60
QEFP52   12       11      0.26  21.36       87.80

From your graphs it appears that you have about 50 measurements of percent cover. The values range from about 50% to about 100%. It's possible that some values are recorded as 100%. Presumably values cannot exceed 100% (if not, please tell us otherwise).

Note first of all that they are left-skewed, not right-skewed.

In statistics, skewness is named for the longer tail, and the terminology assumes that you are looking at a histogram with the magnitude axis horizontal. Here you are, so there is no problem on that score.
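As a quick check of the direction of skew, here is a minimal sketch that computes the moment-based skewness of the CanopyCover values posted in the question. The function name `moment_skewness` is just a label chosen for this illustration; a negative value means the longer tail is on the left.

```python
# CanopyCover (%) for the 52 plots posted in the question.
cover = [
    91.30, 61.50, 91.40, 92.00, 93.20, 93.50, 88.50, 88.00, 89.80, 73.30,
    94.80, 94.20, 76.80, 77.20, 91.30, 92.20, 91.80, 88.60, 92.40, 85.20,
    92.90, 91.50, 90.70, 89.00, 93.50, 91.90, 90.70, 90.40, 69.90, 92.60,
    91.40, 89.20, 87.60, 54.40, 93.10, 90.20, 93.60, 93.10, 84.40, 86.50,
    91.10, 80.70, 83.30, 77.30, 80.00, 88.00, 81.10, 93.80, 84.80, 87.20,
    89.60, 87.80,
]


def moment_skewness(xs):
    """Third central moment divided by the 1.5 power of the second."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


skew = moment_skewness(cover)
print(f"n = {len(cover)}, skewness = {skew:.2f}")
print("longer tail on the left" if skew < 0 else "longer tail on the right (or none)")
```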

So, square root, cube root, logarithm can't possibly help. Those are transformations for right-skewed variables.

There is a more subtle problem too. Over that twofold range, from about 50 to about 100, those transforms are close to linear, as shown by the graphs below. So they will change the units of measurement, but more crucially they won't change the shape of the distribution much at all. So although they cannot help, they also make little difference, which is why the histograms you show are all more or less the same shape.

[Figure: square root, cube root and logarithm plotted over the range from about 50 to about 100]
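The near-linearity can also be checked numerically. The sketch below is only an illustration, assuming an evenly spaced grid from 50 to 100 (the grid and the helper `pearson` are choices made here, not anything from the question); it reports the correlation between each transformed value and the original scale. A correlation very close to 1 means the transform is nearly a straight line over that range.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Evenly spaced grid over the twofold range: 50, 50.5, ..., 100.
grid = [50 + 0.5 * i for i in range(101)]

transforms = {
    "square root": math.sqrt,
    "cube root": lambda x: x ** (1 / 3),
    "logarithm": math.log,
}

correlations = {
    name: pearson(grid, [f(x) for x in grid]) for name, f in transforms.items()
}
for name, r in correlations.items():
    print(f"{name:12s} correlation with original scale = {r:.4f}")
```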

It's possible that a logit transformation will help, or a folded power.
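For reference, these are the usual definitions, assuming cover is first rescaled to a proportion $p$ = cover/100 (the symbols $p$ and $\lambda$ are notation introduced here, not taken from the question):

```latex
\operatorname{logit}(p) = \log\frac{p}{1-p},
\qquad
f_\lambda(p) = p^{\lambda} - (1-p)^{\lambda}, \quad \lambda > 0 .
```

Both are zero at $p = 1/2$ and treat the two ends of the scale symmetrically. The folded root is $\lambda = 1/2$ and the folded cube root is $\lambda = 1/3$. Unlike the logit, a folded power with $\lambda > 0$ stays finite at $p = 0$ and $p = 1$.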

Can you post the raw data?

P.S. Normality, or otherwise, is at best marginally relevant for such bounded data. Some analyses might go better if you had a more symmetric distribution, but no more. The bigger deal is what you intend to do with the data.

EDIT on seeing the data:

There are at least two questions bundled together here:

  1. What kind of transformation would best symmetrize a variable like canopy cover? Note that in principle such a variable is bounded between 0 and 100%. For that reason alone, many well-known distributions, including the normal, can't fit the data in principle.

  2. What kind of scales (including quite possibly the scales on which data arrive) should be used for analysing a response variable in relation to various predictors such as canopy cover?

These questions aren't that closely related. I'll answer them in reverse order.

  2. Given that SqCones is a count, I would reach for Poisson regression. Ignoring the other predictors, a Poisson regression on CanopyCover seems reasonable. In principle, this relation could be projected to 100% cover, but the graph shows clearly that doing so really would be an extrapolation.

[Figure: SqCones against CanopyCover with predictions from the Poisson regression]

It's no part of the Poisson regression to assume that any predictor is normally, or even symmetrically, distributed. But if we felt a little squeamish about the skewness and whether it had side effects, we could try transforming to see if it made a difference. As logit is not defined for 100%, I feel hesitant about applying it here, even though all values of cover are below 100%. I tried folded square root, got predictions using that as a predictor and then plotted the predictions on top of the previous predictions:

[Figure: predictions using folded square root of cover plotted on top of the previous predictions]

There is no obvious disadvantage to using the original scale because predictions are close with and without the transformation.
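For readers who want to repeat that comparison, here is a rough sketch using only the Python standard library. It is not the code behind the graphs above: it assumes a Poisson regression with log link and a single predictor, fitted by iteratively reweighted least squares, and takes the folded square root of cover as sqrt(p) - sqrt(1 - p) with p = cover/100. The names `fit_poisson` and `folded_root` are made up for this illustration.

```python
import math

# SqCones and CanopyCover (%) for the 52 plots posted in the question.
sq_cones = [
    61, 4, 15, 9, 42, 4, 12, 27, 0, 4, 91, 20, 5, 14, 35, 11, 47, 17, 16, 0,
    44, 18, 9, 16, 60, 3, 5, 5, 2, 32, 55, 3, 18, 0, 11, 0, 6, 4, 0, 48,
    7, 2, 3, 2, 21, 22, 4, 1, 3, 25, 57, 12,
]
cover = [
    91.30, 61.50, 91.40, 92.00, 93.20, 93.50, 88.50, 88.00, 89.80, 73.30,
    94.80, 94.20, 76.80, 77.20, 91.30, 92.20, 91.80, 88.60, 92.40, 85.20,
    92.90, 91.50, 90.70, 89.00, 93.50, 91.90, 90.70, 90.40, 69.90, 92.60,
    91.40, 89.20, 87.60, 54.40, 93.10, 90.20, 93.60, 93.10, 84.40, 86.50,
    91.10, 80.70, 83.30, 77.30, 80.00, 88.00, 81.10, 93.80, 84.80, 87.20,
    89.60, 87.80,
]


def fit_poisson(x, y, max_iter=100, tol=1e-10):
    """Poisson regression of y on one predictor x (log link), fitted by
    iteratively reweighted least squares. Returns (intercept, slope)."""
    n = len(x)
    x_mean = sum(x) / n
    xc = [xi - x_mean for xi in x]   # centre x for numerical stability
    mu = [yi + 0.5 for yi in y]      # starting values away from zero
    a = b = 0.0
    for _ in range(max_iter):
        # Working response for the log link; the working weights are mu.
        z = [math.log(m) + (yi - m) / m for yi, m in zip(y, mu)]
        sw = sum(mu)
        x_w = sum(m * xi for m, xi in zip(mu, xc)) / sw
        z_w = sum(m * zi for m, zi in zip(mu, z)) / sw
        sxx = sum(m * (xi - x_w) ** 2 for m, xi in zip(mu, xc))
        sxz = sum(m * (xi - x_w) * (zi - z_w) for m, xi, zi in zip(mu, xc, z))
        b_new = sxz / sxx
        a_new = z_w - b_new * x_w
        done = abs(a_new - a) + abs(b_new - b) < tol
        a, b = a_new, b_new
        mu = [math.exp(a + b * xi) for xi in xc]
        if done:
            break
    return a - b * x_mean, b         # intercept on the uncentred scale


def folded_root(percent):
    """Folded square root of a percentage, after rescaling to a proportion."""
    p = percent / 100.0
    return math.sqrt(p) - math.sqrt(1.0 - p)


a_raw, b_raw = fit_poisson(cover, sq_cones)
folded = [folded_root(c) for c in cover]
a_fold, b_fold = fit_poisson(folded, sq_cones)

pred_raw = [math.exp(a_raw + b_raw * c) for c in cover]
pred_fold = [math.exp(a_fold + b_fold * f) for f in folded]
largest_gap = max(abs(r - f) for r, f in zip(pred_raw, pred_fold))

print(f"raw cover:   intercept = {a_raw:.3f}, slope = {b_raw:.4f}")
print(f"folded root: intercept = {a_fold:.3f}, slope = {b_fold:.4f}")
print(f"largest gap between the two sets of predictions = {largest_gap:.2f} cones")
```

The two slopes are on different scales and are not directly comparable; the quantity to look at is how far apart the two sets of predicted counts are at the observed values of cover.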

For more about folded powers, see What is the most appropriate way to transform proportions when they are an independent variable? (which gives yet further references).

  1. If you were curious about what transformation might make those cover data more symmetrical:

    • logit helps a bit, but in principle you should worry about its inapplicability to data that might have been 100% (there are fudges for the latter problem for counted data, but I don't know a good fudge here)

    • weaker transformations such as folded cube root or folded root help only a little, although at least they are defined for 100%

    • to correct left skewness, squares or cubes are available in principle (and perfectly well-defined for 100%), but they don't help much either

    • you could try, following @whuber's suggestion, to work on transforming (100% $-$ canopy cover), and my guess is that you could get closer to symmetry, but at the cost of a measure that biologically is the wrong way round (watch out too, for log 0 as a problem in principle).
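The candidates in this list can be compared side by side. The following is a sketch under stated assumptions: cover is rescaled to a proportion p = cover/100 for the logit and folded powers, a folded power is taken as p^power - (1 - p)^power, and symmetry is summarised by moment-based skewness (closer to zero means more symmetric by that one measure). The function names are illustrative only.

```python
import math

# CanopyCover (%) for the 52 plots posted in the question.
cover = [
    91.30, 61.50, 91.40, 92.00, 93.20, 93.50, 88.50, 88.00, 89.80, 73.30,
    94.80, 94.20, 76.80, 77.20, 91.30, 92.20, 91.80, 88.60, 92.40, 85.20,
    92.90, 91.50, 90.70, 89.00, 93.50, 91.90, 90.70, 90.40, 69.90, 92.60,
    91.40, 89.20, 87.60, 54.40, 93.10, 90.20, 93.60, 93.10, 84.40, 86.50,
    91.10, 80.70, 83.30, 77.30, 80.00, 88.00, 81.10, 93.80, 84.80, 87.20,
    89.60, 87.80,
]


def skewness(xs):
    """Moment-based skewness: m3 / m2^1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


def logit(p):
    """Log odds; undefined at p = 0 and p = 1."""
    return math.log(p / (1.0 - p))


def folded_power(p, power):
    """Folded power of a proportion; finite at p = 0 and p = 1 for power > 0."""
    return p ** power - (1.0 - p) ** power


props = [c / 100.0 for c in cover]
candidates = {
    "raw cover": cover,
    "square": [c ** 2 for c in cover],
    "cube": [c ** 3 for c in cover],
    "folded root": [folded_power(p, 0.5) for p in props],
    "folded cube root": [folded_power(p, 1 / 3) for p in props],
    "logit": [logit(p) for p in props],
    # Reversed scale: this would fail with log 0 if any cover were 100%.
    "log(100 - cover)": [math.log(100.0 - c) for c in cover],
}

skews = {name: skewness(values) for name, values in candidates.items()}
for name, s in skews.items():
    print(f"{name:18s} skewness = {s:6.2f}")
```

Skewness is a crude single-number summary, so a quantile plot of each candidate is still worth a look before settling on any of them.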

I've not tried to bring your other variables into the analysis. For completeness, I will mention an unasked question,

  3. What kind of scales (including quite possibly the scales on which data arrive) should be used for analysing variables such as canopy cover as response variable in relation to various predictors?

The most important advice I have for this problem is that the precise distribution of a predictor usually doesn't matter much. Normal distributions are not a target: if they were, we could hardly use (0, 1) indicators as predictors, as they fail dismally to be anything like normal.
