I need help with data transformation. In the picture below, the upper-left panel shows the histogram of the variable V6. Because it is so right-skewed, I tried three forms of transformation, but none of them seems to make the data more symmetrical. Is there another solution for this? Maybe a change of breaks or something else?
The data is:
head of data
CanopyCover = V6
91.30
61.50
91.40
92.00
93.20
EDIT: Here is the data:
Id       SqCones  Ntrees  DBH   TreeHeight  CanopyCover
Abern1   61       32      0.23  20.42       91.30
Abern2   4        4       0.27  15.20       61.50
Abern3   15       34      0.17  15.97       91.40
Abern4   9        22      0.23  22.42       92.00
Abern5   42       22      0.18  19.45       93.20
Abern6   4        21      0.23  23.07       93.50
Abern7   12       19      0.22  21.06       88.50
Abern8   27       15      0.26  18.82       88.00
Abern9   0        12      0.23  19.16       89.80
Abern10  4        9       0.12  6.38        73.30
Abern11  91       5       0.79  25.50       94.80
Abern12  20       12      0.20  12.02       94.20
Abern13  5        15      0.19  9.06        76.80
Abern14  14       42      0.15  8.82        77.20
Abern15  35       74      0.15  17.91       91.30
Abern16  11       23      0.15  15.93       92.20
Abern17  47       67      0.14  13.79       91.80
Abern18  17       33      0.17  14.60       88.60
Abern19  16       12      0.34  13.99       92.40
Abern20  0        7       0.40  16.16       85.20
Abern21  44       14      0.37  20.88       92.90
Abern22  18       23      0.23  15.54       91.50
Abern23  9        13      0.27  16.98       90.70
Abern24  16       7       0.32  19.20       89.00
Abern25  60       11      0.26  20.03       93.50
Abern26  3        7       0.29  15.87       91.90
Abern27  5        10      0.35  20.87       90.70
Abern28  5        11      0.31  21.55       90.40
Abern29  2        3       0.42  20.37       69.90
Abern30  32       11      0.33  18.27       92.60
Abern31  55       15      0.32  24.50       91.40
Abern32  3        11      0.34  19.12       89.20
QEFP33   18       14      0.35  22.98       87.60
QEFP34   0        13      0.27  16.11       54.40
QEFP35   11       7       0.35  22.26       93.10
QEFP36   0        22      0.23  15.55       90.20
QEFP37   6        18      0.33  20.98       93.60
QEFP38   4        18      0.27  19.21       93.10
QEFP39   0        9       0.35  24.12       84.40
QEFP40   48       11      0.37  22.68       86.50
QEFP41   7        16      0.26  21.27       91.10
QEFP42   2        11      0.35  21.70       80.70
QEFP43   3        12      0.35  21.48       83.30
QEFP44   2        8       0.35  21.87       77.30
QEFP45   21       9       0.33  21.65       80.00
QEFP46   22       9       0.32  23.32       88.00
QEFP47   4        12      0.36  22.77       81.10
QEFP48   1        28      0.22  18.53       93.80
QEFP49   3        30      0.19  16.19       84.80
QEFP50   25       30      0.18  19.47       87.20
QEFP51   57       36      0.17  17.08       89.60
QEFP52   12       11      0.26  21.36       87.80
Best Answer
From your graphs it appears that you have about 50 measurements of percent cover. The values range from about 50% to about 100%. It's possible that some values are recorded as 100%. Presumably values cannot exceed 100% (if not, please tell us otherwise).
Note first of all that they are left-skewed, not right-skewed.
In statistics, the label for skewness is that of the longer tail: the terminology implies that you are looking at a histogram with magnitude axis horizontal. In this case, you are, so no problem there.
So, square root, cube root, logarithm can't possibly help. Those are transformations for right-skewed variables.
There is a more subtle problem too. Over that twofold range, from about 50 to about 100, those transforms are close to linear, as shown by the graphs below. So they will change the units of measurement but, more crucially, they won't change the shape of the distribution much at all. That is why, quite apart from being unable to help, they make little difference, and why the histograms you show are all more or less the same shape.
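The near-linearity can also be checked numerically. The following is only a sketch: the evenly spaced grid from 50 to 100 and the helper `pearson` are illustrative assumptions, not part of the original analysis. It correlates each transform with the untransformed values over that range; a correlation close to 1 means the transform is almost a straight line there.

```python
import math


def pearson(u, v):
    """Pearson correlation between two equally long sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    s_uv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    s_uu = sum((a - mean_u) ** 2 for a in u)
    s_vv = sum((b - mean_v) ** 2 for b in v)
    return s_uv / math.sqrt(s_uu * s_vv)


# Assumed grid: 50, 50.5, ..., 100, standing in for percent cover.
grid = [50 + 0.5 * i for i in range(101)]

transforms = {
    "square root": math.sqrt,
    "cube root": lambda x: x ** (1 / 3),
    "logarithm": math.log,
}

# Correlation of each transformed grid with the original grid.
correlations = {
    name: pearson(grid, [f(x) for x in grid]) for name, f in transforms.items()
}

for name, r in correlations.items():
    print(f"{name:12s} correlation with original scale: {r:.4f}")
```

All three correlations come out above 0.99 on this grid, which is consistent with the histograms barely changing shape.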
It's possible that a logit transformation will help, or a folded power.
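For reference, with cover written as a proportion $p$ between 0 and 1, these transformations are usually defined as follows. The symbols $p$ and $\lambda$ are notation assumed here, not taken from the question.

$$\operatorname{logit}(p) = \ln\frac{p}{1-p}, \qquad f_\lambda(p) = p^{\lambda} - (1-p)^{\lambda}$$

Here $f_\lambda$ is the folded power; $\lambda = 1/2$ gives the folded square root and $\lambda = 1/3$ the folded cube root. Both are 0 at $p = 1/2$. A folded power with $\lambda > 0$ stays finite at $p = 0$ and $p = 1$, whereas the logit does not.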
Can you post the raw data?
P.S. Normality, or otherwise, is at best marginally relevant for such bounded data. Some analyses might go better if you had a more symmetric distribution, but no more. The bigger deal is what you intend to do with the data.
EDIT on seeing the data:
There are at least two questions bundled together here:
- What kind of transformation would best symmetrize a variable like canopy cover? Note that in principle such a variable is bounded between 0 and 100%. For that reason alone, many well-known distributions, including the normal, can't fit the data in principle.
- What kind of scales (including quite possibly the scales on which data arrive) should be used for analysing a response variable in relation to various predictors such as canopy cover?
These questions aren't that closely related. I'll answer them in reverse order.
- Given that SqCones is a count, I would reach for Poisson regression. Ignoring the other predictors, a Poisson regression on CanopyCover seems reasonable. In principle, this relation could be projected to 100% cover; the graph shows clearly that that really would be an extrapolation.
It's no part of the Poisson regression to assume that any predictor is normally, or even symmetrically, distributed. But if we felt a little squeamish about the skewness and whether it had side effects, we could try transforming to see if it made a difference. As logit is not defined for 100%, I feel hesitant about applying it here, even though all values of cover are below 100%. I tried folded square root, got predictions using that as a predictor and then plotted the predictions on top of the previous predictions:
There is no obvious disadvantage to using the original scale because predictions are close with and without the transformation.
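A comparison of this kind can be sketched in plain Python. This is not the code behind the graphs above: the hand-rolled fitter `fit_poisson` (iteratively reweighted least squares for a single predictor), `predict` and `folded_root` are hypothetical helpers written for illustration, and in practice a standard GLM routine would be used instead. The data are the SqCones and CanopyCover columns as posted.

```python
import math

# SqCones (counts) and CanopyCover (%) for the 52 plots, copied from the posted data.
sq_cones = [61, 4, 15, 9, 42, 4, 12, 27, 0, 4, 91, 20, 5, 14, 35, 11, 47, 17,
            16, 0, 44, 18, 9, 16, 60, 3, 5, 5, 2, 32, 55, 3, 18, 0, 11, 0, 6,
            4, 0, 48, 7, 2, 3, 2, 21, 22, 4, 1, 3, 25, 57, 12]
canopy = [91.30, 61.50, 91.40, 92.00, 93.20, 93.50, 88.50, 88.00, 89.80, 73.30,
          94.80, 94.20, 76.80, 77.20, 91.30, 92.20, 91.80, 88.60, 92.40, 85.20,
          92.90, 91.50, 90.70, 89.00, 93.50, 91.90, 90.70, 90.40, 69.90, 92.60,
          91.40, 89.20, 87.60, 54.40, 93.10, 90.20, 93.60, 93.10, 84.40, 86.50,
          91.10, 80.70, 83.30, 77.30, 80.00, 88.00, 81.10, 93.80, 84.80, 87.20,
          89.60, 87.80]


def fit_poisson(x, y, max_iter=50, tol=1e-10):
    """Fit log(mu) = b0 + b1 * (x - mean(x)) by iteratively reweighted least squares."""
    x_mean = sum(x) / len(x)
    xc = [xi - x_mean for xi in x]            # centring keeps the iterations well conditioned
    b0, b1 = math.log(sum(y) / len(y)), 0.0   # start from the intercept-only fit
    for _ in range(max_iter):
        eta = [b0 + b1 * xi for xi in xc]
        mu = [math.exp(e) for e in eta]
        # Working response for the log link; the working weights are mu itself.
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        sw = sum(mu)
        swx = sum(m * xi for m, xi in zip(mu, xc))
        swz = sum(m * zi for m, zi in zip(mu, z))
        swxx = sum(m * xi * xi for m, xi in zip(mu, xc))
        swxz = sum(m * xi * zi for m, xi, zi in zip(mu, xc, z))
        # Closed-form weighted least squares for one predictor plus intercept.
        new_b1 = (sw * swxz - swx * swz) / (sw * swxx - swx ** 2)
        new_b0 = (swz - new_b1 * swx) / sw
        done = abs(new_b0 - b0) + abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if done:
            break
    return b0, b1, x_mean


def predict(fit, x):
    """Fitted Poisson means for predictor values x."""
    b0, b1, x_mean = fit
    return [math.exp(b0 + b1 * (xi - x_mean)) for xi in x]


def folded_root(p):
    """Folded square root of a proportion p in [0, 1]."""
    return math.sqrt(p) - math.sqrt(1.0 - p)


folded = [folded_root(c / 100.0) for c in canopy]
fit_raw = fit_poisson(canopy, sq_cones)
fit_folded = fit_poisson(folded, sq_cones)
pred_raw = predict(fit_raw, canopy)
pred_folded = predict(fit_folded, folded)

# Print both sets of fitted means for every tenth plot, ordered by cover.
rows = sorted(zip(canopy, pred_raw, pred_folded))
for cover, m_raw, m_folded in rows[::10]:
    print(f"cover {cover:5.1f}%  raw scale: {m_raw:6.2f}  folded root: {m_folded:6.2f}")
```

Putting the two columns of fitted means side by side is the numerical counterpart of overlaying the two prediction curves.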
For more about folded powers, see What is the most appropriate way to transform proportions when they are an independent variable? (which gives yet further references).
If you were curious about what transformation might make those cover data more symmetrical:
- logit helps a bit, but in principle you should worry about its inapplicability to data that might have been 100% (there are fudges for the latter problem for counted data, but I don't know a good fudge here);
- weaker transformations such as folded cube root or folded square root do help, though not much, and at least they are defined for 100%;
- to correct left skewness, squares or cubes are available in principle (and perfectly well-defined for 100%), but they don't help much either;
- you could try, following @whuber's suggestion, to work on transforming (100% $-$ canopy cover), and my guess is that you could get closer to symmetry, but at the cost of a measure that biologically is the wrong way round (watch out, too, for log 0 as a problem in principle).
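One way to put rough numbers on these comparisons is to compute a skewness coefficient for each candidate. The sketch below makes two assumptions that are not in the answer: the moment-based coefficient in the hypothetical helper `skewness` is only one of several possible measures, and the values it prints were not used to draw any of the graphs. Negative values mean left skew, and values nearer 0 mean closer to symmetry.

```python
import math

# CanopyCover (%) for the 52 plots, copied from the posted data.
canopy = [91.30, 61.50, 91.40, 92.00, 93.20, 93.50, 88.50, 88.00, 89.80, 73.30,
          94.80, 94.20, 76.80, 77.20, 91.30, 92.20, 91.80, 88.60, 92.40, 85.20,
          92.90, 91.50, 90.70, 89.00, 93.50, 91.90, 90.70, 90.40, 69.90, 92.60,
          91.40, 89.20, 87.60, 54.40, 93.10, 90.20, 93.60, 93.10, 84.40, 86.50,
          91.10, 80.70, 83.30, 77.30, 80.00, 88.00, 81.10, 93.80, 84.80, 87.20,
          89.60, 87.80]


def skewness(values):
    """Moment-based skewness: third central moment over variance to the power 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


p = [c / 100.0 for c in canopy]  # cover as a proportion; all values are below 1

candidates = {
    "raw": p,
    "square root": [math.sqrt(v) for v in p],
    "log": [math.log(v) for v in p],
    "square": [v ** 2 for v in p],
    "cube": [v ** 3 for v in p],
    "folded root": [math.sqrt(v) - math.sqrt(1 - v) for v in p],
    "logit": [math.log(v / (1 - v)) for v in p],
    "log(1 - p)": [math.log(1 - v) for v in p],  # reflected scale: sign of skew flips
}

skews = {name: skewness(vals) for name, vals in candidates.items()}

for name, s in skews.items():
    print(f"{name:12s} skewness = {s:+.3f}")
```

Because every cover value here exceeds 50%, logit and folded root are convex over the observed range, so like squares and cubes they push the skewness upwards from its negative starting point, whereas root and log push it further down.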
I've not tried to bring your other variables into the analysis. For completeness, I will mention an unasked question,
- What kind of scales (including quite possibly the scales on which data arrive) should be used for analysing variables such as canopy cover as response variable in relation to various predictors?
The most important advice I have for this problem is that the precise distribution of a predictor usually doesn't matter much. Normal distributions are not a target: if they were, we could hardly use (0, 1) indicators as predictors, since those fail dismally to be normal.