I want to predict a continuous variable like porosity from remote sensing data. Let's say I have a measure of the reflectivity of the surface of the earth, densely sampled over an area. And I have rock porosity at some locations. So I'd like to calibrate the former to the latter, and transform reflectivity to porosity.
Since porosity p is the unknown, I instinctively plot it on the y-axis, against reflectivity r on the x. Then do a linear regression (let's pretend life is this simple for a mo), and get my equation $p = a r + b$, and go off on my merry way.
But this feels wrong. Reflectivity depends on porosity (in my hypothesis, and let's say that we know it does in nature). So in fact I should plot reflectivity on the y-axis, vs porosity on the x. Then get my linear equation $r = a p + b$ (with its own $a$ and $b$, not the ones from the first fit), then rearrange to solve for $p = (r - b)/a$.
These approaches give different answers. I found this a bit surprising. Most people I've asked about this say, 'does it matter?'. It does, and the difference can be quite big (several percent).
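A sketch of where the difference comes from, using the standard least-squares slope formulas. The symbols are my own shorthand: $s_{rp}$ is the sample covariance of reflectivity and porosity, $s_r^2$ and $s_p^2$ are their sample variances, and $\rho$ is the sample correlation.

$$
a_{p \mid r} = \frac{s_{rp}}{s_r^2}, \qquad
a_{r \mid p} = \frac{s_{rp}}{s_p^2}, \qquad
\frac{a_{p \mid r}}{1 / a_{r \mid p}} = \frac{s_{rp}^2}{s_r^2 \, s_p^2} = \rho^2
$$

So the slope from regressing $p$ on $r$ is $\rho^2$ times the slope of the rearranged fit, and the two lines only coincide when the data are perfectly correlated ($|\rho| = 1$).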
Which one is the correct approach? Why? (For what it's worth, my answer is the latter approach, because it 'respects' nature, to anthropomorphize a bit… Not sure I can really explain myself any better).
Best Answer
As far as I can tell, you have a lot of reflectivity measurements and a few porosity measurements, and you want to estimate porosity in areas where you only have reflectivity information.
In that case, you want to regress porosity ($y$) on reflectivity ($x$), because that fit minimises the sum of squared errors in the porosity estimates, which is the quantity you are predicting. The direction of any physical causal relationship should not affect this choice.
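To make this concrete, here is a minimal sketch on synthetic data, assuming NumPy is available. The sample size, coefficients, and noise level are made up for illustration and are not taken from the question. It fits both lines and compares the sum of squared errors in the porosity estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: every number here is an illustrative assumption.
n = 200
p = rng.uniform(0.05, 0.35, n)                     # porosity (fraction)
r = 0.8 - 1.5 * p + rng.normal(0.0, 0.05, n)       # reflectivity depends on porosity, plus noise

# Approach 1: regress porosity on reflectivity, p = a*r + b.
a_dir, b_dir = np.polyfit(r, p, 1)
p_hat_dir = a_dir * r + b_dir

# Approach 2: regress reflectivity on porosity, r = a*p + b, then rearrange for p.
a_inv, b_inv = np.polyfit(p, r, 1)
p_hat_inv = (r - b_inv) / a_inv

# Sum of squared errors in the porosity estimates for each approach.
sse_dir = np.sum((p - p_hat_dir) ** 2)
sse_inv = np.sum((p - p_hat_inv) ** 2)

print("slope, p on r:            ", a_dir)
print("slope, rearranged r on p: ", 1.0 / a_inv)
print("SSE in porosity, approach 1:", sse_dir)
print("SSE in porosity, approach 2:", sse_inv)
```

On the sample used for fitting, approach 1 cannot do worse than approach 2 in this measure, since ordinary least squares picks the line that minimises exactly this sum of squares.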