I have two datasets: a training set and a test set. The dependent variable is a proportion. There are 54 predictors that are positive and negative real numbers, and another 7 predictors that are text.
There are three response variables: Total, the normalized total number of hits; Treatment, the normalized total number of hits during treatment; and Percent, a ratio of the other two responses.
At the moment, using lm on the percent response, I get a correlation of .4, and 85% of the predictions are within 20% of their target. For the treatment response, using glm with the Poisson family, I get a correlation of .6, but the predictions do not match the target data at all.
I have two main issues I need advice on:
(1) R rejected the text predictors with the error "factor has new level(s)".
I would like it to ignore a text predictor for the rows that have a new level, but still use it for the rows whose level was seen in training. How do I do that?
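One way to get this behaviour, sketched here on made-up data: re-encode the test factor with the training levels so that unseen labels become NA, predict with the full model where the level is known, and fall back to a model without that factor elsewhere. The data frames `train` and `test`, the column names, and the fallback model are all assumptions for illustration, not part of the original data.

```r
# Toy data (hypothetical): one numeric predictor and one text predictor.
set.seed(1)
train <- data.frame(
  y   = runif(40),
  x1  = rnorm(40),
  grp = factor(rep(c("a", "b", "c"), length.out = 40))
)
test <- data.frame(
  x1  = rnorm(6),
  grp = c("a", "b", "d", "c", "e", "a"),  # "d" and "e" never occur in train
  stringsAsFactors = FALSE
)

# Full model uses the text predictor; the reduced model is the fallback.
fit_full    <- lm(y ~ x1 + grp, data = train)
fit_reduced <- lm(y ~ x1, data = train)

# Re-encode with the training levels: unseen labels turn into NA.
test$grp <- factor(test$grp, levels = levels(train$grp))
unseen   <- is.na(test$grp)

# Use the factor where its level is known, ignore it where it is new.
pred <- numeric(nrow(test))
pred[!unseen] <- predict(fit_full,    newdata = test[!unseen, , drop = FALSE])
pred[unseen]  <- predict(fit_reduced, newdata = test[unseen,  , drop = FALSE])
```

With 7 text predictors the same idea applies per predictor, although the number of fallback models can grow quickly; lumping rare and unseen labels into a single "other" level before fitting is a common alternative.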
(2) To make my dependent variable a real number, rather than a proportion bounded between 0 and 1, I was advised to transform the response using, for example, the logit transform or the Normal quantile function (qnorm in R). The problem is that these transformations (and others like them) will map 0 and 1 to non-finite values. How can I model these data in a regression setting when the response is a proportion that can be 0 or 1?
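Two options that are often suggested for this situation, sketched on simulated data. The first fits the proportion directly with a logit link through glm and the quasibinomial family, which accepts responses of exactly 0 or 1. The second nudges the response into the open interval (0, 1) with the adjustment (y(n-1) + 0.5)/n before applying the logit. The data and the names `dat`, `x1`, `x2` are assumptions for illustration only.

```r
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
mu <- plogis(-0.5 + 0.8 * x1 - 0.4 * x2)
# Clip noisy proportions to [0, 1], so some values land exactly on 0 or 1.
y   <- pmin(pmax(mu + rnorm(n, sd = 0.25), 0), 1)
dat <- data.frame(y, x1, x2)

# Option A: model the mean of the proportion with a logit link.
# No transformation of y is needed, so 0 and 1 cause no trouble.
fit  <- glm(y ~ x1 + x2, family = quasibinomial(link = "logit"), data = dat)
pred <- predict(fit, newdata = dat, type = "response")  # stays inside (0, 1)

# Option B: squeeze y away from the boundaries, then transform.
y_sq <- (dat$y * (n - 1) + 0.5) / n
z    <- qlogis(y_sq)                   # finite for every observation
fit_lm <- lm(z ~ x1 + x2, data = dat)
```

Option A keeps predictions on the proportion scale, which makes them directly comparable with the observed values. Option B depends on the somewhat arbitrary size of the squeeze, so it is worth checking how sensitive the results are to it.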
Using linear regression with outlier removal, I am able to get 2239 of 2583 test observations within 20% of their actual value. I would like to have that many within 10%.
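For reference, 2239 of 2583 is about 86.7%. A small helper for tracking this kind of figure at different tolerances might look like the sketch below. It assumes "within 20%" means relative error, which is only one possible reading, and the function name `share_within` is made up for illustration. Note that with a relative tolerance, an actual value of 0 can only be matched exactly.

```r
# Share of predictions whose relative error is at most `tol`.
share_within <- function(pred, actual, tol) {
  mean(abs(pred - actual) <= tol * abs(actual))
}

# The figure quoted in the question, as a fraction.
share_20 <- 2239 / 2583

# Tiny example: two of the three predictions are within 20% of the target.
example <- share_within(pred = c(1.1, 2.5, 3), actual = c(1, 2, 3), tol = 0.2)
```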
Using glm with the Poisson distribution, the predicted amount of treatment has a correlation of 69% with the actual values.
Ignoring this second issue for the moment, if I transform the response in y~x1+x2 so that y=log(y/(1-y)), the correlation of my predictions with the actual data drops from 6% to 2%.
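One thing that may be worth checking with a transformed response is the scale on which the correlation is computed: predictions from the transformed model are on the logit scale and need to be mapped back with the inverse logit before they are compared with the original proportions. A minimal sketch on simulated data (all names and values here are assumptions, not the original data):

```r
set.seed(3)
n  <- 150
x1 <- rnorm(n)
x2 <- rnorm(n)
# Simulated proportions strictly inside (0, 1).
y  <- plogis(0.5 * x1 - 0.3 * x2 + rnorm(n, sd = 0.5))

# qlogis(y) is the same as log(y / (1 - y)).
fit <- lm(qlogis(y) ~ x1 + x2)

pred_logit <- predict(fit)         # predictions on the logit scale
pred_prop  <- plogis(pred_logit)   # back on the proportion scale

# Compare like with like: proportions against proportions.
r_prop <- cor(pred_prop, y)
```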
[Figure: the data after the logit transform]
[Figure: the data before the transform]
Best Answer
If Poisson regression seemed to help, it may be because the right thing to do is to treat the outcomes as counts. But if it is not satisfactory, negative binomial regression might be better. It allows for overdispersion and is a lot more flexible. The Poisson distribution has the property that the mean equals the variance. In real examples the variance can be less than the mean (underdispersed) or greater (overdispersed). Negative binomial regression gets around the overdispersion problem because the variance doesn't have to equal the mean; it can be larger. Joe Hilbe has a nice book dedicated to negative binomial regression for count data models. Maybe you can do that with your software.
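In R this could be sketched as follows, using glm.nb from the MASS package, which ships with standard R installations. The example assumes the response is a raw integer count (a normalized total would not be) and uses simulated data, so the names and numbers are illustrative only.

```r
library(MASS)  # provides glm.nb

set.seed(7)
n      <- 300
x      <- rnorm(n)
mu     <- exp(1 + 0.5 * x)
# Simulated overdispersed counts: variance = mu + mu^2 / size.
counts <- rnbinom(n, size = 1.5, mu = mu)
dat    <- data.frame(counts, x)

# Poisson fit, plus a rough overdispersion check:
# Pearson chi-square divided by residual degrees of freedom.
# Values well above 1 suggest the Poisson variance is too small.
fit_pois   <- glm(counts ~ x, family = poisson, data = dat)
dispersion <- sum(residuals(fit_pois, type = "pearson")^2) / df.residual(fit_pois)

# Negative binomial fit; theta is the estimated dispersion parameter.
fit_nb <- glm.nb(counts ~ x, data = dat)
theta  <- fit_nb$theta

# Lower AIC indicates the better trade-off between fit and complexity.
aic <- AIC(fit_pois, fit_nb)
```

If the counts were normalized by some exposure, such as total hits, one option is to model the raw counts and add the log of the exposure as an offset term instead of dividing beforehand.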