# Interpretation of coefficients in logistic regression output

I am doing logistic regression in R on a binary dependent variable with only one independent variable. I found an odds ratio of 0.99 for one of the outcomes, which can be analysed as follows. The odds ratio is defined as \$\mathrm{ratio}_{odds}(H) = \frac{P(X=H)}{1-P(X=H)}\$. As given above, \$\mathrm{ratio}_{odds}(H) = 0.99\$, which implies \$P(X=H) = 0.497\$, close to a 50% probability. This would mean that the probability of an H case (or a non-H case) is about 50% at the given values of the independent variable. That does not seem realistic from the data, because only about 20% of the observations are H cases. Please clarify and explain how to interpret this kind of situation in logistic regression.
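For reference, the odds-to-probability arithmetic used in that step can be checked directly (a minimal sketch in Python; the value 0.99 is taken from the model output below):

```python
# Treating the reported value 0.99 as odds, the implied probability is
# p = odds / (1 + odds); this is the arithmetic used in the question.
odds = 0.99
p = odds / (1 + odds)
print(round(p, 3))  # 0.497, i.e. close to 50%
```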

Here are the results of my model:

```
M1 <- glm(H ~ X, data = data, family = binomial())
summary(M1)

Call:
glm(formula = H ~ X, family = binomial(), data = data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8563   0.6310   0.6790   0.7039   0.7608  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.6416666  0.2290133   7.168 7.59e-13 ***
X           -0.0014039  0.0009466  -1.483    0.138    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1101.1  on 1070  degrees of freedom
Residual deviance: 1098.9  on 1069  degrees of freedom
  (667 observations deleted due to missingness)
AIC: 1102.9

Number of Fisher Scoring iterations: 4

exp(cbind(OR = coef(M1), confint(M1)))
Waiting for profiling to be done...
                   OR     2.5 %   97.5 %
(Intercept) 5.1637680 3.3204509 8.155564
X           0.9985971 0.9967357 1.000445
```

I have 1738 observations in total, of which H is the binary dependent variable. 19.95% of the observations fall in the (H=0) category and the remainder are in the (H=1) category. This dependent variable is modelled against the covariate X, whose minimum value is 82.23, mean value is 223.8, and maximum value is 391.6. The 667 missing values correspond to the covariate X, i.e., 667 of the 1738 values of X are missing from the dataset.


### Summary

The question misinterprets the coefficients.

The software output shows that the log odds of the response do not depend appreciably on \$X\$, because its coefficient is small and not significant (\$p=0.138\$). Therefore the proportion of positive results in the data, equal to \$100\% - 19.95\% \approx 80\%\$, ought to have log odds close to the intercept of \$1.64\$. Indeed,

\$\$\log\left(\frac{80\%}{20\%}\right) = \log(4) \approx 1.4\$\$

is only about one standard error (\$0.22\$) away from the intercept. Everything looks consistent.
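As a quick numeric check (an illustrative Python sketch using only the figures quoted above):

```python
from math import log

# Figures quoted from the model output and the data description
intercept = 1.6416666
se_intercept = 0.2290133
p_positive = 1 - 0.1995          # about 80% of cases have H = 1

# Observed log odds of a positive outcome
log_odds = log(p_positive / (1 - p_positive))

# Distance between intercept and observed log odds, in standard errors
z = (intercept - log_odds) / se_intercept
print(round(log_odds, 2), round(z, 1))  # about 1.39 and 1.1
```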

### Detailed analysis

This generalized linear model supposes that the log odds of the response \$H\$ being \$1\$ when the independent variable \$X\$ has a particular value \$x\$ is some linear function of \$x\$,

\$\$\text{Log odds}(H=1\,|\,X=x) = \beta_0 + \beta_1 x.\tag{1}\$\$

The `glm` command in `R` estimated these unknown coefficients as \$\$\hat\beta_0 = 1.6416666 \pm 0.2290133\$\$ and \$\$\hat\beta_1 = -0.0014039 \pm 0.0009466.\$\$
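These estimates are on the log odds scale. The `exp(cbind(OR=coef(M1), confint(M1)))` line in the output simply exponentiates them, turning the intercept into odds and the coefficient of \$X\$ into an odds ratio per unit of \$X\$. A sanity check of that conversion (a Python sketch, mirroring the R computation):

```python
from math import exp

# Coefficient estimates from the summary output
b0 = 1.6416666
b1 = -0.0014039

# Exponentiating gives the "OR" column of the output
print(exp(b0))  # about 5.16377, the odds at X = 0
print(exp(b1))  # about 0.99860, the odds ratio per unit increase in X
```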

The dataset contains a large number \$n\$ of observations with various values of \$x\$, written \$x_i\$ for \$i=1, 2, \ldots, n\$, which range from \$82.23\$ to \$391.6\$ and average \$\bar x = 223.8\$. Formula \$(1)\$ enables us to compute the estimated probabilities of each outcome, \$\Pr(H=1\,|\,X=x_i)\$. If the model is any good, the average of those probabilities ought to be close to the average of the outcomes.

Since the odds are, by definition, the ratio of a probability to its complement, we can use simple algebra to find the estimated probabilities in terms of the log odds:

\$\$\widehat{\Pr}(H=1\,|\,X=x) = 1 - \frac{1}{1 + \exp\left(\hat\beta_0 + \hat\beta_1 x\right)}.\$\$
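A small sketch that evaluates this expression at the minimum, mean, and maximum of \$X\$ (illustrative only; the exact fitted values would require the raw data):

```python
from math import exp

b0 = 1.6416666
b1 = -0.0014039

def p_hat(x):
    # Estimated Pr(H = 1 | X = x) from the fitted log odds
    return 1 - 1 / (1 + exp(b0 + b1 * x))

# Minimum, mean, and maximum of X, as reported in the question;
# all three probabilities lie between about 0.75 and 0.82
for x in (82.23, 223.8, 391.6):
    print(round(p_hat(x), 3))
```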

As a nonlinear function of \$x\$, that's difficult to average. However, provided \$\beta_1 x\$ is small (much less than \$1\$ in size) and \$1+\exp(\hat\beta_0)\$ is not small (it exceeds \$6\$ in this case), we can safely use a linear approximation

\$\$\frac{1}{1 + \exp\left(\hat\beta_0 + \hat\beta_1 x\right)} = \frac{1}{1 + \exp(\hat\beta_0)}\left(1 - \hat\beta_1 x\, \frac{\exp(\hat\beta_0)}{1 + \exp(\hat\beta_0)}\right) + O\left(\hat\beta_1 x\right)^2.\$\$
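One can compare the exact value \$1/(1+\exp(\hat\beta_0+\hat\beta_1 x))\$ with this first-order approximation over the observed range of \$X\$ (a rough check, assuming only the summary figures above):

```python
from math import exp

b0 = 1.6416666
b1 = -0.0014039

def exact(x):
    return 1 / (1 + exp(b0 + b1 * x))

def linear(x):
    # First-order expansion around beta1 * x = 0
    q = 1 / (1 + exp(b0))          # about 0.1622
    w = exp(b0) / (1 + exp(b0))    # about 0.8378
    return q * (1 - b1 * x * w)

# Agreement is good near the mean of X and degrades slowly toward the maximum
for x in (82.23, 223.8, 391.6):
    print(round(exact(x), 4), round(linear(x), 4))
```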

Since the \$x_i\$ never exceed \$391.6\$, \$|\hat\beta_1 x_i|\$ never exceeds \$391.6 \times 0.0014039 \approx 0.55\$, so the approximation is adequate. Consequently, because the comparison of interest is with the \$19.95\%\$ of cases in the \$(H=0)\$ category, the average of the estimated probabilities of \$H=0\$ may be approximated as

\$\$\eqalign{ \frac{1}{n}\sum_{i=1}^n \widehat{\Pr}(H=0\,|\,X=x_i) &\approx \frac{1}{n}\sum_{i=1}^n \frac{1}{1 + \exp(\hat\beta_0)}\left(1 - \hat\beta_1 x_i\, \frac{\exp(\hat\beta_0)}{1 + \exp(\hat\beta_0)}\right) \cr &= 0.162238 + 0.000190814\, \bar{x} \cr &= 20.4943\%. }\$\$

Although that is not exactly equal to the \$19.95\%\$ observed in the data, it is more than close enough, because \$\hat\beta_1\$ has a relatively large standard error. For example, if \$\hat\beta_1\$ were increased by only about \$0.3\$ of its standard error, to \$-0.0011271\$, the corresponding calculation would reproduce the observed \$19.95\%\$.
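That averaging can be reproduced numerically (a sketch using only the reported mean \$\bar x = 223.8\$ and the coefficient estimates):

```python
from math import exp

b0 = 1.6416666
b1 = -0.0014039
x_bar = 223.8  # mean of X, as reported

q = 1 / (1 + exp(b0))           # about 0.162238
w = exp(b0) / (1 + exp(b0))     # about 0.837762
slope = -b1 * q * w             # about 0.000190814 per unit of X

avg_p0 = q + slope * x_bar      # approximate average Pr(H = 0)
print(round(avg_p0, 4))         # about 0.2049, versus 0.1995 observed
```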
