I'm trying to simulate a logistic regression. My goal is to show that if `Y=1` is rare, then the intercept is biased. In my R script I define the logistic regression model through the latent variable approach (see, for example, p. 140 of http://gking.harvard.edu/files/abs/0s-abs.shtml):

```r
x   <- rnorm(10000)
b0h <- numeric(1000)
b1h <- numeric(1000)
for(i in 1:1000){
  eps <- rlogis(10000)
  eta <- 1 + 2*x + eps
  y   <- ifelse(eta > 0, 1, 0)
  m   <- glm(y ~ x, family=binomial)
  b0h[i] <- coef(m)[1]
  b1h[i] <- coef(m)[2]
}
mean(b0h)
mean(b1h)
hist(b0h)
hist(b1h)
```

The problem is that I don't know how to force the observations `y` to be balanced first (50:50) and then unbalanced (90:10). As the function `table()` shows, the proportion of ones in my script is random:

```r
table(y)
```

How to solve this problem?


#### Best Answer

Logistic regression doesn't really have an error term. Instead, you can think of the response distribution (the binomial) as having its random component intrinsically 'built-in' (for more, it may help to read my answer here: Difference between logit and probit models). As a result, I think it is conceptually clearer to generate data for simulations directly from a binomial whose probability is the logistic transformation of the structural component of the model, rather than to use the logistic as a sort of error term.
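For instance, here is a minimal sketch of that approach (using base R's `plogis()` as the inverse logit; the parameter values $\beta_0 = 0$ and $\beta_1 = 2$ are purely illustrative):

```r
# Minimal sketch: draw y directly from a binomial whose probability is the
# inverse logit of the structural component. plogis() is base R's inverse logit.
N <- 10000
x <- rnorm(N)
p <- plogis(0 + 2*x)              # Pr(Y = 1 | x), illustrative beta0 = 0, beta1 = 2
y <- rbinom(n=N, size=1, prob=p)  # no separate 'error term' is added
table(y)                          # close to 50:50, since the log odds are centered at 0
```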

From there, if you want the long-run probability that $Y = 1$ to be $.5$ or $.1$, you just need the structural component to be centered at $0$ (for $.5$) or at $-2.197225$ (for $.1$). I got those values by converting the response probability to the log odds:

$$ \log(\text{odds}(Y=1)) = \log\left(\frac{\Pr(Y = 1)}{1-\Pr(Y = 1)}\right) $$

For example, $\log(.1/.9) = -2.197225$. The most convenient way to do this is to use those values for your intercept ($\beta_0$) and set your slope to $0$. (Alternatively, you can use any two parameter values, $\beta_0$ and $\beta_1$, that you like, such that, given your $X$ values, the mean log odds equals, e.g., $-2.197225$.) Here is an example with `R` code:

```r
lo2p = function(lo){          # this function will perform the logistic transformation
  odds = exp(lo)              # of the structural component of the data generating
  p    = odds / (1+odds)      # process
  return(p)
}

N     = 1000                  # these are the true values of the DGP
beta0 = -2.197225
beta1 = 0

set.seed(8361)                # this makes the simulation exactly reproducible
x  = rnorm(N)
lo = beta0 + beta1*x
p  = lo2p(lo)                 # these will be the parameters of the binomial response

b0h   = vector(length=N)      # these will store the results
b1h   = vector(length=N)
y1prp = vector(length=N)      # (for the proportion of y=1)

for(i in 1:1000){             # here is the simulation
  y        = rbinom(n=N, size=1, prob=p)
  m        = glm(y~x, family=binomial)
  b0h[i]   = coef(m)[1]
  b1h[i]   = coef(m)[2]
  y1prp[i] = mean(y)
}

mean(b0h)                     # these are the results
# [1] -2.205844
mean(b1h)
# [1] -0.0003422177
mean(y1prp)
# [1] 0.100036
hist(b0h)
hist(b1h)
```
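Incidentally, base R's `qlogis()` and `plogis()` compute the logit and inverse logit directly, so one quick way to check those intercept values is:

```r
qlogis(0.5)        # [1] 0          intercept for a 50:50 split
qlogis(0.1)        # [1] -2.197225  intercept for a 90:10 split
plogis(-2.197225)  # [1] 0.1        back-transform: long-run proportion of y = 1
```

Setting `beta0 = 0` in the simulation above gives you the balanced 50:50 case instead.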
