Solved – Logistic regression simulation to show that the intercept is biased when Y=1 is rare

I'm trying to simulate a logistic regression. My goal is to show that if `Y=1` is rare, then the intercept estimate is biased. In my R script I define the logistic regression model through the latent-variable approach (see, for example, p. 140 of http://gking.harvard.edu/files/abs/0s-abs.shtml):

```r
x   <- rnorm(10000)
b0h <- numeric(1000)
b1h <- numeric(1000)

for(i in 1:1000){
  eps <- rlogis(10000)
  eta <- 1 + 2*x + eps
  y   <- ifelse(eta > 0, 1, 0)

  m      <- glm(y ~ x, family = binomial)
  b0h[i] <- coef(m)[1]
  b1h[i] <- coef(m)[2]
}

mean(b0h)
mean(b1h)
hist(b0h)
hist(b1h)
```

The problem is that I don't know how to force the proportion of ones in `y` to be fixed: first balanced (50:50), then unbalanced (90:10). As the function `table()` shows, in my script the proportion of ones is random:

```r
table(y)
```

How to solve this problem?


$$ \Pr(Y = 1) = \frac{\exp(\log(\text{odds}(Y=1)))}{1+\exp(\log(\text{odds}(Y=1)))} $$

The most convenient way to do this is to use that value for your intercept ($\beta_0$) and have your slope be $0$. (Alternatively, you can use any two parameter values, $\beta_0$ and $\beta_1$, that you like such that, given your $X$ values, the log mean odds equals, e.g., $-2.197225 = \log(0.1/0.9)$, the log odds corresponding to 10% ones.) Here is an example with `R` code:
```r
lo2p = function(lo){      # this function performs the logistic transformation
  odds = exp(lo)          #   of the structural component of the data generating
  p    = odds / (1+odds)  #   process
  return(p)
}

N     = 1000              # these are the true values of the DGP
beta0 = -2.197225
beta1 = 0

set.seed(8361)            # this makes the simulation exactly reproducible
x     = rnorm(N)
lo    = beta0 + beta1*x
p     = lo2p(lo)          # these will be the parameters of the binomial response

b0h   = vector(length=N)  # these will store the results
b1h   = vector(length=N)
y1prp = vector(length=N)  # (for the proportion of y=1)

for(i in 1:1000){         # here is the simulation
  y        = rbinom(n=N, size=1, prob=p)
  m        = glm(y~x, family=binomial)
  b0h[i]   = coef(m)[1]
  b1h[i]   = coef(m)[2]
  y1prp[i] = mean(y)
}

mean(b0h)                 # these are the results
# [1] -2.205844
mean(b1h)
# [1] -0.0003422177
mean(y1prp)
# [1] 0.100036
hist(b0h)
hist(b1h)
```
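The same recipe covers the balanced (50:50) case asked about in the question: odds of $.5/.5 = 1$ correspond to a log odds of $0$, so an intercept of $0$ (with slope $0$) yields ones half the time on average. A minimal sketch along those lines, using base R's `plogis()` in place of the hand-rolled `lo2p()` (the seed here is arbitrary):

```r
N     = 1000
beta0 = log(0.5/0.5)          # = 0: the log odds giving Pr(Y=1) = 0.5 (balanced case)
beta1 = 0                     # slope of 0, as in the example above

set.seed(1234)                # arbitrary seed, for reproducibility
x = rnorm(N)
p = plogis(beta0 + beta1*x)   # base R's logistic transformation, same as lo2p()

y = rbinom(n=N, size=1, prob=p)
mean(y)                       # close to 0.5, by construction
table(y)
```

Note that the question's latent-variable formulation draws from the same distribution: `ifelse(beta0 + beta1*x + rlogis(N) > 0, 1, 0)` is equivalent, because the logistic CDF is exactly `plogis()`.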