# Solved – Logistic regression simulation in order to show that intercept is biased when Y=1 is rare

I'm trying to simulate a logistic regression. My goal is showing that if `Y=1` is rare, than the intercept is biased. In my R script I define the logistic regression model through the latent variable's approach (see for example pp. 140 http://gking.harvard.edu/files/abs/0s-abs.shtml):

``x   <- rnorm(10000)  b0h <- numeric(1000) b1h <- numeric(1000)  for(i in 1:1000){   eps <- rlogis(10000)   eta <- 1+2*x+eps   y   <-numeric(10000)   y   <- ifelse (eta>0,1,0)    m      <- glm(y~x,family=binomial)   b0h[i] <- coef(m)   b1h[i] <- coef(m) }  mean(b0h) mean(b1h) hist(b0h) hist(b1h) ``

The problem here is that I don't know how to force the observations y to be balanced before (50:50), then unbalanced (90:10). As we can see with the function table(), in my script the proportion of ones is random.

``table(y) ``

How to solve this problem?

Contents

Logistic regression doesn't really have an error term. Alternatively, you can think of the response distribution (the binomial) as having its random component intrinsically 'built-in' (for more, it may help to read my answer here: Difference between logit and probit models). As a result, I think it is conceptually clearer to generate data for simulations directly from a binomial parameterized as the logistic transformation of the structural component of the model, rather than use the logistic as a sort of error term.

From there, if you want the long run probability that \$Y = 1\$ to be \$.5\$ or \$.1\$, you just need your structural component to be balanced around \$0\$ (for \$.5\$), or \$-2.197225\$ (for \$.1\$). I got those values by converting the response probability to the log odds:
\$\$ log(text{odds}(Y=1)) = frac{exp(Pr(Y = 1))}{(1+exp(Pr(Y = 1))} \$\$ The most convenient way to do this will be to use those values for your intercept (\$beta_0\$) and have your slope be \$0\$. (Alternatively, you can use any two parameter values, \$beta_0\$ and \$beta_1\$, that you like such that, given your \$X\$ values, the log mean odds equals, e.g., \$-2.197225\$.) Here is an example with `R` code:

``lo2p = function(lo){      # this function will perform the logistic transformation   odds = exp(lo)          #   of the structural component of the data generating   p    = odds / (1+odds)  #   process   return(p) }  N     = 1000              # these are the true values of the DGP beta0 = -2.197225 beta1 = 0  set.seed(8361)            # this makes the simulation exactly reproducible x     = rnorm(N) lo    = beta0 + beta1*x p     = lo2p(lo)          # these will be the parameters of the binomial response  b0h   = vector(length=N)  # these will store the results b1h   = vector(length=N) y1prp = vector(length=N)  # (for the proportion of y=1)  for(i in 1:1000){         # here is the simulation   y        = rbinom(n=N, size=1, prob=p)   m        = glm(y~x, family=binomial)   b0h[i]   = coef(m)   b1h[i]   = coef(m)   y1prp[i] = mean(y) }  mean(b0h)                 # these are the results #  -2.205844 mean(b1h) #  -0.0003422177 mean(y1prp) #  0.100036 hist(b0h) hist(b1h) ``  Rate this post