# Solved – How to apply the rare events corrections described in King and Zeng (2001)

I have a dataset with a binary (survival) response variable and 3 explanatory variables (`A` = 3 levels, `B` = 3 levels, `C` = 6 levels). The data are well balanced, with 100 individuals per `ABC` category. I have already studied the effects of `A`, `B`, and `C` with this dataset; their effects are significant.

I also have a subset: in each `ABC` category, 25 of the 100 individuals were further investigated for a 4th variable (`D`). These 25 were chosen so that approximately half are alive and half are dead (when fewer than 12 individuals of one outcome were available, the number was completed with individuals of the other outcome). I see three problems here:

1. I need to weight the data using the rare events corrections described in King and Zeng (2001), to account for the fact that the approximately 50%–50% split of 0s and 1s in the subset is not equal to the 0/1 proportion in the bigger sample.
2. This non-random sampling of 0s and 1s leads to a different sampling probability for individuals in each of the `ABC` categories, so I think I have to use the true proportions from each category rather than the global proportion of 0/1 in the big sample.
3. This 4th variable has 4 levels, and the data are really not balanced across these levels (90% of the data fall within one of them, say level `D2`).

I have read the King and Zeng (2001) paper carefully, as well as this CV question that led me to the King and Zeng (2001) paper, and later this other one that led me to try the `logistf` package (I use R).
I tried to apply what I understood from King and Zeng (2001), but I am not sure what I did is right. I understood there are two methods:

• For the prior correction method, I understood you only correct the intercept. In my case, the intercept corresponds to the `A1B1C1` category, and in this category survival is 100%, so survival in the big dataset and in the subset is the same, and the correction changes nothing. I suspect this method should not apply to my case anyway, because I do not have an overall true proportion but one proportion per category, and this method ignores that.
• For the weighting method: I calculated $w_i$, and from what I understood in the paper: "All researchers need to do is to calculate wi in Eq. (8), choose it as the weight in their computer program, and then run a logit model". So I first ran my `glm` as:

`glm(R ~ A + B + C + D, weights = wi, data = subdata, family = binomial)`

I am not sure I should include `A`, `B`, and `C` as explanatory variables, since I normally expect them to have no effect on survival in this subsample (each category contains about 50% dead and 50% alive). Anyway, it should not change the output much if they are not significant. With this correction, I get a good fit for level `D2` (the level with the most individuals), but not at all for the other levels of `D` (`D2` preponderates). See the top-right graph: fits of a non-weighted `glm` model and of a `glm` model weighted with $w_i$. Each dot represents one category. `Proportion in the big dataset` is the true proportion of 1s in the `ABC` category in the big dataset, `Proportion in the sub dataset` is the true proportion of 1s in the `ABC` category in the sub-dataset, and `Model predictions` are the predictions of `glm` models fitted to the sub-dataset. Each `pch` symbol represents a given level of `D`; triangles are level `D2`.
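For concreteness, here is a minimal sketch of how I computed the per-category $w_i$ (Eq. (8) gives $w_i = \tau/\bar{y}$ for the 1s and $(1-\tau)/(1-\bar{y})$ for the 0s). Everything here is a simulated stand-in for my real data: the data frames `big` and `sub`, the helper `take_5050`, and the proportions are all made up for illustration, and `D` is omitted:

```r
# Simulated stand-in data: `big` is the full balanced dataset,
# `sub` the ~50/50-within-category subsample (D omitted for brevity)
set.seed(42)
big <- expand.grid(A = paste0("A", 1:3), B = paste0("B", 1:3),
                   C = paste0("C", 1:6))
big <- big[rep(seq_len(nrow(big)), each = 100), ]
big$R <- rbinom(nrow(big), 1, 0.85)        # survival is the common outcome

take_5050 <- function(d) {                 # pick ~half alive, ~half dead
  ones <- d[d$R == 1, ]; zeros <- d[d$R == 0, ]
  n1 <- min(13, nrow(ones))
  rbind(head(ones, n1), head(zeros, 25 - n1))
}
cat_big <- interaction(big$A, big$B, big$C)
sub <- do.call(rbind, lapply(split(big, cat_big), take_5050))
cat_sub <- interaction(sub$A, sub$B, sub$C)

# Eq. (8) weights, computed per ABC category instead of globally:
# tau  = true proportion of 1s in the category (from the big dataset)
# ybar = proportion of 1s in the category in the subsample
tau  <- tapply(big$R, cat_big, mean)[as.character(cat_sub)]
ybar <- tapply(sub$R, cat_sub, mean)[as.character(cat_sub)]
sub$wi <- ifelse(sub$R == 1, tau / ybar, (1 - tau) / (1 - ybar))

# Weighted logit as in the quoted sentence from the paper
fit <- glm(R ~ A + B + C, weights = wi, data = sub, family = binomial)
```

One sanity check: within each category, the $w_i$-weighted mean of `R` recovers the category's true proportion exactly. Note also that the usual `glm` standard errors are not valid under this weighting; King and Zeng (2001) derive a corrected variance estimator, which is one reason to prefer a packaged implementation.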

Only later, on seeing that the `logistf` package exists, did I begin to think this is perhaps not that simple; I am not sure now. When doing `logistf(R ~ A + B + C + D, weights = wi, data = subdata, family = binomial)`,
I get estimates, but the `predict` function does not work, and the default model test returns infinite chi-squared values and p-values of 0 (except for one term).

Questions:

• Did I properly understand King and Zeng (2001)? (How far am I from understanding it?)
• In my `glm` fits, `A`, `B`, and `C` have significant effects. Doesn't this just mean that the proportions of 0s and 1s in my subset depart a lot from half/half, and do so differently in the different `ABC` categories?
• Can I apply King and Zeng's (2001) weighting correction despite the fact that I have a value of $\tau$ and a value of $\bar{y}$ for each `ABC` category instead of global values?
• Is it an issue that my `D` variable is so unbalanced, and if it is, how can I handle it? (Taking into account that I will already have to weight for the rare event correction… Is "double weighting", i.e. weighting the weights, possible?)
Thanks!

Edit: See what happens if I remove `A`, `B`, and `C` from the models. I do not understand why there is such a difference. Fits without `A`, `B`, and `C` as explanatory variables in the models

## Answer

The `logistf()` function does not implement rare event logistic regression; that is done by the `relogit()` function in the Zelig package, on CRAN. You should try that one!
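In case it is useful, the prior-correction route mentioned in the question can also be done by hand in base R: fit the ordinary logit on the subsample and subtract $\ln\left[\frac{1-\tau}{\tau}\cdot\frac{\bar{y}}{1-\bar{y}}\right]$ from the estimated intercept, as described in King and Zeng (2001). Below is a minimal sketch with one covariate and a single global $\tau$; all data and names are simulated for illustration, and the per-category situation in the question would need the analogous per-category adjustment:

```r
# Simulated population with a fairly rare event and one covariate
set.seed(1)
N <- 200000
x <- rnorm(N)
y <- rbinom(N, 1, plogis(-2 + x))   # true intercept -2, true slope 1
tau <- mean(y)                      # true population proportion of 1s

# Case-control style subsample: equal numbers of 1s and 0s
idx1 <- sample(which(y == 1), 5000)
idx0 <- sample(which(y == 0), 5000)
d <- data.frame(x = x[c(idx1, idx0)], y = y[c(idx1, idx0)])
ybar <- mean(d$y)                   # 0.5 by construction

# Ordinary logit on the subsample: slope is fine, intercept is biased
fit <- glm(y ~ x, data = d, family = binomial)

# Prior correction: only the intercept needs adjusting
b0_corrected <- coef(fit)[1] - log(((1 - tau) / tau) * (ybar / (1 - ybar)))
```

Here `b0_corrected` should land close to the true intercept of $-2$, while the slope estimate is already consistent under this kind of outcome-dependent sampling. `relogit()` automates this (and the weighting and variance corrections) for you.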