Solved – Statistical methods for data where only a minimum/maximum value is known

Is there a branch of statistics that deals with data for which exact values are not known, but for each individual, we know either a maximum or minimum bound to the value?

I suspect that my problem stems largely from the fact that I am struggling to articulate it in statistical terms, but hopefully an example will help to clarify:

Say there are two connected populations $A$ and $B$ such that, at some point, members of $A$ may "transition" into $B$, but the reverse is not possible. The timing of the transition is variable, but non-random. For example, $A$ could be "individuals without offspring" and $B$ "individuals with at least one offspring". I am interested in the age at which this transition occurs, but I only have cross-sectional data. For any given individual, I can find out if they belong to $A$ or $B$. I also know the age of these individuals. For each individual in population $A$, I know that the age at transition will be GREATER THAN their current age. Likewise, for members of $B$, I know that the age at transition was LESS THAN their current age. But I don't know the exact values.

Say I have some other factor that I want to compare with the age of transition. For example, I want to know whether an individual's subspecies or body size affects the age of first offspring. I definitely have some useful information that should inform those questions: on average, of the individuals in $A$, older individuals will have a later transition. But the information is imperfect, particularly for younger individuals. And vice versa for population $B$.

Are there established methods to deal with this sort of data? I do not necessarily need a full method of how to carry out such an analysis, just some search terms or useful resources to start me off in the right place!

Caveats: I am making the simplifying assumption that transition from $A$ to $B$ is instantaneous. I am also prepared to assume that most individuals will at some point progress to $B$, assuming they live long enough. And I realise that longitudinal data would be very helpful, but assume that it is not available in this case.

Apologies if this is a duplicate, as I said, part of my problem is that I don't know what I should be searching for. For the same reason, please add other tags if appropriate.

Sample dataset: Ssp indicates one of two subspecies, $X$ or $Y$. Offspring indicates either no offspring ($A$) or at least one offspring ($B$)

    age ssp offsp
     21   Y     A
     20   Y     B
     26   X     B
     33   X     B
     33   X     A
     24   X     B
     34   Y     B
     22   Y     B
     10   Y     B
     20   Y     A
     44   X     B
     18   Y     A
     11   Y     B
     27   X     A
     31   X     B
     14   Y     B
     41   X     B
     15   Y     A
     33   X     B
     24   X     B
     11   Y     A
     28   X     A
     22   X     B
     16   Y     A
     16   Y     B
     24   Y     B
     20   Y     B
     18   X     B
     21   Y     B
     16   Y     B
     24   Y     A
     39   X     B
     13   Y     A
     10   Y     B
     18   Y     A
     16   Y     A
     21   X     A
     26   X     B
     11   Y     A
     40   X     B
      8   Y     A
     41   X     B
     29   X     B
     53   X     B
     34   X     B
     34   X     B
     15   Y     A
     40   X     B
     30   X     A
     40   X     B

Edit: example dataset changed as it wasn't very representative

This is referred to as current status data. You get one cross sectional view of the data, and regarding the response, all you know is that at the observed age of each subject, the event (in your case: transitioning from A to B) has happened or not. This is a special case of interval censoring.

To formally define it, let $T_i$ be the (unobserved) true event time for subject $i$. Let $C_i$ be the inspection time for subject $i$ (in your case: age at inspection). If $C_i < T_i$, the data are right censored. Otherwise, the data are left censored. We are interested in modeling the distribution of $T$. For regression models, we are interested in modeling how that distribution changes with a set of covariates $X$.

To analyze this using interval censoring methods, you want to put your data into the general interval censoring format. That is, for each subject, we have the interval $(l_i, r_i)$, which represents the interval in which we know $T_i$ to be contained. So if subject $i$ is right censored at inspection time $c_i$, we would write $(c_i, \infty)$. If it is left censored at $c_i$, we would represent it as $(0, c_i)$.
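As a rough sketch of that conversion in base R: the helper name `to_interval` and the column names `l` and `r` are assumptions chosen to match the notation above, not part of any package. The example uses the first five rows of the sample dataset in the question, with membership in $B$ meaning the event has already occurred.

```r
# Hypothetical helper (not from icenReg or any other package):
# turn current status data into interval-censored bounds (l, r).
to_interval <- function(age, event_occurred) {
  stopifnot(length(age) == length(event_occurred))
  data.frame(
    # event already happened -> left censored: (0, age)
    # event not yet happened -> right censored: (age, Inf)
    l = ifelse(event_occurred, 0, age),
    r = ifelse(event_occurred, age, Inf)
  )
}

# First five rows of the sample dataset in the question
age   <- c(21, 20, 26, 33, 33)
offsp <- c("A", "B", "B", "B", "A")

intervals <- to_interval(age, offsp == "B")
print(intervals)
```

Whether the interval endpoints are treated as open or closed is glossed over here; for continuous event times it should not matter.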

Shameless plug: if you want to use regression models to analyze your data, this can be done in R using icenReg (I'm the author). In fact, in a similar question about current status data, the OP put up a nice demo of using icenReg. He starts by showing that ignoring the censoring part and using logistic regression leads to bias (important note: he is referring to using logistic regression without adjusting for age. More on this later.)

Another great package is interval, which contains log-rank statistic tests, among other tools.

EDIT:

@EdM suggested using logistic regression to answer the problem. I was unfairly dismissive of this, saying that you would have to worry about the functional form of time. While I stand behind the statement that you should worry about the functional form of time, I realized that there was a very reasonable transformation that leads to a reasonable parametric estimator.

In particular, if we use log(time) as a covariate in our model with logistic regression, we end up with a proportional odds model with a log-logistic baseline.

To see this, first consider that the proportional odds regression model is defined as

$\text{Odds}(t|X, \beta) = e^{X^T \beta} \text{Odds}_o(t)$

where $\text{Odds}_o(t)$ is the baseline odds of survival at time $t$. Note that the regression effects are the same as with logistic regression. So all we need to do now is show that the baseline distribution is log-logistic.

Now consider a logistic regression with log(Time) as a covariate. We then have

$P(Y = 1 | T = t) = \frac{\exp(\beta_0 + \beta_1 \log(t))}{1 + \exp(\beta_0 + \beta_1 \log(t))}$

With a little work, you can see this as the CDF of a log-logistic model (with a non-linear transformation of the parameters).
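One way to spell out that "little work" is sketched below. The symbols $\alpha$ (scale) and $\gamma$ (shape) are notation assumed here for the log-logistic parameters; $\gamma$ is used rather than $\beta$ to avoid a clash with the regression coefficients.

```latex
\begin{align*}
F(t) &= \frac{(t/\alpha)^{\gamma}}{1 + (t/\alpha)^{\gamma}},
  \qquad \alpha > 0,\ \gamma > 0 \\
(t/\alpha)^{\gamma} &= \exp\bigl(\gamma \log(t) - \gamma \log(\alpha)\bigr)
  = \exp\bigl(\beta_0 + \beta_1 \log(t)\bigr) \\
\beta_1 &= \gamma, \qquad \beta_0 = -\gamma \log(\alpha)
\end{align*}
```

Substituting the second line into the first gives the logistic form, and conversely $\gamma = \beta_1$ and $\alpha = \exp(-\beta_0 / \beta_1)$. If $Y = 1$ instead codes "event has not yet occurred", the same algebra applied to $1 - F(t)$ flips the sign of both $\beta_0$ and $\beta_1$.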

R demonstration that the fits are equivalent:

    > library(icenReg)
    > data(miceData)
    > 
    > ## miceData contains current status data about presence 
    > ## of tumors at sacrifice in two groups
    > ## in interval censored format: 
    > ## l = lower end of interval, u = upper end
    > ## first three mice all left censored
    > 
    > head(miceData, 3)
      l   u grp
    1 0 381  ce
    2 0 477  ce
    3 0 485  ce
    > 
    > ## To fit this with logistic regression, 
    > ## we need to extract age at sacrifice
    > ## if the observation is left censored, 
    > ## this is the upper end of the interval
    > ## if right censored, is the lower end of interval
    > 
    > age <- numeric()
    > isLeftCensored <- miceData$l == 0
    > age[isLeftCensored] <- miceData$u[isLeftCensored]
    > age[!isLeftCensored] <- miceData$l[!isLeftCensored]
    > 
    > log_age <- log(age)
    > resp <- !isLeftCensored
    > 
    > 
    > ## Fitting logistic regression model
    > logReg_fit <- glm(resp ~ log_age + grp, 
    +                     data = miceData, family = binomial)
    > 
    > ## Fitting proportional odds regression model with log-logistic baseline
    > ## interval censored model
    > ic_fit <- ic_par(cbind(l,u) ~ grp, 
    +            model = 'po', dist = 'loglogistic', data = miceData)
    > 
    > summary(logReg_fit)

    Call:
    glm(formula = resp ~ log_age + grp, family = binomial, data = miceData)

    Deviance Residuals: 
        Min       1Q   Median       3Q      Max  
    -2.1413  -0.8052   0.5712   0.8778   1.8767  

    Coefficients:
                Estimate Std. Error z value Pr(>|z|)   
    (Intercept)  18.3526     6.7149   2.733  0.00627 **
    log_age      -2.7203     1.0414  -2.612  0.00900 **
    grpge        -1.1721     0.4713  -2.487  0.01288 * 
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 196.84  on 143  degrees of freedom
    Residual deviance: 160.61  on 141  degrees of freedom
    AIC: 166.61

    Number of Fisher Scoring iterations: 5

    > summary(ic_fit)

    Model:  Proportional Odds
    Baseline:  loglogistic 
    Call: ic_par(formula = cbind(l, u) ~ grp, data = miceData, model = "po", 
        dist = "loglogistic")

              Estimate Exp(Est) Std.Error z-value        p
    log_alpha    6.603 737.2000   0.07747  85.240 0.000000
    log_beta     1.001   2.7200   0.38280   2.614 0.008943
    grpge       -1.172   0.3097   0.47130  -2.487 0.012880

    final llk =  -80.30575 
    Iterations =  10 
    > 
    > ## Comparing loglikelihoods
    > logReg_fit$deviance/(-2) - ic_fit$llk
    [1] 2.643219e-12

Note that the effect of grp is the same in each model, and the final log-likelihood differs only by numeric error. The baseline parameters (i.e. intercept and log_age for logistic regression, alpha and beta for the interval censored model) are different parameterizations so they are not equal.

So there you have it: using logistic regression with log(time) as a covariate is equivalent to fitting the proportional odds model with a log-logistic baseline distribution. If you're okay with fitting this parametric model, logistic regression is quite reasonable. I do caution that with interval censored data, semi-parametric models are typically favored due to the difficulty of assessing model fit, but if I truly thought there was no place for fully-parametric models I would not have included them in icenReg.
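For the sample dataset in the question, a minimal base R sketch of this approach might look as follows. The variable `has_offspring` and the choice to code membership in $B$ as 1 are assumptions made here, so the fitted curve in log(age) plays the role of the CDF of the transition age rather than the survival function. No output is shown, and whether the log-logistic form is adequate for these data is not checked.

```r
# Sample dataset from the question (50 individuals)
dat <- read.table(header = TRUE, text = "
age ssp offsp
21 Y A
20 Y B
26 X B
33 X B
33 X A
24 X B
34 Y B
22 Y B
10 Y B
20 Y A
44 X B
18 Y A
11 Y B
27 X A
31 X B
14 Y B
41 X B
15 Y A
33 X B
24 X B
11 Y A
28 X A
22 X B
16 Y A
16 Y B
24 Y B
20 Y B
18 X B
21 Y B
16 Y B
24 Y A
39 X B
13 Y A
10 Y B
18 Y A
16 Y A
21 X A
26 X B
11 Y A
40 X B
8 Y A
41 X B
29 X B
53 X B
34 X B
34 X B
15 Y A
40 X B
30 X A
40 X B
")

# Assumed coding: 1 = already transitioned to B (event has occurred)
dat$has_offspring <- as.integer(dat$offsp == "B")

# Logistic regression with log(age): a proportional odds model
# with a log-logistic baseline, per the argument above
fit <- glm(has_offspring ~ log(age) + ssp, family = binomial, data = dat)
summary(fit)
```

Under this coding, the `sspY` coefficient is the log odds ratio, at any given age, of subspecies $Y$ having already transitioned relative to subspecies $X$.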
