Is there a branch of statistics that deals with data for which **exact values are not known**, but for each individual, **we know either a maximum or minimum bound to the value**?

I suspect that my problem stems largely from the fact that I am struggling to articulate it in statistical terms, but hopefully an example will help to clarify:

Say there are two connected populations $A$ and $B$ such that, **at some point, members of $A$ may "transition" into $B$**, but the reverse is not possible. The timing of the transition is variable, but non-random. For example, $A$ could be "individuals without offspring" and $B$ "individuals with at least one offspring". I am interested in the age at which this progression occurs, but I only have cross-sectional data. For any given individual, I can find out whether they belong to $A$ or $B$. I also know the age of these individuals. For each individual in population $A$, I know that the age at transition will be GREATER THAN their current age. Likewise, for members of $B$, I know that the age at transition was LESS THAN their current age. But I don't know the exact values.

Say I have some other factor that I want to compare with the age of transition. For example, I want to know whether an individual's subspecies or body size affects the age of first offspring. I definitely have some useful information that should inform those questions: on average, of the individuals in $A$, older individuals will have a later transition. But **the information is imperfect**, particularly for younger individuals. And vice versa for population $B$.

**Are there established methods to deal with this sort of data**? I do not necessarily need a full method of how to carry out such an analysis, just some search terms or useful resources to start me off in the right place!

Caveats: I am making the simplifying assumption that transition from $A$ to $B$ is instantaneous. I am also prepared to assume that most individuals will at some point progress to $B$, assuming they live long enough. And I realise that longitudinal data would be very helpful, but assume that it is not available in this case.

Apologies if this is a duplicate; as I said, part of my problem is that I don't know what I should be searching for. For the same reason, please add other tags if appropriate.

Sample dataset: `ssp` indicates one of two subspecies, $X$ or $Y$. `offsp` indicates either no offspring ($A$) or at least one offspring ($B$).

```
age ssp offsp
 21   Y     A
 20   Y     B
 26   X     B
 33   X     B
 33   X     A
 24   X     B
 34   Y     B
 22   Y     B
 10   Y     B
 20   Y     A
 44   X     B
 18   Y     A
 11   Y     B
 27   X     A
 31   X     B
 14   Y     B
 41   X     B
 15   Y     A
 33   X     B
 24   X     B
 11   Y     A
 28   X     A
 22   X     B
 16   Y     A
 16   Y     B
 24   Y     B
 20   Y     B
 18   X     B
 21   Y     B
 16   Y     B
 24   Y     A
 39   X     B
 13   Y     A
 10   Y     B
 18   Y     A
 16   Y     A
 21   X     A
 26   X     B
 11   Y     A
 40   X     B
  8   Y     A
 41   X     B
 29   X     B
 53   X     B
 34   X     B
 34   X     B
 15   Y     A
 40   X     B
 30   X     A
 40   X     B
```

Edit: example dataset changed as it wasn't very representative


#### Best Answer

This is referred to as **current status data**. You get one cross sectional view of the data, and regarding the response, all you know is that at the observed age of each subject, the event (in your case: transitioning from A to B) has happened or not. This is a special case of **interval censoring**.

To formally define it, let $T_i$ be the (unobserved) true event time for subject $i$. Let $C_i$ be the inspection time for subject $i$ (in your case: age at inspection). If $C_i < T_i$, the data are *right censored*. Otherwise, the data are *left censored*. We are interested in modeling the distribution of $T$. For regression models, we are interested in modeling how that distribution changes with a set of covariates $X$.
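To make the definition concrete, here is a minimal simulation sketch in base R. The log-normal event times and uniform inspection ages are arbitrary assumptions chosen only for illustration; they are not taken from the question or from any package.

```r
set.seed(1)
n <- 10

# Unobserved true event times T_i (assumed log-normal, purely for illustration)
T_true <- rlnorm(n, meanlog = log(20), sdlog = 0.4)

# Inspection times C_i (assumed uniform between ages 5 and 50)
C <- runif(n, min = 5, max = 50)

# All the analyst ever sees: the inspection time and whether the event
# had already happened by then
observed <- data.frame(
  C         = round(C, 1),
  censoring = ifelse(C < T_true, "right", "left")
)
print(observed)
```

Note that `T_true` never enters `observed` directly; only its comparison with `C` does, which is what makes this current status data.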

To analyze this using interval censoring methods, you want to put your data into the general interval censoring format. That is, for each subject, we have the interval $(l_i, r_i)$, which represents the interval in which we know $T_i$ to be contained. So if subject $i$ is right censored at inspection time $c_i$, we would write $(c_i, \infty)$. If it is left censored at $c_i$, we would represent it as $(0, c_i)$.
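As a sketch of that conversion, the base R snippet below recodes the first six rows of the sample dataset from the question. The helper name `to_interval` and the column names `l` and `r` are illustrative choices, not part of any package.

```r
# First six rows of the sample dataset from the question
dat <- data.frame(
  age   = c(21, 20, 26, 33, 33, 24),
  ssp   = c("Y", "Y", "X", "X", "X", "X"),
  offsp = c("A", "B", "B", "B", "A", "B")
)

# Group A: no transition yet, so T lies in (age, Inf) -> right censored
# Group B: transition already happened, so T lies in (0, age) -> left censored
to_interval <- function(age, status) {
  data.frame(
    l = ifelse(status == "A", age, 0),
    r = ifelse(status == "A", Inf, age)
  )
}

ic_dat <- cbind(dat, to_interval(dat$age, dat$offsp))
print(ic_dat)
```

Each row now carries the pair $(l_i, r_i)$ that interval-censoring software typically expects as the response.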

Shameless plug: if you want to use regression models to analyze your data, this can be done in R using `icenReg` (I'm the author). In fact, in a similar question about current status data, the OP put up a nice demo of using icenReg. He starts by showing that ignoring the censoring part and using logistic regression leads to bias (important note: he is referring to using logistic regression *without adjusting for age*. More on this later.)

Another great package is `interval`, which contains log-rank statistic tests, among other tools.

**EDIT:**

@EdM suggested using logistic regression to answer the problem. I was unfairly dismissive of this, saying that you would have to worry about the functional form of time. While I stand behind the statement that you should worry about the functional form of time, I realized that there was a very reasonable transformation that leads to a reasonable parametric estimator.

In particular, if we use log(time) as a covariate in our model with logistic regression, we end up with a proportional odds model with a log-logistic baseline.

To see this, first consider that the proportional odds regression model is defined as

$\text{Odds}(t \mid X, \beta) = e^{X^T \beta} \, \text{Odds}_o(t)$

where $\text{Odds}_o(t)$ is the baseline odds of survival at time $t$. Note that the regression effects are the same as with logistic regression. So all we need to do now is show that the baseline distribution is log-logistic.

Now consider a logistic regression with log(Time) as a covariate. We then have

$P(Y = 1 \mid T = t) = \frac{\exp(\beta_0 + \beta_1 \log(t))}{1 + \exp(\beta_0 + \beta_1 \log(t))}$

With a little work, you can see this as the CDF of a log-logistic model (with a non-linear transformation of the parameters).
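One way to fill in that step is sketched below. It assumes the common scale/shape parameterization of the log-logistic distribution, with scale $\alpha$ and shape $\beta$ (these labels are an assumption and need not match any package's internal parameterization). It also assumes $Y = 1$ means the event has occurred by time $t$, with $\beta_1 > 0$.

$$
\frac{\exp(\beta_0 + \beta_1 \log t)}{1 + \exp(\beta_0 + \beta_1 \log t)}
= \frac{e^{\beta_0} t^{\beta_1}}{1 + e^{\beta_0} t^{\beta_1}}
= \frac{(t/\alpha)^{\beta}}{1 + (t/\alpha)^{\beta}}
= F(t),
\qquad \beta = \beta_1, \quad \alpha = \exp(-\beta_0 / \beta_1)
$$

If $Y = 1$ instead marks subjects for whom the event has *not* yet occurred, so that $\beta_1 < 0$, the same algebra with $\beta = -\beta_1$ and $\alpha = \exp(-\beta_0 / \beta_1)$ gives the log-logistic survival function $S(t) = 1 / (1 + (t/\alpha)^{\beta})$.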

R demonstration that the fits are equivalent:

```
> library(icenReg)
> data(miceData)
> 
> ## miceData contains current status data about presence 
> ## of tumors at sacrifice in two groups
> ## in interval censored format: 
> ## l = lower end of interval, u = upper end
> ## first three mice all left censored
> 
> head(miceData, 3)
  l   u grp
1 0 381  ce
2 0 477  ce
3 0 485  ce
> 
> ## To fit this with logistic regression, 
> ## we need to extract age at sacrifice
> ## if the observation is left censored, 
> ## this is the upper end of the interval
> ## if right censored, is the lower end of interval
> 
> age <- numeric()
> isLeftCensored <- miceData$l == 0
> age[isLeftCensored] <- miceData$u[isLeftCensored]
> age[!isLeftCensored] <- miceData$l[!isLeftCensored]
> 
> log_age <- log(age)
> resp <- !isLeftCensored
> 
> 
> ## Fitting logistic regression model
> logReg_fit <- glm(resp ~ log_age + grp, 
+                     data = miceData, family = binomial)
> 
> ## Fitting proportional odds regression model with log-logistic baseline
> ## interval censored model
> ic_fit <- ic_par(cbind(l,u) ~ grp, 
+            model = 'po', dist = 'loglogistic', data = miceData)
> 
> summary(logReg_fit)

Call:
glm(formula = resp ~ log_age + grp, family = binomial, data = miceData)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1413  -0.8052   0.5712   0.8778   1.8767  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)  18.3526     6.7149   2.733  0.00627 **
log_age      -2.7203     1.0414  -2.612  0.00900 **
grpge        -1.1721     0.4713  -2.487  0.01288 * 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 196.84  on 143  degrees of freedom
Residual deviance: 160.61  on 141  degrees of freedom
AIC: 166.61

Number of Fisher Scoring iterations: 5

> summary(ic_fit)

Model:  Proportional Odds
Baseline:  loglogistic 
Call: ic_par(formula = cbind(l, u) ~ grp, data = miceData, model = "po", 
    dist = "loglogistic")

          Estimate Exp(Est) Std.Error z-value        p
log_alpha    6.603 737.2000   0.07747  85.240 0.000000
log_beta     1.001   2.7200   0.38280   2.614 0.008943
grpge       -1.172   0.3097   0.47130  -2.487 0.012880

final llk =  -80.30575 
Iterations =  10 
> 
> ## Comparing loglikelihoods
> logReg_fit$deviance/(-2) - ic_fit$llk
[1] 2.643219e-12
```

Note that the effect of `grp` is the same in each model, and the final log-likelihood differs only by numeric error. The baseline parameters (i.e. intercept and log_age for logistic regression, alpha and beta for the interval censored model) are different parameterizations so they are not equal.

So there you have it: using logistic regression is equivalent to fitting the proportional odds with a log-logistic baseline distribution. If you're okay with fitting this parametric model, logistic regression is quite reasonable. I do caution that with interval censored data, semi-parametric models are typically favored due to difficulty of assessing model fit, *but* if I truly thought there was no place for fully-parametric models I would have not included them in `icenReg`.
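Applied to the sample dataset in the question, the logistic-regression route might look like the base R sketch below. It assumes `offsp == "B"` is coded as the response, so the model describes the probability of having transitioned by a given age (the CDF rather than the survival function). This flips the sign of every coefficient relative to a fit that codes "not yet transitioned" as 1. The variable names and the median-age calculation are illustrative only, and the medians are meaningful only if the fitted `log(age)` coefficient turns out positive.

```r
# Sample dataset from the question (50 individuals)
age <- c(21, 20, 26, 33, 33, 24, 34, 22, 10, 20,
         44, 18, 11, 27, 31, 14, 41, 15, 33, 24,
         11, 28, 22, 16, 16, 24, 20, 18, 21, 16,
         24, 39, 13, 10, 18, 16, 21, 26, 11, 40,
          8, 41, 29, 53, 34, 34, 15, 40, 30, 40)
ssp <- strsplit(paste0("YYXXXXYYYY", "XYYXXYXYXX", "YXXYYYYXYY",
                       "YXYYYYXXYX", "YXXXXXYXXX"), "")[[1]]
offsp <- strsplit(paste0("ABBBABBBBA", "BABABBBABB", "AABABBBBBB",
                         "ABABAAABAB", "ABBBBBABAB"), "")[[1]]

dat <- data.frame(age = age,
                  ssp = factor(ssp),               # baseline level is "X"
                  transitioned = offsp == "B")     # TRUE = left censored

# Logistic regression with log(age) as a covariate, i.e. a proportional
# odds model with a log-logistic baseline
fit <- glm(transitioned ~ log(age) + ssp, data = dat, family = binomial)
print(summary(fit)$coefficients)

# Implied median age at transition per subspecies: solve logit F(t) = 0 for t
# (assumes the log(age) coefficient is positive)
b <- coef(fit)
median_X <- exp(-b[["(Intercept)"]] / b[["log(age)"]])
median_Y <- exp(-(b[["(Intercept)"]] + b[["sspY"]]) / b[["log(age)"]])
print(c(X = median_X, Y = median_Y))
```

The `sspY` coefficient is the log odds ratio of having transitioned by any given age for subspecies $Y$ relative to $X$. With only 50 observations its standard error will be wide, so treat the output as a demonstration of the mechanics rather than a result.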
