Solved – What data structure is necessary for survival analysis

I'm relatively new to survival analysis and am trying to get my data into the right shape.

I have two tables, both concerning the observed individuals. If I used only one of the tables, I would have continuous information on each individual without any overlapping periods.

However, I also need the information stored in the other table, so I have to merge the two. After merging, the episodes overlap in some cases.

Here is an example as illustration:

Table 1:

``ID: 1 start: 2000-01-01 end: 2002-12-31 state: A ``

Table 2:

``ID: 1 start: 2002-01-01 end: 2002-04-15 state: Z ``

For survival analysis (in Stata or R), does it matter if there are overlaps?

If it does, do you have any suggestions on how to remove the overlaps?


Assuming that by "parametric model" the OP means fully parametric, this sounds like a question about the appropriate data structure for discrete time survival analysis (aka discrete time event history) models such as logit (1), probit (2), or complementary log-log (3) hazard models. For these models, the data typically need to be structured in a person-period format.

1. $$h_{t} = \frac{e^{\mathbf{BX}}}{1 + e^{\mathbf{BX}}}$$
2. $$h_{t} = \Phi(\mathbf{BX})$$
3. $$h_{t} = 1 - e^{-e^{\mathbf{BX}}}$$

where $$\mathbf{BX}$$ is the linear predictor, i.e. the model's parameters multiplied by its predictors. Discrete time survival analysis models often include dummy variables for each time period (see below), and often also include time period itself, or functions of it, as a variable.
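To make the three link functions concrete, here is a minimal Python sketch that evaluates equations (1) to (3) for a single value of the linear predictor. The coefficient and predictor values are made up purely for illustration; in practice the coefficients would come from a fitted model.

```python
import math

def logit_hazard(eta: float) -> float:
    """Equation (1): h = e^eta / (1 + e^eta)."""
    return 1.0 / (1.0 + math.exp(-eta))

def probit_hazard(eta: float) -> float:
    """Equation (2): h = Phi(eta), the standard normal CDF."""
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))

def cloglog_hazard(eta: float) -> float:
    """Equation (3): h = 1 - exp(-exp(eta))."""
    return 1.0 - math.exp(-math.exp(eta))

# Hypothetical coefficients and predictor values, for illustration only.
coefficients = [-2.0, 0.5, 0.1]   # intercept, x1, x2
predictors = [1.0, 1.0, 3.0]      # the leading 1.0 multiplies the intercept
eta = sum(b * x for b, x in zip(coefficients, predictors))

hazards = {
    "logit": logit_hazard(eta),
    "probit": probit_hazard(eta),
    "cloglog": cloglog_hazard(eta),
}
for name, h in hazards.items():
    print(f"{name}: {h:.4f}")
```

All three functions map any real-valued linear predictor to a hazard strictly between 0 and 1, which is why they are usable as discrete time hazard models.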

Here's what a person-period data set looks like:

```
ID period y x1 x2 x3 t1 t2 t3 . . . tT
1  1      0 1  3  12 1  0  0  . . . 0
1  2      0 1  0  12 0  1  0  . . . 0
1  3      1 1  9  12 0  0  1  . . . 0
2  1      0 0  4  6  1  0  0  . . . 0
3  1      0 1  0  17 1  0  0  . . . 0
3  2      0 1  3  17 0  1  0  . . . 0
3  3      0 1  3  17 0  0  1  . . . 0
etc.
```

First, notice `ID` and `period`, which together define the hierarchical structure of these data: periods of observation nested within persons. Also notice that `x2` is time varying (i.e. within the same individual it can take different values in different periods), and that `x1` and `x3` are static; the model itself is agnostic as to whether predictors are time-varying or static. Finally, examine the relationship between `period` and the indicator variables for time/period (i.e. `t1` through `tT`): each row has a 1 in the indicator matching its period and 0 in all the others.
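The link between `period` and the indicators can be sketched in a few lines of Python. The function name `period_indicators` is hypothetical, and `T = 3` is assumed only to keep the output short.

```python
def period_indicators(period: int, n_periods: int) -> list[int]:
    """Return [t1, ..., tT]: 1 in the slot matching `period`, 0 elsewhere."""
    if not 1 <= period <= n_periods:
        raise ValueError("period must lie between 1 and n_periods")
    return [1 if j == period else 0 for j in range(1, n_periods + 1)]

# Rows for ID 1 from the example table, as (ID, period, y); T = 3 is assumed.
rows = [(1, 1, 0), (1, 2, 0), (1, 3, 1)]
for person_id, period, y in rows:
    print(person_id, period, y, period_indicators(period, n_periods=3))
```

Each row gets exactly one indicator set to 1, so the indicators let the model estimate a separate baseline hazard for every period.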

Often you will receive data in a person-time format (one row per person) such as this:

``ID TimeToEvent Censored x1 x2t1 x2t2 . . . x2tT x3``,

and will need to transform the data appropriately. Here `TimeToEvent` measures how many periods each subject was observed while in the study, and `Censored` indicates whether or not the subject left the study without experiencing the event (i.e. whether that subject was right censored). In your data `TimeToEvent` probably equals `end` minus `start`, and `Censored` is certainly some function of `state`.
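As a rough illustration of that transformation, the following Python sketch expands one-row-per-person records into person-period rows. The dictionary layout (with `x2` held as a list of per-period values) and the function name `to_person_period` are assumptions made for this example, not a standard format; the two sample records are chosen to reproduce IDs 1 and 2 of the person-period table above.

```python
def to_person_period(person: dict, n_periods: int) -> list[dict]:
    """Expand one wide-format record into one row per observed period."""
    rows = []
    for period in range(1, person["TimeToEvent"] + 1):
        is_last = period == person["TimeToEvent"]
        row = {
            "ID": person["ID"],
            "period": period,
            # The event can occur only in the last observed period, and only
            # if the subject was not censored.
            "y": int(is_last and not person["Censored"]),
            "x1": person["x1"],               # static predictor
            "x2": person["x2"][period - 1],   # time-varying predictor
            "x3": person["x3"],               # static predictor
        }
        # Period indicators t1..tT: 1 for the current period, 0 otherwise.
        for j in range(1, n_periods + 1):
            row[f"t{j}"] = int(j == period)
        rows.append(row)
    return rows

# Hypothetical wide records matching IDs 1 and 2 of the table above.
wide = [
    {"ID": 1, "TimeToEvent": 3, "Censored": 0, "x1": 1, "x2": [3, 0, 9], "x3": 12},
    {"ID": 2, "TimeToEvent": 1, "Censored": 1, "x1": 0, "x2": [4], "x3": 6},
]
person_period = [row for p in wide for row in to_person_period(p, n_periods=3)]
for row in person_period:
    print(row)
```

The outcome `y` stays 0 in every period except possibly the last one, which is how right censoring is carried into the person-period layout without a separate censoring column.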

There are often tools available for transforming data such as these. For example, in Stata, see `net describe dthaz, from(http://alexisdinno.com/stata)`
